AI Text Summarizer
100% private AI text summarizer that runs on-device in your browser. No upload, no sign-up, no API key. Summarize long articles and documents offline.
About AI Text Summarizer
Our AI Text Summarizer runs a real machine-learning model entirely inside your browser. The abstractive mode loads Xenova/distilbart-cnn-6-6 (an ONNX, INT8-quantized BART distillation) via Transformers.js and generates the summary on your own device using WebGPU, with an automatic WebAssembly fallback. Your text is never uploaded to any server, there is no API key, and no account is required.
The extractive mode is a fast, fully local heuristic that scores each sentence by position, length, and keyword presence and returns the highest-ranking sentences unchanged. The abstractive mode instead reads the whole text and rewrites it in new words, like a person would.
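As a rough illustration of the extractive heuristic described above, the sketch below scores sentences by position, length, and keyword presence and returns the top ones verbatim. The weights, the stop-word list, the naive sentence splitter, and the choice to return the picks in their original order are all assumptions made for this example, not the tool's actual values.

```typescript
// Hypothetical extractive summarizer: all weights and thresholds are illustrative.
const STOP_WORDS = ["the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
  "on", "for", "that", "this", "with", "as", "are", "was"];

// Naive splitter: a sentence ends at one or more of . ! ?
function splitSentences(text: string): string[] {
  const parts = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

function toWords(s: string): string[] {
  return s.toLowerCase().match(/[a-z0-9']+/g) || [];
}

function summarizeExtractive(text: string, maxSentences: number): string[] {
  const sentences = splitSentences(text);

  // Count how often each non-stop word appears in the whole document.
  const freq = new Map<string, number>();
  let maxFreq = 1;
  for (const s of sentences) {
    for (const w of toWords(s)) {
      if (STOP_WORDS.indexOf(w) !== -1) continue;
      const n = (freq.get(w) || 0) + 1;
      freq.set(w, n);
      if (n > maxFreq) maxFreq = n;
    }
  }

  const scored = sentences.map((sentence, index) => {
    const ws = toWords(sentence);
    // Position: earlier sentences score higher.
    const position = 1 - index / sentences.length;
    // Length: prefer mid-length sentences over very short or very long ones.
    const length = ws.length >= 8 && ws.length <= 30 ? 1 : 0.5;
    // Keywords: mean normalized document frequency (stop words contribute 0).
    let keywordSum = 0;
    for (const w of ws) keywordSum += (freq.get(w) || 0) / maxFreq;
    const keywords = ws.length > 0 ? keywordSum / ws.length : 0;
    return { sentence, index, score: 0.3 * position + 0.2 * length + 0.5 * keywords };
  });

  return scored
    .sort((a, b) => b.score - a.score) // highest score first
    .slice(0, maxSentences)
    .sort((a, b) => a.index - b.index) // back to document order
    .map((x) => x.sentence);           // sentences are returned unchanged
}
```

Because every returned string is copied straight from the input, this kind of summary cannot introduce facts that the source does not contain.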
Because inference is on-device, the tool is suitable for confidential drafts, legal briefs, medical notes, and internal reports you cannot legally paste into a cloud API. See also our AI Grammar and Paraphraser and AI Keyword Extractor.
Does the AI summarizer run entirely in my browser?
Yes. The abstractive model is downloaded once via Transformers.js and then runs locally inside your browser tab using WebGPU or WebAssembly. After the initial model download (cached by the browser in IndexedDB for future visits), there is no network round-trip per summary — every token of your input text and every word of the generated summary stays on your device. We never see your documents, and no server-side log is created. This makes the tool safe for confidential drafts, internal reports, legal briefs, medical notes, or any text you cannot legally upload to a third-party API like OpenAI or Anthropic. The tradeoff is the initial model download (about 60 MB for the default INT8 model).
Which model powers the abstractive summaries?
The default abstractive model is Xenova/distilbart-cnn-6-6 — an ONNX, INT8-quantized distillation of Facebook's BART-large-CNN. BART is an encoder-decoder transformer: a bidirectional encoder reads the full source, then a left-to-right decoder generates an abstractive rewrite. The 6-6 DistilBART variant keeps close to BART-large quality on news-style text while being roughly 60 MB and several times faster to load and run, which matters a lot for in-browser inference. It is loaded and executed through Transformers.js (ONNX Runtime Web).
What text formats and lengths are supported?
You can paste plain text, Markdown, or content copied from PDF, Word, web articles, or email. The model accepts UTF-8 text and works best on English (the CNN/DailyMail training domain). Practical input length per pass is bounded by the model's context window: about 1024 tokens, roughly 700 English words. For longer documents the tool chunks the input into overlapping ~700-word windows, summarizes each chunk, then summarizes the concatenated chunk summaries (recursive/hierarchical summarization). Very long inputs (more than 20 pages) may take 30-60 seconds.
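The chunk-then-recurse strategy can be sketched as follows. This is a minimal illustration under stated assumptions: `summarizeChunk` is a hypothetical stand-in for the model call (the real call would be asynchronous), the 50-word overlap is a guess since only the ~700-word window size is given, and words are split on whitespace rather than counted as model tokens.

```typescript
// Hypothetical stand-in type for the model call; the real one would be async.
type Summarize = (text: string) => string;

// Split text into overlapping windows of `windowSize` words.
function chunkWords(text: string, windowSize: number, overlap: number): string[] {
  if (overlap >= windowSize) throw new Error("overlap must be smaller than windowSize");
  const tokens = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = windowSize - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + windowSize).join(" "));
    if (start + windowSize >= tokens.length) break; // this window reached the end
  }
  return chunks;
}

// Summarize each chunk, then summarize the joined chunk summaries, recursively.
function summarizeLong(
  text: string,
  summarizeChunk: Summarize,
  windowSize: number = 700, // from the text above
  overlap: number = 50      // assumed value
): string {
  const chunks = chunkWords(text, windowSize, overlap);
  if (chunks.length === 0) return "";
  if (chunks.length === 1) return summarizeChunk(chunks[0]);

  const combined = chunks.map((c) => summarizeChunk(c)).join(" ");
  // Guard: if summarizing did not shrink the text, stop instead of looping forever.
  if (combined.length >= text.length) return combined;
  return summarizeLong(combined, summarizeChunk, windowSize, overlap);
}
```

The overlap means a sentence that straddles a window boundary appears whole in at least one chunk, at the cost of summarizing a few words twice.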

Why does the first summary take so long but the next one is fast?
The first run has to download the model weights (about 60 MB for the default INT8 model), parse them, build the compute graph, and compile the kernels for your CPU or GPU. This cold start can take 10-40 seconds on a typical desktop and longer on mobile. Once loaded, the weights live in browser memory and in the IndexedDB cache, so subsequent summaries reuse the same compiled model and complete in 1-5 seconds for short passages. If you close the tab the in-memory weights are released, but the IndexedDB cache survives, so the next visit only needs to recompile, not re-download.
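One common way to get this cold-start-once behavior is to memoize the loading promise so that every summary after the first reuses the same in-memory model. The sketch below shows the pattern only; `load` is a hypothetical function standing in for whatever downloads, parses, and compiles the model, and nothing here is taken from the tool's actual source.

```typescript
// Wrap an expensive async loader so it runs at most once per page session.
function createModelLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (cached === null) {
      // First call: start the slow download/compile step and remember the promise.
      cached = load().catch((err) => {
        cached = null; // forget a failed attempt so a later call can retry
        throw err;
      });
    }
    // Later calls: return the same promise, so no second download or compile.
    return cached;
  };
}
```

Calling the returned function twice yields the same promise, so concurrent requests made while the model is still loading also share a single download.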
How accurate are AI summaries, and can they hallucinate?
DistilBART-CNN reaches ROUGE-L scores in roughly the high-30s to low-40s on the CNN/DailyMail benchmark, competitive with non-expert human summarizers on news-style content and a few points below the full BART-large it was distilled from. Quality drops on highly technical, domain-specific, or narrative texts the model was not trained on. Like all abstractive models, it can hallucinate (introduce facts that are not in the source), so always verify numbers, names, and quotes against the original before publishing. For exact-fidelity needs, use the extractive mode, which only selects sentences from your own text and returns them unchanged.
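For readers unfamiliar with the metric, ROUGE-L measures the longest common subsequence (LCS) of words shared by a generated summary and a reference summary. The sketch below is a simplified single-pair F1 version with a naive lowercase tokenizer; published benchmark figures use their own tokenization and aggregation and are reported multiplied by 100, so treat this as an illustration of the idea only.

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Length of the longest common subsequence of two word lists (row-by-row DP).
function lcsLength(a: string[], b: string[]): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(0);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= b.length; j++) {
      curr.push(a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Simplified ROUGE-L: precision and recall of the LCS, combined as F1.
function rougeL(candidate: string, reference: string) {
  const c = tokenize(candidate);
  const r = tokenize(reference);
  const lcs = lcsLength(c, r);
  const precision = c.length > 0 ? lcs / c.length : 0;
  const recall = r.length > 0 ? lcs / r.length : 0;
  const f1 = precision + recall > 0
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  return { precision, recall, f1 };
}
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", the LCS is the five words "the cat on the mat", so precision and recall are both 5/6 and the F1 is about 0.83. A high score only shows word overlap with a reference; it does not rule out a hallucinated name or number.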
Is WebGPU faster than WebAssembly for summarization?
Yes, often dramatically so. WebGPU offloads the matrix multiplications that dominate transformer inference to your discrete or integrated GPU, which can be several times faster than the SIMD-WebAssembly CPU backend. This tool tries WebGPU first and shows a 'Running on WebGPU' badge when it succeeds. WebGPU needs a recent browser (Chrome 113+, Edge, Safari 18+, recent Firefox) and a compatible GPU driver. If WebGPU is unavailable, the tool automatically falls back to WebAssembly with SIMD and multithreading (the 'Running on CPU (WASM)' badge), which is slower but works on every modern browser and is still fully on-device.
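A try-WebGPU-first check with a WebAssembly fallback might look like the sketch below. It relies on the standard `navigator.gpu.requestAdapter()` call, which resolves to `null` when no suitable adapter exists. The `NavigatorLike` interface and the function names are assumptions introduced so the logic can run outside a browser; the badge strings are the ones quoted above.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal structural types (assumed for this sketch) covering only what is used.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Prefer WebGPU when an adapter is actually available; otherwise use WASM.
async function pickBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) return "wasm"; // browser has no WebGPU support
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no compatible GPU or driver
  } catch (err) {
    return "wasm"; // any failure while probing falls back to the CPU path
  }
}

function badgeFor(backend: Backend): string {
  return backend === "webgpu" ? "Running on WebGPU" : "Running on CPU (WASM)";
}
```

In a browser this would be called with the global `navigator` object (cast to `NavigatorLike` if the WebGPU type definitions are not installed). Checking for an adapter, rather than only for the presence of `navigator.gpu`, matters because a browser can expose the API on a machine whose GPU driver is not usable.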
What is INT8 quantization and does it hurt quality?
Quantization stores each model weight as an 8-bit integer (256 possible values) instead of a 32-bit float. It cuts the download size by about 4x and speeds up CPU inference by 2-4x because INT8 arithmetic uses fewer cycles and packs more values per SIMD register. For summarization, INT8 typically costs only 1-3 ROUGE points versus FP32 — usually invisible in the output. That is why we ship the INT8 ONNX build of distilbart-cnn-6-6 by default: about 60 MB to download, fast to run, and easy to cache for repeat use. ONNX Runtime Web handles INT8 dequantization on the fly.
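To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization and the matching dequantization step. It is a simplification for illustration: production ONNX quantizers may use unsigned integers, zero points, or per-channel scales, and this symmetric variant uses only the range -127 to 127. The function names are hypothetical.

```typescript
interface QuantizedTensor {
  values: Int8Array; // one byte per weight
  scale: number;     // one float shared by the whole tensor
}

// Map floats to 8-bit integers using a single scale factor.
function quantizeInt8(weights: Float32Array): QuantizedTensor {
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  const scale = maxAbs > 0 ? maxAbs / 127 : 1;
  const values = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    const q = Math.round(weights[i] / scale);
    values[i] = Math.max(-127, Math.min(127, q)); // clamp to the INT8 range used here
  }
  return { values, scale };
}

// Recover approximate floats: each value is off by at most about scale / 2.
function dequantizeInt8(q: QuantizedTensor): Float32Array {
  const out = new Float32Array(q.values.length);
  for (let i = 0; i < q.values.length; i++) {
    out[i] = q.values[i] * q.scale;
  }
  return out;
}
```

A `Float32Array` of N weights occupies 4N bytes while the `Int8Array` occupies N bytes, which is where the roughly 4x size reduction comes from. The rounding error per weight is bounded by about half the scale, which is why the quality loss tends to be small.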
