Prompt Token Budget Planner
Estimate token usage for Claude, GPT, Gemini, Llama and split long prompts into model-safe chunks with overlap. Free, offline, no API key needed.
About the Prompt Token Budget Planner
Long prompts silently get truncated, retrieval-augmented generation pipelines fail mid-stream, and model bills explode — almost always because the prompt budget was never planned. This planner gives you a no-API-key, browser-only estimate of how many tokens your text will consume on Claude Opus/Sonnet, Claude 1M, GPT-4o, GPT-5/o3, Gemini 2.x, Llama 3.1, Mistral Large or any custom limit, then splits the text into model-safe chunks with a configurable overlap window so adjacent chunks share context.
Reserve tokens for the response, system prompt and tool schemas, then choose paragraph, sentence or hard character splitting. Each chunk can be copied individually for use in retrieval pipelines, batch summarization, or sequential conversations.
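As a rough illustration of the three splitting modes, here is a minimal Python sketch. The function name `split_units`, the regex rules and the `hard_size` parameter are assumptions made for this example; the planner's actual boundary detection may differ.

```python
import re

def split_units(text: str, mode: str, hard_size: int = 1000) -> list[str]:
    """Break text into units that a chunker can later pack into chunks.

    Assumed rules (illustrative only):
    - "paragraph": split on one or more blank lines
    - "sentence":  split after ., ! or ? followed by whitespace
    - "char":      fixed-size slices of hard_size characters
    """
    if mode == "char":
        # Hard splitting ignores linguistic boundaries entirely.
        return [text[i:i + hard_size] for i in range(0, len(text), hard_size)]
    if mode == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif mode == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Drop empty fragments left by leading or trailing separators.
    return [p.strip() for p in parts if p.strip()]
```

Paragraph and sentence modes keep units readable at the cost of uneven sizes; hard character splitting gives predictable sizes but can cut a word or sentence in half.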
How accurate is the token estimate compared to the real tokenizer from Anthropic, OpenAI or Google?
Our estimate is a planning heuristic, typically within 5-15% of the true count for English prose, 10-20% for source code and 15-25% for CJK languages. We deliberately do not load tokenizer libraries (tiktoken, anthropic-tokenizer, gemini tokenizer) because they add 2-15 MB of WASM and require server calls for some models. The estimator uses the well-documented rule of thumb that 1 token ≈ 4 English characters ≈ 0.75 words, then refines per text type: CJK characters mostly map to one token each, with common two-character compounds sometimes sharing a token (so we use ~1.5 chars/token), code is more punctuation-heavy (~3.5 chars/token), and mixed/markdown blends the prose and code ratios. For billing-exact counts always use the model provider's tokenizer endpoint; for planning questions such as whether a 240k-token doc fits in a 200k window, the estimate is accurate enough to give the right answer.
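The heuristic can be sketched in a few lines of Python. The ratios for English, code and CJK come from the answer above; the "mixed" value is assumed here to be the midpoint of the prose and code ratios, and the names `CHARS_PER_TOKEN` and `estimate_tokens` are hypothetical.

```python
import math

# Characters per token by text type. "mixed" is an assumption: the midpoint
# of the English (4.0) and code (3.5) ratios.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "code": 3.5,
    "cjk": 1.5,
    "mixed": 3.75,
}

def estimate_tokens(text: str, text_type: str = "english") -> int:
    """Estimate the token count from character length alone.

    This is a planning heuristic, not a tokenizer: it never inspects the
    actual vocabulary of any model.
    """
    ratio = CHARS_PER_TOKEN[text_type]
    # Round up so the estimate errs on the side of using more budget.
    return math.ceil(len(text) / ratio)
```

For instance, a 400-character English string yields an estimate of 100 tokens. Rounding up is a deliberate choice in this sketch: overestimating slightly is cheaper than overflowing the window.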
Why split with overlap instead of clean breaks at paragraph boundaries?
Without overlap, whatever ends chunk 1 is invisible when the model processes chunk 2: a question posed at the end of chunk 1 goes unanswered even if the answer sits in chunk 2's body, because the model reading chunk 2 never sees the question. A 5-15% overlap (we default to 10%) repeats the tail of chunk N as the head of chunk N+1, preserving coreference, ongoing arguments, mid-list items and table headers across boundaries. For pure retrieval RAG, 10% is usually right; for legal or scientific summarization where multi-paragraph reasoning is common, raise to 20-25%; for short-form chat or FAQ extraction where each chunk is independent, you can drop to 0%.
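A minimal sliding-window sketch of overlap splitting, written in Python for illustration. It works on any list of units (words, sentences or estimated tokens); the name `chunk_with_overlap` and its parameters are assumptions, not the planner's internals.

```python
def chunk_with_overlap(
    items: list[str], chunk_size: int, overlap_ratio: float = 0.10
) -> list[list[str]]:
    """Split items into chunks where each chunk repeats the tail of the last.

    overlap_ratio is the fraction of chunk_size shared by adjacent chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    # Each new chunk starts `step` items after the previous one.
    step = max(1, chunk_size - overlap)

    chunks: list[list[str]] = []
    start = 0
    while start < len(items):
        chunks.append(items[start:start + chunk_size])
        if start + chunk_size >= len(items):
            break  # the last chunk already reaches the end of the input
        start += step
    return chunks
```

With 25 items, a chunk size of 10 and a 20% overlap, this produces three chunks, and the first two items of each later chunk repeat the last two items of the one before it.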
What should I put in 'reserved for output' and why does it matter?
Every modern LLM API consumes input + output from the same context window. If your model has a 200,000-token context and you fill 199,000 with the prompt, the model can only generate 1,000 tokens before truncation — often mid-sentence. Reserve at least max_tokens (the value you'll pass on the API call) plus a safety buffer. Practical defaults: 4,096 for normal chat, 8,192 for long-form summarization, 16,384-32,768 for code generation, and 64,000+ for reasoning models like o3/o1 that consume large amounts of hidden thinking tokens internally before producing visible output. Claude's extended thinking and Gemini's thinking mode also consume reserved-output budget invisibly — bump the reservation by 30-50% if you enable those features.
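The reservation arithmetic can be sketched as follows. The 512-token safety buffer and the 40% default bump are assumptions chosen for this example (the bump sits inside the 30-50% range above); `output_reservation` is a hypothetical helper name.

```python
import math

def output_reservation(
    max_tokens: int,
    safety_buffer: int = 512,
    thinking: bool = False,
    thinking_bump: float = 0.4,
) -> int:
    """Tokens to set aside for the model's output.

    max_tokens:    the value you plan to pass on the API call
    safety_buffer: assumed extra headroom for estimation error
    thinking:      True if a hidden-reasoning feature is enabled
    thinking_bump: fractional increase applied when thinking is on
    """
    reserved = max_tokens + safety_buffer
    if thinking:
        reserved = math.ceil(reserved * (1 + thinking_bump))
    return reserved
```

With `max_tokens=4096` and no thinking, this reserves 4,608 tokens, leaving 195,392 of a 200,000-token window for input.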
Do system prompts, tool/function schemas and uploaded files count toward the context limit?
Yes: every input the API receives counts against the same total context window. A typical agent stack burns 2,000-8,000 tokens before user input arrives: a 500-2,000 token system prompt, plus tool/function schemas (each JSON schema is roughly 50-300 tokens, and agents commonly expose 5-20 tools, typically 1,000-6,000 tokens in total), plus any uploaded PDFs/images already converted to text. Set 'system prompt tokens' and 'tool tokens' honestly: if the planner shows you have 195k tokens available out of a 200k model, that's the real budget after agent overhead. Multimodal inputs are counted differently by each provider (by image size or tile, or by video duration) and are not included in this estimate.
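Putting the overhead fields together, the available budget is simple subtraction. The sketch below uses a hypothetical `Budget` dataclass whose field names mirror the planner's input labels; it is an illustration, not the tool's code.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    context_window: int            # total tokens the model accepts
    system_prompt_tokens: int = 0  # estimated size of the system prompt
    tool_tokens: int = 0           # estimated size of all tool schemas
    reserved_output: int = 0       # tokens set aside for the response

    def overhead(self) -> int:
        """Tokens consumed before any user text is added."""
        return self.system_prompt_tokens + self.tool_tokens + self.reserved_output

    def available_for_user_text(self) -> int:
        """Tokens left for the document or prompt itself (never negative)."""
        return max(0, self.context_window - self.overhead())
```

For example, `Budget(200_000, system_prompt_tokens=1_500, tool_tokens=3_500).available_for_user_text()` gives 195,000, matching the 195k-out-of-200k case described above.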

When should I pick Claude 1M vs Gemini 2.x 1M vs splitting into chunks on a smaller model?
1M-token models look magical but have three real costs: (1) latency: first token can take 30-90 seconds at 800k+ input; (2) price: input tokens are billed even when the model only attends to a fraction; and (3) recall degradation: most 1M models show measurable accuracy drops past ~400k tokens, especially for mid-document retrieval ('lost in the middle'). Decision rule: if your full document is under 150k tokens, use a standard 200k model (cheaper, faster, more reliable). For 150k-500k tokens with strong long-range reasoning needs, use a 1M model natively. For 500k+ tokens or repeatable production workloads, chunk and use a smaller model with retrieval. This planner shows you the chunk count for any combination. For example, a 1.5M-token corpus splits into at least 9 chunks on a 200k model with 10% overlap (each chunk advances by at most 180k tokens), and parallel processing on the smaller model usually beats 1M-context calls on cost and total wall-clock time.
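The decision rule and the chunk-count arithmetic can be sketched in Python. Both function names are hypothetical, and one branch is an assumption: the rule above does not say what to do with a 150k-500k document that lacks long-range reasoning needs, so this sketch falls back to chunking in that case.

```python
import math

def chunk_count(total_tokens: int, chunk_tokens: int, overlap_ratio: float = 0.10) -> int:
    """Number of chunks needed when adjacent chunks share an overlap."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_tokens:
        return 1
    overlap = int(chunk_tokens * overlap_ratio)
    step = max(1, chunk_tokens - overlap)  # new tokens covered per extra chunk
    # The first chunk covers chunk_tokens; each later chunk adds `step` more.
    return 1 + math.ceil((total_tokens - chunk_tokens) / step)

def pick_strategy(doc_tokens: int, needs_long_range_reasoning: bool = False) -> str:
    """Apply the thresholds from the decision rule above."""
    if doc_tokens < 150_000:
        return "standard 200k model"
    if doc_tokens <= 500_000 and needs_long_range_reasoning:
        return "1M model"
    # Assumption: everything else (including mid-size documents without
    # long-range needs) is chunked and served with retrieval.
    return "chunk + retrieval on a smaller model"
```

Note that `chunk_tokens` should be the usable chunk size after reservations, not the raw context window, so real chunk counts tend to be a little higher than a back-of-envelope division suggests.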
How do I use the chunks for retrieval-augmented generation (RAG) or batch summarization?
Three common patterns. (1) Map-reduce summarization: send each chunk separately with the same prompt ('summarize this section'), collect outputs, then send all summaries as a single second-pass prompt to merge. (2) Retrieval-augmented: embed each chunk with an embedding model (text-embedding-3-small, voyage-3, gemini-embedding), store in a vector DB (Qdrant, pgvector, Pinecone), retrieve top-K at query time. For embeddings, keep chunks at 200-800 tokens, much smaller than this planner's defaults, and set 'context window' to your embedding model's input limit (8192 for OpenAI, 32k for Voyage). (3) Sequential conversation: feed chunks one by one in a multi-turn dialogue, asking the model to remember key facts. Overlap matters most for patterns 1 and 3; for pattern 2 the retrieval system handles much of the continuity, so small chunks need little overlap and 0% can work.
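Pattern 1 can be sketched without committing to any provider SDK. In the Python example below, `call_model` is a placeholder for whatever function sends a prompt to your model and returns its text; the prompt wording and the `[Section N]` labels are assumptions for illustration.

```python
from typing import Callable

def map_reduce_summarize(
    chunks: list[str],
    call_model: Callable[[str], str],
    map_prompt: str = "Summarize this section:",
    reduce_prompt: str = "Merge these section summaries into one summary:",
) -> str:
    """Summarize each chunk independently, then merge the partial summaries."""
    # Map step: one model call per chunk, all with the same instruction.
    partials = [call_model(f"{map_prompt}\n\n{chunk}") for chunk in chunks]

    # Reduce step: a single second-pass call over the labelled summaries.
    merged_input = "\n\n".join(
        f"[Section {i}] {summary}" for i, summary in enumerate(partials, start=1)
    )
    return call_model(f"{reduce_prompt}\n\n{merged_input}")
```

The map calls are independent, so they can run in parallel. If the combined summaries exceed the context window, the reduce step itself has to be chunked, which this sketch does not handle.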
Why are CJK and code estimates different from English?
Tokenizers built on BPE or SentencePiece merge common substrings into single tokens. In English, 'the', 'and', 'tion' all become one token but rarer words split into 2-4 tokens, averaging roughly 4 characters per token. Chinese, Japanese and Korean text is mostly individual characters or 2-char compounds. Anthropic, OpenAI and Google tokenizers map most CJK characters to single tokens and merge some common compounds, giving ~1.5 chars/token (each CJK character also takes 3 bytes in UTF-8 versus 1 for ASCII, so the byte-to-char ratio shifts too). Source code is dense with single-char punctuation ({}, [], (), ;, :, .) and short identifiers that each consume a token, plus indentation whitespace, giving ~3.5 chars/token. Use the 'mixed' option for markdown documents, JSON config files, or technical writing that interleaves prose and code blocks.
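To make the script difference concrete, the Python sketch below measures the UTF-8 byte-to-character ratio and applies a per-character blend of the CJK and English ratios. The Unicode ranges cover only the most common CJK blocks, and `blended_estimate` is a hypothetical refinement, not necessarily how the planner classifies text.

```python
import math

def is_cjk(ch: str) -> bool:
    """True for characters in the most common CJK Unicode blocks."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= code <= 0xD7AF   # Hangul syllables
    )

def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character (1.0 for pure ASCII)."""
    return len(text.encode("utf-8")) / len(text) if text else 0.0

def blended_estimate(text: str) -> int:
    """Estimate tokens by applying ~1.5 chars/token to CJK characters
    and ~4 chars/token to everything else."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return math.ceil(cjk / 1.5 + other / 4.0)
```

Pure ASCII text has a ratio of 1.0 bytes per character, while a string made only of CJK ideographs has 3.0, which is one reason byte-based size checks mislead for CJK content.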
Does this tool work offline and is my prompt sent anywhere?
Fully offline after the page loads. All token estimation, splitting and overlap calculation runs in your browser via plain JavaScript: no API calls, no telemetry on your text, no server upload. You can disconnect the network after loading and the tool keeps working. We chose heuristic estimation specifically so we never need to send your prompt to a remote tokenizer service. For sensitive content (legal contracts, medical records, source code, internal docs), this tool is safe to use. The only data leaving your browser is standard page-view analytics, which never includes your prompt and which you can disable via the site privacy controls.
