AI Context Bundler

Bypass the context window: split long code, docs or transcripts into model-aware numbered chunks for Claude, GPT, Gemini or Llama, then download all at once.

What is the AI Context Bundler?

When you need an LLM to reason over a long document, codebase or transcript that exceeds your chat window, you have two choices: upgrade your model or split the input into context-aware chunks and feed them sequentially. This tool does the second — fast, free and in your browser. Paste or upload text, pick your target model (Claude, GPT-4o, GPT-5, Gemini, Llama or a custom limit), and the bundler emits numbered '## Chunk i of N' blocks sized to fit comfortably under the model's context window. You can choose smart paragraph-aware splitting, hard character cuts, or anything in between, plus configure overlap so successive chunks share context.
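As a rough illustration of the numbered output described above, the following Python sketch wraps already-split chunks in '## Chunk i of N' headers. The function name `add_headers` and the blank line after each header are assumptions made for this example, not the tool's actual code.

```python
def add_headers(chunks: list[str]) -> list[str]:
    """Prefix each chunk with a '## Chunk i of N' markdown header."""
    total = len(chunks)
    return [
        # Assumed layout: header, blank line, then the chunk body.
        f"## Chunk {i} of {total}\n\n{body}"
        for i, body in enumerate(chunks, start=1)
    ]


if __name__ == "__main__":
    for block in add_headers(["first part", "second part"]):
        print(block)
        print("---")
```

Numbering starts at 1 so the header matches what a reader would count by hand.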

Key Features

Presets for Claude (200K & 1M), GPT-4o (128K), GPT-5 (256K), Gemini 2.5 (2M), Llama 3.3 (128K)
Custom token limit for any other model or local Llama/Mistral deployment
Smart splitter that respects markdown headings, then paragraphs, then lines, before falling back to hard cuts
Configurable overlap (0-50%) so consecutive chunks share trailing context — improves coherence in summarization tasks
Live token estimate (~3.7 chars/token, close to OpenAI's documented rule of thumb of about 4 characters per token; typically within about ±10% for English and typical code)
Input cost preview using current public per-million-token pricing
One-click copy per chunk with auto-generated '## Chunk i of N' markdown header
Export all chunks at once: combined .md (with preamble) or one .txt per chunk (chunk-01.txt…) for scripts and pipelines
Load up to 50MB from a local file — txt, md, json, csv, log, html, css, js, ts, py, go and more

How to Use

Paste your long text into the source box (or click Load File to upload from disk)
Pick the target model — chunk size defaults to 25% of the model's max context
Adjust chunk size if you want smaller, more focused prompts (smaller chunks = more turns but better recall)
Set overlap to 0% for code (overlap can confuse the model on structured input), 5-10% for technical docs, and 15-25% for prose and transcripts
Pick a split strategy: Smart suits most inputs; use Lines for log files, Paragraphs for prose
Click Bundle Into Chunks, then copy each one in order and paste into your model with brief context
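The sizing in the steps above can be sketched as simple arithmetic. This is a minimal illustration that assumes the 25% default and the ~3.7 chars/token heuristic quoted on this page; the function name `plan_chunks` and the returned keys are hypothetical, and overlap is ignored.

```python
import math

CHARS_PER_TOKEN = 3.7    # heuristic quoted on this page
DEFAULT_FRACTION = 0.25  # default chunk size as a share of the model's max context


def plan_chunks(text_chars: int, context_tokens: int,
                fraction: float = DEFAULT_FRACTION) -> dict:
    """Estimate the per-chunk token budget and the number of chunks (no overlap)."""
    chunk_tokens = int(context_tokens * fraction)
    est_tokens = math.ceil(text_chars / CHARS_PER_TOKEN)
    n_chunks = max(1, math.ceil(est_tokens / chunk_tokens))
    return {
        "chunk_tokens": chunk_tokens,
        "estimated_tokens": est_tokens,
        "chunks": n_chunks,
    }


if __name__ == "__main__":
    print(plan_chunks(text_chars=1_000_000, context_tokens=128_000))
```

Under these assumptions, a 1,000,000-character input against a 128K preset works out to a 32,000-token budget per chunk and 9 chunks. Lowering `fraction` gives smaller chunks and more turns.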

Frequently Asked Questions

Each model uses a different tokenizer: GPT-4 and GPT-3.5 Turbo use cl100k_base, GPT-4o and GPT-5 use o200k_base, earlier completion models such as text-davinci-003 used p50k_base, Claude uses Anthropic's proprietary tokenizer (similar to but not identical to GPT-4's), Gemini uses SentencePiece, and Llama 3 uses its own 128K vocab. Running every tokenizer client-side would mean shipping 5+ MB of WebAssembly. The ~3.7 chars/token heuristic sits close to the rule of thumb in OpenAI's docs (about 4 characters per token for English) and is typically within about ±10% for English text and typical code, which is good enough for chunk-size planning where you usually leave 10-20% headroom anyway. For exact counts before billing, use OpenAI's tiktoken or Anthropic's token-counting API.
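To make the headroom reasoning concrete, here is a small Python sketch that turns the heuristic into a low/high token range and checks the worst case against a limit. The ±10% tolerance is the figure quoted above; the names `token_range` and `fits_worst_case` are hypothetical.

```python
import math


def token_range(text: str, chars_per_token: float = 3.7,
                tolerance: float = 0.10) -> tuple[int, int]:
    """Return (low, high) token estimates for a heuristic with the given tolerance."""
    estimate = len(text) / chars_per_token
    low = math.floor(estimate * (1 - tolerance))
    high = math.ceil(estimate * (1 + tolerance))
    return low, high


def fits_worst_case(text: str, limit_tokens: int, **kwargs) -> bool:
    """True if even the high end of the estimate stays within the limit."""
    _, high = token_range(text, **kwargs)
    return high <= limit_tokens


if __name__ == "__main__":
    sample = "word " * 2000  # 10,000 characters of filler text
    print(token_range(sample), fits_worst_case(sample, limit_tokens=4000))
```

Planning against the high end of the range is one way to apply the 10-20% headroom mentioned above without running a real tokenizer.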

Rule of thumb: 0% for code or structured data (XML, JSON, CSV), 5-10% for technical docs, 15-25% for prose, transcripts and meeting notes. Overlap helps the model maintain continuity across chunk boundaries — without it, a sentence that gets cut mid-thought may confuse the model. But too much overlap costs tokens AND tells the model contradictory things if it sees the same passage twice with different surrounding context. 10% is a sane default for general use.
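A minimal sketch of how percentage overlap can work, using hard character cuts for simplicity. This is an illustration under stated assumptions, not the tool's implementation: the function name `split_with_overlap` is hypothetical, and the real splitter may measure overlap in tokens or align it to paragraph boundaries.

```python
def split_with_overlap(text: str, chunk_chars: int, overlap: float = 0.10) -> list[str]:
    """Cut text into chunks of at most chunk_chars, each repeating the tail of the previous one."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    if not 0 <= overlap <= 0.5:
        raise ValueError("overlap must be between 0 and 0.5")

    overlap_chars = round(chunk_chars * overlap)
    step = max(1, chunk_chars - overlap_chars)  # how far the window advances each time

    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(split_with_overlap("abcdefghijklmnopqrst", chunk_chars=10, overlap=0.2))
```

With 20% overlap on 10-character chunks, each chunk starts with the last 2 characters of the one before it, which is the shared trailing context described above. The token cost grows accordingly, since the repeated characters are sent twice.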

Best practice: send a 'system' message first describing what's coming, then chunks in order. Example: 'I'm going to send you a long codebase split into 8 chunks. Read each one and only respond with OK after each. When I say DONE, summarize the architecture.' Then paste each chunk verbatim (the ## Chunk i of N header tells the model where it is). After the last chunk, send your actual question. This avoids the model summarizing each chunk individually and losing the bigger picture.
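The send order described above can be sketched as a list of messages to paste one at a time. The function name `build_message_sequence` is hypothetical, and the wording of the opening and closing messages simply follows the example in the answer.

```python
def build_message_sequence(chunks: list[str], task: str, ack: str = "OK") -> list[str]:
    """Return the messages to send in order: intro, each labelled chunk, then the task."""
    total = len(chunks)
    intro = (
        f"I'm going to send you a long document split into {total} chunks. "
        f"Read each one and only respond with {ack} after each. "
        f"When I say DONE, {task}"
    )
    messages = [intro]
    for i, body in enumerate(chunks, start=1):
        messages.append(f"## Chunk {i} of {total}\n\n{body}")
    # Restate the task last so it sits at the end of the context.
    messages.append(f"DONE. Now {task}")
    return messages


if __name__ == "__main__":
    for message in build_message_sequence(["part one", "part two"],
                                          "summarize the architecture."):
        print(message)
        print("-----")
```

For n chunks this yields n + 2 messages, so an 8-chunk bundle takes 10 turns.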

Mostly yes. It splits on markdown headings (#, ##, ###) first, then blank-line paragraphs, then single lines, then hard character cuts only as a last resort. Code fenced in ``` won't be split mid-block unless a single block exceeds the chunk size, in which case it falls through to line-by-line splitting. For very long single functions, consider preprocessing your code before bundling, for example with ts-prune to find unused exports you can drop or ast-grep to pull out the relevant functions.
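The fallback order described above (headings, then paragraphs, then lines, then hard cuts) can be sketched as a recursive splitter followed by greedy packing. This is a simplified illustration, not the tool's code: the names `smart_split` and `_pieces` are hypothetical, sizes are in characters rather than tokens, and the fenced-code protection is deliberately left out.

```python
import re

# Boundaries tried in order, coarsest first. Each pattern is zero-width,
# so splitting never drops characters.
_BOUNDARIES = [
    re.compile(r"(?m)^(?=#{1,3} )"),  # just before a markdown heading (#, ##, ###)
    re.compile(r"(?<=\n\n)"),         # just after a blank line (paragraph break)
    re.compile(r"(?<=\n)"),           # just after any newline (single line)
]


def _pieces(text: str, max_chars: int, level: int = 0) -> list[str]:
    """Break text into pieces of at most max_chars using the coarsest boundary that works."""
    if len(text) <= max_chars:
        return [text]
    if level == len(_BOUNDARIES):
        # Last resort: hard character cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    out = []
    for part in _BOUNDARIES[level].split(text):
        if part:
            out.extend(_pieces(part, max_chars, level + 1))
    return out


def smart_split(text: str, max_chars: int) -> list[str]:
    """Greedily pack pieces into chunks that never exceed max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for piece in _pieces(text, max_chars):
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(smart_split("# A\nalpha\n# B\nbeta\n", max_chars=12))
```

Because every boundary is zero-width, joining the chunks reproduces the input exactly. Only a piece with no usable boundary inside it, such as one very long line, reaches the hard-cut branch.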

The cost shown is INPUT cost only: sending your text into the model once. It does NOT include: (1) the model's output tokens (typically 3-10× more expensive than input), (2) repeated sends if you re-send chunks for follow-up questions, (3) prompt caching discounts (Claude and OpenAI now offer 50-90% discount on cached prefix), or (4) batch API discounts (50% off if you can wait 24h). For a rough total, multiply the shown cost by ~2×; for accurate billing, check your provider dashboard.
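The arithmetic behind a per-million-token estimate is short enough to sketch. The prices in the example call are placeholders chosen for easy arithmetic, not real provider pricing, and the function name `estimate_cost` and its parameters are hypothetical.

```python
def estimate_cost(input_tokens: int, input_price_per_m: float,
                  output_tokens: int = 0, output_price_per_m: float = 0.0,
                  batch_discount: float = 0.0) -> float:
    """Cost in the pricing currency, given per-million-token prices."""
    if not 0 <= batch_discount < 1:
        raise ValueError("batch_discount must be in [0, 1)")
    cost = (input_tokens / 1_000_000) * input_price_per_m
    cost += (output_tokens / 1_000_000) * output_price_per_m
    return cost * (1 - batch_discount)


if __name__ == "__main__":
    # Placeholder prices: 3.00 per million input tokens, 15.00 per million output tokens.
    input_only = estimate_cost(270_000, 3.00)
    with_output = estimate_cost(270_000, 3.00, output_tokens=20_000, output_price_per_m=15.00)
    print(f"input only: {input_only:.2f}  with output: {with_output:.2f}")
```

Calling it with only the first two arguments corresponds to the input-only figure the tool displays; the optional arguments cover the output and batch items listed above.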

Yes, but with caveats. For embeddings (text-embedding-3, voyage-3, etc.) chunk sizes are typically 500-1500 tokens, much smaller than chat-context bundling. Set chunk size to 1,000 tokens and overlap to 10% (about 100 tokens) for a standard RAG pipeline. The smart paragraph-aware splitter is well-suited because RAG retrieval works best when each chunk represents a coherent semantic unit. Just remember that embedding models have their own context limits (8K-32K) which are usually larger than the per-chunk size you actually want.

50MB raw text via the file picker, which is roughly 13 million tokens — far beyond any current model's context. The browser handles up to ~100MB of text in a textarea without freezing on modern hardware. If you have larger inputs (entire codebases, multi-GB log files), preprocess with grep/ripgrep or a server-side script to extract the relevant slice before loading. Modern editors (VS Code, JetBrains) can also export folded/filtered views which work well as input to this tool.

After bundling, use the two download buttons above the chunk list. 'Download all (.md)' produces a single context-bundle.md containing a short instruction preamble plus every '## Chunk i of N' block in order — directly pasteable into a chat or feedable to a script. 'Download each (.txt)' saves chunk-01.txt, chunk-02.txt, … one file per chunk, which plug straight into file-based ingestion, batch API loops, or version control. This replaces clicking 'Copy with header' 20+ times for large codebases or transcripts and preserves ordering, since the filenames are zero-padded and sequential.
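For readers who want the same two outputs from a script, here is a minimal Python sketch. It follows the file names given above (context-bundle.md, chunk-01.txt, ...); the preamble wording, the widening of the zero padding beyond 99 chunks, and the function names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Assumed wording; the tool's actual preamble may differ.
PREAMBLE = "The following chunks are consecutive parts of one document.\n"


def chunk_filename(index: int, total: int) -> str:
    """Zero-padded name so files sort in order (at least two digits)."""
    width = max(2, len(str(total)))
    return f"chunk-{index:0{width}d}.txt"


def write_each(chunks: list[str], out_dir: Path) -> list[Path]:
    """Write one .txt file per chunk and return the paths in order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, body in enumerate(chunks, start=1):
        path = out_dir / chunk_filename(i, len(chunks))
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths


def write_combined(chunks: list[str], out_dir: Path) -> Path:
    """Write a single markdown file: preamble plus every labelled chunk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    total = len(chunks)
    blocks = [f"## Chunk {i} of {total}\n\n{body}" for i, body in enumerate(chunks, start=1)]
    path = out_dir / "context-bundle.md"
    path.write_text(PREAMBLE + "\n" + "\n\n".join(blocks) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print([p.name for p in write_each(["first", "second"], Path(tmp))])
```

Zero padding matters because a plain alphabetical sort would otherwise place chunk-10 before chunk-2.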

This is the 'lost-in-the-middle' problem: LLMs recall information at the start and end of a long context far better than the middle, and recall degrades as the number of chunks grows. Practical guidance: keep total chunks under ~10-15 for a single reasoning pass; for more than that, ask the model to summarize each chunk into a running outline rather than holding everything verbatim. Put the most important material first or last, and after the final chunk restate your actual question so it sits at the end of context. For 30-50 chunk corpora, a retrieval (RAG) approach that fetches only the relevant chunks beats stuffing them all in at once.
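The running-outline approach suggested above can be sketched as a loop that carries a summary forward instead of the full text. The `summarize` callable is a hypothetical stand-in for whatever model call you use; it is not a real library function, and the prompt wording in `final_prompt` is likewise an assumption.

```python
from typing import Callable, Sequence


def rolling_outline(chunks: Sequence[str],
                    summarize: Callable[[str, str], str]) -> str:
    """Fold chunks into one outline by calling summarize(outline_so_far, labelled_chunk)."""
    outline = ""
    total = len(chunks)
    for i, body in enumerate(chunks, start=1):
        labelled = f"## Chunk {i} of {total}\n\n{body}"
        outline = summarize(outline, labelled)
    return outline


def final_prompt(outline: str, question: str) -> str:
    """Place the question last so it sits at the end of the context."""
    return f"Running outline of the document:\n{outline}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Toy summarizer for demonstration only: keeps the first word of each chunk body.
    def toy_summarize(outline: str, chunk: str) -> str:
        first_word = chunk.split("\n\n", 1)[1].split()[0]
        return f"{outline}\n- {first_word}".strip()

    outline = rolling_outline(["alpha section text", "beta section text"], toy_summarize)
    print(final_prompt(outline, "How do the sections relate?"))
```

The model only ever sees the current outline plus one chunk, so the context stays short no matter how many chunks there are, at the price of whatever detail each summary step drops.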

The 3.7 chars/token heuristic is tuned to OpenAI's cl100k_base and is most accurate for GPT-4 on English and code (within about ±10%). GPT-4o and GPT-5 use the newer o200k_base, which tends to produce slightly fewer tokens for the same text, so the estimate errs on the safe side for them. Claude's tokenizer is close to cl100k, so estimates track within roughly ±10-12%. Gemini's SentencePiece tends to be slightly more efficient on prose, so this tool may modestly overcount its tokens (you get a safety margin). Llama 3's 128K vocab is also efficient on code and English, usually within ±15%. In all cases the estimate is conservative enough for chunk planning where you leave 10-20% headroom; for exact billing use tiktoken (OpenAI) or the official Anthropic/Google token-count endpoints.

Frequently Asked Questions

Why estimate tokens instead of actually counting them?

What overlap percentage should I use?

How should I prompt the model when sending multiple chunks?

Does the smart splitter preserve code blocks?

Why is the cost estimate so low / high?

Can I use this for embedding/RAG ingestion?

What's the maximum file size I can load?

How do I export every chunk at once to feed a script or pipeline?

How many chunks before the model forgets the earlier ones?

Is the 3.7 chars/token estimate accurate across Claude, Gemini and Llama?