How chunking math works
With chunk size C and overlap O, you advance through the corpus by a stride of C − O. For a corpus of T tokens, the number of chunks is approximately ceil((T − O) / (C − O)). The first chunk uses tokens 0 to C; the second uses tokens C − O to 2C − O; and so on.
Embedding cost scales with embed tokens — chunks × chunk size — not corpus tokens. With 50% overlap you embed roughly 2× the corpus. With no overlap, exactly 1×. Pick overlap deliberately.
Real splitters (LangChain, LlamaIndex) break at sentence or paragraph boundaries, so actual chunks come in slightly under the target size. Treat this estimator's numbers as an upper bound; real counts typically land within a few percent.
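To make the arithmetic concrete, here is a minimal TypeScript sketch of the same estimate (a hypothetical helper of our own, not part of any library):

```ts
// Estimate chunk count and embed tokens for a sliding-window splitter.
function estimateChunking(corpusTokens: number, chunkSize: number, overlap: number) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunk size, or the splitter never advances");
  }
  const stride = chunkSize - overlap;
  const chunks = Math.ceil((corpusTokens - overlap) / stride);
  const embedTokens = chunks * chunkSize; // each chunk re-embeds its overlap
  return { chunks, embedTokens, multiplier: embedTokens / corpusTokens };
}

// 100,000-token corpus, 500-token chunks, 50-token overlap:
console.log(estimateChunking(100_000, 500, 50));
// => { chunks: 223, embedTokens: 111500, multiplier: 1.115 }
```

At 50% overlap (overlap = 250) the same corpus produces 399 chunks and 199,500 embed tokens, the roughly 2× figure mentioned above.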
FAQ
- How does the math work?
- Stride = chunk_size − overlap. For total T tokens: chunks ≈ ceil((T − overlap) / stride). Embed tokens = chunks × chunk_size, since each chunk re-embeds its overlap with neighbours.
- What chunk size should I use?
- 300–800 tokens is a reasonable default. Smaller chunks improve recall on narrow questions; larger chunks preserve more context per match. Pair with reranking if you go small.
- How much overlap is right?
- 10–20% of chunk size is a common starting point (e.g., 50–100 tokens for 500-token chunks). Larger overlap costs more to embed and store but reduces "answer split across boundary" failures.
- Why does the embed-token total exceed the corpus total?
- Overlap. Each overlapping region gets embedded in two neighbouring chunks. Storage and embedding cost are proportional to embed tokens, not corpus tokens.
- Does the tool match my embedding model's tokenizer exactly?
- Close enough for sizing. We use the o200k tokenizer; OpenAI's text-embedding-3 family uses cl100k. Chunk counts differ by a few percent in either direction. For exact numbers, run your real splitter and count.
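If you want to see the tokenizer gap on your own text, here is a minimal sketch using the js-tiktoken package (our choice of library, assumed installed via npm i js-tiktoken):

```ts
import { getEncoding } from "js-tiktoken";

const sample = "Paste a representative slice of your corpus here.";

const o200k = getEncoding("o200k_base");   // what this estimator uses
const cl100k = getEncoding("cl100k_base"); // what text-embedding-3-* uses

console.log("o200k tokens: ", o200k.encode(sample).length);
console.log("cl100k tokens:", cl100k.encode(sample).length);
```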
Common pitfalls
- Setting overlap ≥ chunk size. The stride (chunk_size − overlap) drops to zero or below, so the splitter never advances and the chunk count blows up.
- Using a chunk size larger than your embedding model's max input (8191 tokens for OpenAI's embedding models). The API either rejects the request or silently truncates it.
- Re-embedding the entire corpus on every minor change. Use content-hashed chunks so unchanged chunks aren't re-embedded (see the sketch after this list).
- Forgetting to budget for re-indexing. If you tweak chunk size after launch, your full corpus needs re-embedding.
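Here is a minimal sketch of the content-hashing approach in TypeScript, using Node's built-in crypto module; seenHashes and embedAndStore are hypothetical stand-ins for your index and embedding pipeline:

```ts
import { createHash } from "node:crypto";

// Hypothetical: hashes persisted from the previous indexing run.
const seenHashes = new Set<string>();

// Hypothetical stand-in for your embedding call + vector-store upsert.
async function embedAndStore(id: string, chunk: string): Promise<void> {
  // embed `chunk`, then upsert it under `id`
}

function chunkHash(chunk: string): string {
  return createHash("sha256").update(chunk).digest("hex");
}

async function indexChunks(chunks: string[]): Promise<void> {
  for (const chunk of chunks) {
    const id = chunkHash(chunk);
    if (seenHashes.has(id)) continue; // unchanged chunk: skip re-embedding
    seenHashes.add(id);
    await embedAndStore(id, chunk);
  }
}
```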
In your code
JavaScript (LangChain):

```bash
npm i @langchain/textsplitters
```

```js
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,   // by default, measured in characters, not tokens
  chunkOverlap: 50, // shared between neighbouring chunks
});

const chunks = await splitter.splitText(longText);
console.log(chunks.length);
```

Python (LlamaIndex):

```bash
pip install llama-index-core
```

```python
from llama_index.core.node_parser import SentenceSplitter

# chunk_size and chunk_overlap are in tokens; splits respect sentence boundaries
splitter = SentenceSplitter(chunk_size=500, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)
print(len(nodes))
```

Related tools
- Token Counter
Count tokens for GPT-4o, GPT-4, GPT-3.5 and more — tokenizer-exact, runs in your browser.
- Multi-model Token Compare
Side-by-side token counts for the same input across GPT-4o, GPT-4 Turbo, GPT-3.5, Claude, Gemini.
- Context Window Calculator
Pick a model, paste your prompt, see how much context you have left after reserving output tokens.
- Embedding Dimension Reference
Reference table of embedding model output dimensions, max input tokens, and pricing.