How chunking math works
With chunk size C and overlap O, you advance through the corpus by a stride of C − O. For a corpus of T tokens, the number of chunks is approximately ceil((T − O) / (C − O)). The first chunk uses tokens 0 to C; the second uses tokens C − O to 2C − O; and so on.
Embedding cost scales with embed tokens — chunks × chunk size — not corpus tokens. With 50% overlap you embed roughly 2× the corpus. With no overlap, exactly 1×. Pick overlap deliberately.
Real splitters (LangChain, LlamaIndex) break at sentence or paragraph boundaries, so actual chunks come in slightly under the target size. Treat this estimator's numbers as an upper bound; real counts typically land within a few percent.
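To make the arithmetic concrete, here is a minimal TypeScript sketch of the same estimate (a hypothetical helper of our own, not part of any library):

```ts
// Estimate chunk count and embed tokens for a sliding-window splitter.
function estimateChunking(corpusTokens: number, chunkSize: number, overlap: number) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunk size, or the splitter never advances");
  }
  const stride = chunkSize - overlap;
  const chunks = Math.ceil((corpusTokens - overlap) / stride);
  const embedTokens = chunks * chunkSize; // each chunk re-embeds its overlap
  return { chunks, embedTokens, multiplier: embedTokens / corpusTokens };
}

// 100,000-token corpus, 500-token chunks, 50-token overlap:
console.log(estimateChunking(100_000, 500, 50));
// => { chunks: 223, embedTokens: 111500, multiplier: 1.115 }
```

At 50% overlap (overlap = 250) the same corpus produces 399 chunks and 199,500 embed tokens, the roughly 2× figure mentioned above.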
FAQ
- How does the math work?
- Stride = chunk_size − overlap. For total T tokens: chunks ≈ ceil((T − overlap) / stride). Embed tokens = chunks × chunk_size, since each chunk re-embeds its overlap with neighbours.
- What chunk size should I use?
- 300–800 tokens is a reasonable default. Smaller chunks improve recall on narrow questions; larger chunks preserve more context per match. Pair with reranking if you go small.
- How much overlap is right?
- 10–20% of chunk size is a common starting point (e.g., 50–100 tokens for 500-token chunks). Larger overlap costs more to embed and store but reduces "answer split across boundary" failures.
- Why does the embed-token total exceed the corpus total?
- Overlap. Each overlapping region gets embedded in two neighbouring chunks. Storage and embedding cost are proportional to embed tokens, not corpus tokens.
- Does the tool match my embedding model's tokenizer exactly?
- Close enough for sizing. We use the o200k tokenizer; OpenAI's text-embedding-3 family uses cl100k. Chunk counts differ by a few percent in either direction. For exact numbers, run your real splitter and count.
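If you want to see the tokenizer gap on your own text, here is a minimal sketch using the js-tiktoken package (our choice of library, assumed installed via npm i js-tiktoken):

```ts
import { getEncoding } from "js-tiktoken";

const sample = "Paste a representative slice of your corpus here.";

const o200k = getEncoding("o200k_base");   // what this estimator uses
const cl100k = getEncoding("cl100k_base"); // what text-embedding-3-* uses

console.log("o200k tokens: ", o200k.encode(sample).length);
console.log("cl100k tokens:", cl100k.encode(sample).length);
```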
Common pitfalls
- Setting overlap ≥ chunk size. The stride (chunk_size − overlap) drops to zero or below, so the splitter never advances and the chunk count blows up.
- Using a chunk size larger than your embedding model's max input (8191 tokens for OpenAI's embedding models). The API either rejects the request or silently truncates it.
- Re-embedding the entire corpus on every minor change. Use content-hashed chunks so unchanged chunks aren't re-embedded (see the sketch after this list).
- Forgetting to budget for re-indexing. If you tweak chunk size after launch, your full corpus needs re-embedding.
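Here is a minimal sketch of the content-hashing approach in TypeScript, using Node's built-in crypto module; seenHashes and embedAndStore are hypothetical stand-ins for your index and embedding pipeline:

```ts
import { createHash } from "node:crypto";

// Hypothetical: hashes persisted from the previous indexing run.
const seenHashes = new Set<string>();

// Hypothetical stand-in for your embedding call + vector-store upsert.
async function embedAndStore(id: string, chunk: string): Promise<void> {
  // embed `chunk`, then upsert it under `id`
}

function chunkHash(chunk: string): string {
  return createHash("sha256").update(chunk).digest("hex");
}

async function indexChunks(chunks: string[]): Promise<void> {
  for (const chunk of chunks) {
    const id = chunkHash(chunk);
    if (seenHashes.has(id)) continue; // unchanged chunk: skip re-embedding
    seenHashes.add(id);
    await embedAndStore(id, chunk);
  }
}
```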
In your code
JavaScript (LangChain):

```bash
npm i @langchain/textsplitters
```

```js
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,   // by default, measured in characters, not tokens
  chunkOverlap: 50, // shared between neighbouring chunks
});

const chunks = await splitter.splitText(longText);
console.log(chunks.length);
```

Python (LlamaIndex):

```bash
pip install llama-index-core
```

```python
from llama_index.core.node_parser import SentenceSplitter

# chunk_size and chunk_overlap are in tokens; splits respect sentence boundaries
splitter = SentenceSplitter(chunk_size=500, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(docs)
print(len(nodes))
```

Related tools
- Token Counter
Count tokens for GPT-4o, GPT-4, GPT-3.5 and more — tokenizer-exact, runs in your browser.
- Multi-model Token Compare
Side-by-side token counts for the same input across GPT-4o, GPT-4 Turbo, GPT-3.5, Claude, Gemini.
- Context Window Calculator
Pick a model, paste your prompt, see how much context you have left after reserving output tokens.
- Embedding Dimension Reference
Reference table of embedding model output dimensions, max input tokens, and pricing.