
RAG Chunk Estimator

Estimate chunk count and embedding token spend from chunk size + overlap + corpus size.

Paste a sample and the tokenizer counts it; the estimator then reports total tokens, chunk count, embed tokens (with overlap), and last chunk size.

With overlap, you re-embed the overlap region in each adjacent chunk — that's why "Embed tokens" exceeds total tokens. Cost scales with the embed-tokens column.

How chunking math works

With chunk size C and overlap O, you advance through the corpus by a stride of C − O. For a corpus of T tokens, the number of chunks is approximately ceil((T − O) / (C − O)). The first chunk uses tokens 0 to C; the second uses tokens C − O to 2C − O; and so on.

Embedding cost scales with embed tokens — chunks × chunk size — not corpus tokens. With 50% overlap you embed roughly 2× the corpus. With no overlap, exactly 1×. Pick overlap deliberately.
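Below is a minimal sketch of that arithmetic in Python; the function name and variables are illustrative, not part of any library, and it assumes you already know the corpus token count.

import math

def estimate_chunks(total_tokens: int, chunk_size: int, overlap: int):
    """Estimate chunk count, embed tokens, and last-chunk size for a fixed-size splitter."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    if total_tokens <= chunk_size:
        return 1, total_tokens, total_tokens
    chunks = math.ceil((total_tokens - overlap) / stride)
    embed_tokens = chunks * chunk_size              # slight overestimate: the last chunk is shorter
    last_chunk = total_tokens - (chunks - 1) * stride
    return chunks, embed_tokens, last_chunk

# 100,000-token corpus, 500-token chunks, 50-token overlap
print(estimate_chunks(100_000, 500, 50))            # (223, 111500, 100)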

Real splitters (LangChain, LlamaIndex) chunk at sentence or paragraph boundaries, so chunks are slightly smaller than the target. Treat this estimator's number as an upper bound; reality is within a few percent.

FAQ

How does the math work?
Stride = chunk_size − overlap. For total T tokens: chunks ≈ ceil((T − overlap) / stride). Embed tokens = chunks × chunk_size, since each chunk re-embeds its overlap with neighbours.
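For example, a 100,000-token corpus with 500-token chunks and 50-token overlap: stride = 450, chunks ≈ ceil(99,950 / 450) = 223, embed tokens ≈ 223 × 500 = 111,500, roughly 1.1× the corpus.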
What chunk size should I use?
300–800 tokens is a reasonable default. Smaller chunks improve recall on narrow questions; larger chunks preserve more context per match. Pair with reranking if you go small.
How much overlap is right?
10–20% of chunk size is a common starting point (e.g., 50–100 tokens for 500-token chunks). Larger overlap costs more to embed and store but reduces "answer split across boundary" failures.
Why does the embed-token total exceed the corpus total?
Overlap. Each overlapping region gets embedded in two neighbouring chunks. Storage and embedding cost are proportional to embed tokens, not corpus tokens.
Does the tool match my embedding model's tokenizer exactly?
Close enough for sizing. We use the o200k tokenizer; OpenAI's text-embedding-3 family uses cl100k, so the chunk count differs by a few percent in either direction. For exact numbers, run your real splitter and count the output.
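If you want to check the gap for your own text, here is a small sketch using the tiktoken package (assumes tiktoken is installed; o200k_base needs a reasonably recent version):

import tiktoken

text = "paste a representative sample here"
o200k = tiktoken.get_encoding("o200k_base")    # what this estimator uses
cl100k = tiktoken.get_encoding("cl100k_base")  # what the text-embedding-3 models use
print(len(o200k.encode(text)), len(cl100k.encode(text)))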

Common pitfalls

  • Setting overlap ≥ chunk size. The stride becomes zero or negative, so the splitter never advances and the chunk count blows up.
  • Using a chunk size larger than your embedding model's max input (8191 tokens for OpenAI's embedding models). Depending on settings, the API rejects the request or silently truncates the input.
  • Re-embedding the entire corpus on every minor change. Use content-hashed chunks so unchanged chunks aren't re-embedded (see the sketch after this list).
  • Forgetting to budget for re-indexing. If you tweak chunk size after launch, your full corpus needs re-embedding.
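A minimal sketch of the content-hash idea in Python; the helper names and the in-memory set are illustrative, and in practice you'd persist the keys alongside your vectors:

import hashlib

def chunk_key(chunk: str) -> str:
    """Stable key derived from the chunk's exact text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

def chunks_to_embed(chunks: list[str], already_embedded: set[str]) -> list[str]:
    """Return only the chunks whose content hash hasn't been embedded yet."""
    return [c for c in chunks if chunk_key(c) not in already_embedded]

seen = {chunk_key("an unchanged chunk")}
print(chunks_to_embed(["an unchanged chunk", "a brand new chunk"], seen))  # ['a brand new chunk']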

In your code

LangChain RecursiveCharacterTextSplitter (JS, npm)
npm i @langchain/textsplitters
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// chunkSize and chunkOverlap are measured in characters by default, not tokens
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});
const longText = 'your corpus text here';
const chunks = await splitter.splitText(longText);
console.log(chunks.length);
LlamaIndex SentenceSplitter (Python, PyPI)
pip install llama-index-core
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

docs = [Document(text="your corpus text here")]
splitter = SentenceSplitter(chunk_size=500, chunk_overlap=50)  # sizes are in tokens
nodes = splitter.get_nodes_from_documents(docs)
print(len(nodes))

Related tools