How it works
Every chat completion works against a fixed budget: the context window. That budget is split between everything you send (system prompt + history + tool definitions + RAG passages + the current user message) and everything the model sends back. If input plus reserved output exceeds the limit, the API either rejects the call or silently truncates.
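As a sketch of that arithmetic with js-tiktoken (the strings, window size, and output reserve below are placeholders, not recommendations):

```js
import { encodingForModel } from 'js-tiktoken';

// Placeholder inputs — substitute your own.
const systemPrompt = 'You are a concise assistant.';
const history = 'user: hi\nassistant: hello';
const toolSchemas = JSON.stringify([{ name: 'search', parameters: { type: 'object' } }]);
const ragPassages = 'Retrieved passage text...';
const userMessage = 'Summarise the passages above.';

const CONTEXT_WINDOW = 128000; // gpt-4o
const RESERVED_OUTPUT = 4096;  // held back for the model's reply

const enc = encodingForModel('gpt-4o');
const count = (s) => enc.encode(s).length;

// Everything you send shares one input budget.
const inputTokens = [systemPrompt, history, toolSchemas, ragPassages, userMessage]
  .reduce((sum, part) => sum + count(part), 0);

console.log(inputTokens + RESERVED_OUTPUT <= CONTEXT_WINDOW ? 'fits' : 'over budget');
```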
For OpenAI models, this tool counts your input exactly via js-tiktoken. For Anthropic and Google, we approximate using OpenAI's tokenizer because their tokenizers aren't publicly distributed in JS — close enough for sanity checks, not for billing-precise estimates.
The bar shows three regions: input (cyan, what you've used), reserved output (paper grey, what you've held back), and free space. If the bar overflows, you see how many tokens to trim.
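The three regions reduce to a little arithmetic. A standalone sketch with made-up numbers:

```js
// Hypothetical figures for illustration only.
const CONTEXT_WINDOW = 128000;
const inputTokens = 126500;   // measured with a tokenizer
const reservedOutput = 4096;  // held back for the reply

const free = Math.max(0, CONTEXT_WINDOW - inputTokens - reservedOutput);
const overflow = Math.max(0, inputTokens + reservedOutput - CONTEXT_WINDOW);

console.log({ input: inputTokens, reserved: reservedOutput, free, tokensToTrim: overflow });
// -> { input: 126500, reserved: 4096, free: 0, tokensToTrim: 2596 }
```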
FAQ
- Why reserve output tokens up front?
- The model's output competes with your input for the same context budget. If you cram input right up to the limit, the response gets truncated mid-sentence. Reserving 2–8k for output is standard.
- Why are non-OpenAI models marked "approximate"?
- Anthropic and Google don't ship a JS tokenizer that exactly matches their server-side BPE. We use OpenAI's as a stand-in (≈10% off in practice; see the sketch after this list). For production sizing, call the provider's own count_tokens endpoint.
- Does the system prompt count?
- Yes. System prompt + conversation history + retrieved context + the user's current message all share the input budget. Forget the system prompt at your peril — it's often hundreds of tokens.
- Why does Gemini 2 Pro advertise a 2M-token context when the API caps requests shorter?
- Some context-window claims are nominal — long-context models often have throughput, latency, or pricing surcharges past a certain length. Read the provider's docs for "actual usable" limits.
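For the "approximate" counts mentioned above, one way to stay safe is to count with OpenAI's tokenizer and pad the result. A sketch, assuming a ~10% margin (the margin and model name are illustrative):

```js
import { encodingForModel } from 'js-tiktoken';

// Estimate tokens for a non-OpenAI model by counting with OpenAI's
// tokenizer and padding upward. The 10% margin is an assumption,
// not a provider guarantee; use the provider's count_tokens API
// when you need exact numbers.
function approxTokens(text, margin = 0.10) {
  const openAiCount = encodingForModel('gpt-4o').encode(text).length;
  return Math.ceil(openAiCount * (1 + margin));
}

console.log(approxTokens('A prompt bound for Claude or Gemini.'));
```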
Common pitfalls
- Letting the model decide output length on a near-full input — outputs get truncated mid-sentence.
- Forgetting to count function-calling schemas. Tool definitions live in the input budget too (see the sketch after this list).
- Sending PDF / image bytes as base64 in the prompt and not realising that every kilobyte costs hundreds of tokens once encoded.
- Treating Gemini's 2M context as free real-estate. Latency and price scale with usage.
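For the tool-schema pitfall above, one rough check is to serialise the definitions and count them like any other input. A sketch (the schema is hypothetical, and providers inject schemas in their own format, so treat the number as an estimate):

```js
import { encodingForModel } from 'js-tiktoken';

// A hypothetical function-calling schema; yours will differ.
const tools = [{
  name: 'get_weather',
  description: 'Look up current weather for a city',
  parameters: {
    type: 'object',
    properties: { city: { type: 'string' } },
    required: ['city'],
  },
}];

// Counting the raw JSON is only an estimate: providers reformat
// schemas before injecting them into the prompt.
const enc = encodingForModel('gpt-4o');
const toolTokens = enc.encode(JSON.stringify(tools)).length;
console.log(`tool definitions ≈ ${toolTokens} tokens of input budget`);
```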
In your code
Exact counting for OpenAI models with js-tiktoken:

```
npm i js-tiktoken
```

```js
import { encodingForModel } from 'js-tiktoken';

// Known context windows for the models this helper supports.
function fitsInContext(text, model = 'gpt-4o', reservedOutput = 2048) {
  const ctx = { 'gpt-4o': 128000, 'gpt-4o-mini': 128000 }[model];
  const used = encodingForModel(model).encode(text).length;
  return used + reservedOutput <= ctx;
}
```

Provider-exact counting for Claude via Anthropic's count_tokens endpoint:

```
pip install anthropic
```

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
long_prompt = 'your prompt text here'  # replace with the text you want to size

result = client.messages.count_tokens(
    model='claude-3-7-sonnet-20250219',
    messages=[{'role': 'user', 'content': long_prompt}],
)
print(result.input_tokens)
```

Related tools
- Token Counter
Count tokens for GPT-4o, GPT-4, GPT-3.5 and more — tokenizer-exact, runs in your browser.
- Multi-model Token Compare
Side-by-side token counts for the same input across GPT-4o, GPT-4 Turbo, GPT-3.5, Claude, Gemini.
- RAG Chunk Estimator
Estimate chunk count and embedding token spend from chunk size + overlap + corpus size.
- Embedding Dimension Reference
Reference table of embedding model output dimensions, max input tokens, and pricing.