
Context Window Calculator

Pick a model, paste your prompt, and see how much context you have left after reserving output tokens.

[Interactive calculator: choose a model and paste a prompt. It shows the model's max output (16,384 here), input usage against the 128,000-token window, and the max input that fits after the reserve: 125,952 tokens, with 2,048 (1.6%) held back for output.]

How it works

Every chat completion has a fixed budget: the context window. That budget is split between everything you send (system prompt + history + tools + RAG passages + the current user message) and everything the model sends back. If input plus reserved output exceeds the limit, the API either rejects the call or silently truncates.
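
In other words, the usable input budget is simply the window minus the reserve. A sketch in JavaScript using the default numbers shown above:

// The whole context window is shared between input and reserved output.
const contextWindow = 128000;                     // e.g. a 128k model
const reservedOutput = 2048;                      // tokens held back for the reply
const maxInput = contextWindow - reservedOutput;  // 125,952 tokens left for input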

For OpenAI models, this tool counts your input exactly via js-tiktoken. For Anthropic and Google, we approximate using OpenAI's tokenizer because their tokenizers aren't publicly distributed in JS — close enough for sanity checks, not for billing-precise estimates.
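
A minimal sketch of that counting, assuming js-tiktoken's GPT-4o encoding (exact for OpenAI models, only a rough proxy for Anthropic and Google):

import { encodingForModel } from 'js-tiktoken';

const enc = encodingForModel('gpt-4o');

// Exact for OpenAI models; treat the number as approximate for other providers.
const tokenCount = enc.encode('Paste your prompt here').length;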

The bar shows three regions: input (cyan, what you've used), reserved output (paper grey, what you've held back), and free space. If the bar overflows, you see how many tokens to trim.
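
That trim figure is plain arithmetic; a sketch with hypothetical numbers:

// Hypothetical: 130,500 tokens pasted against a 128k window with a 2,048-token reserve.
const usedTokens = 130500;
const maxInput = 128000 - 2048;                           // 125,952
const tokensToTrim = Math.max(0, usedTokens - maxInput);  // 4,548 tokens to cut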

FAQ

Why reserve output tokens up front?
The model's output competes with your input for the same context budget. If you cram input right up to the limit, the response gets truncated mid-sentence. Reserving 2–8k tokens for output is standard.
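
In practice you enforce the reserve by passing it as the output cap on the request. A sketch assuming the official openai npm package:

import OpenAI from 'openai';

const client = new OpenAI();

// Cap the reply at the reserved budget so input + output stays inside the window.
const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarise the meeting notes.' }],
  max_tokens: 2048, // the tokens you reserved for output
});
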
Why are non-OpenAI models marked "approximate"?
Anthropic and Google don't ship a JS tokenizer that exactly matches their server-side BPE. We use OpenAI's as a stand-in, which in practice lands within roughly 10% of the true count. For production sizing, call the provider's own count_tokens endpoint (see the Anthropic snippet under "In your code" below).
Does the system prompt count?
Yes. System prompt + conversation history + retrieved context + the user's current message all share the input budget. Forget the system prompt at your peril — it's often hundreds of tokens.
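
A sketch of counting the whole input side with js-tiktoken (countConversation is our own helper name, and providers add a few framing tokens per message on top):

import { encodingForModel } from 'js-tiktoken';

const enc = encodingForModel('gpt-4o');

// System prompt + history + the current message all draw from the same budget.
function countConversation(messages) {
  return messages.reduce((sum, m) => sum + enc.encode(m.content).length, 0);
}

const total = countConversation([
  { role: 'system', content: 'You are a terse assistant.' },
  { role: 'user', content: 'Explain context windows.' },
]); // the system prompt is included in total
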
Why does Gemini 2 Pro advertise a 2M context when the API caps requests at a shorter length?
Some context-window claims are nominal — long-context models often have throughput, latency, or pricing surcharges past a certain length. Read the provider's docs for "actual usable" limits.

Common pitfalls

  • Letting the model decide output length on a near-full input — outputs get truncated mid-sentence.
  • Forgetting to count function-calling schemas. Tool definitions live in the input budget too (see the sketch after this list).
  • Sending PDF / image bytes as base64 in the prompt and not realising each kilobyte ≈ 200 tokens.
  • Treating Gemini's 2M context as free real estate. Latency and price scale with usage.
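
A sketch of counting the less obvious inputs from the list above; the tool schema and file name are made-up examples:

import { readFileSync } from 'node:fs';
import { encodingForModel } from 'js-tiktoken';

const enc = encodingForModel('gpt-4o');

// Function-calling schemas are serialised into the prompt, so count them too.
const toolSchema = {
  name: 'get_weather',
  description: 'Look up the current weather for a city',
  parameters: {
    type: 'object',
    properties: { city: { type: 'string' } },
    required: ['city'],
  },
};
const toolTokens = enc.encode(JSON.stringify(toolSchema)).length;

// Base64 payloads count as ordinary text: measure the encoded string, not the raw bytes.
const fileBase64 = readFileSync('report.pdf').toString('base64');
const fileTokens = enc.encode(fileBase64).length;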

In your code

js-tiktoken (JavaScript, npm)
npm i js-tiktoken
import { encodingForModel } from 'js-tiktoken';

// Context windows for the models this helper knows about.
const CONTEXT = { 'gpt-4o': 128000, 'gpt-4o-mini': 128000 };

function fitsInContext(text, model = 'gpt-4o', reservedOutput = 2048) {
  const ctx = CONTEXT[model];
  if (!ctx) throw new Error(`Unknown model: ${model}`);
  // Exact count for OpenAI models via js-tiktoken's BPE ranks.
  const used = encodingForModel(model).encode(text).length;
  return used + reservedOutput <= ctx;
}
Anthropic count_tokens (Python, PyPI)
pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_prompt = '...'  # the text you want to size

# Server-side count: exact for Anthropic models, unlike local approximations.
result = client.messages.count_tokens(
    model='claude-3-7-sonnet-20250219',
    messages=[{'role': 'user', 'content': long_prompt}],
)
print(result.input_tokens)
