
Multi-model Token Compare

Side-by-side token counts for the same input across GPT-4o, GPT-4 Turbo, GPT-3.5, Claude, and Gemini.

Sample input: 115 chars

Token count per model (lowest: 40)

Model               Vendor     Tokens  Chars/token  Tokenizer
GPT-4o              OpenAI     40      2.88         o200k_base (cheapest)
GPT-4 Turbo         OpenAI     46      2.50         cl100k_base
GPT-3.5 Turbo       OpenAI     46      2.50         cl100k_base
Claude 3.7 Sonnet   Anthropic  40      2.88         approx via o200k
Gemini 2.0 Pro      Google     40      2.88         approx via o200k

A lower count means denser tokenization, and a cheaper call at the same per-token rate. The Anthropic and Google rows are approximate (typically within ~10%).

Why counts differ across models

Each model family is trained with its own BPE tokenizer. OpenAI's o200k_base (GPT-4o family) has ~200,000 sub-word units; cl100k_base (GPT-4, GPT-3.5) has ~100,000. Larger vocabularies map more characters to a single token, which means lower token counts for the same input — especially for non-English text where common multi-character sequences get their own merged token.
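
You can see the vocabulary effect directly by encoding the same string under both encodings with js-tiktoken's getEncoding (a minimal sketch; exact counts depend on the input):

import { getEncoding } from 'js-tiktoken';

const o200k = getEncoding('o200k_base');   // GPT-4o family
const cl100k = getEncoding('cl100k_base'); // GPT-4 / GPT-3.5

const text = 'Vocabulary size drives token density.';
console.log('o200k_base:', o200k.encode(text).length, 'tokens');
console.log('cl100k_base:', cl100k.encode(text).length, 'tokens');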

For pure English, the gap is usually 5–15%. For CJK or code, the o200k tokenizer can be 30–50% denser. That's why the same RAG passage might cost noticeably different amounts on GPT-4o vs GPT-3.5.
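
A quick way to measure the density gap on your own content (a sketch; the sample strings below are illustrative, and the 30–50% figure varies with the actual text):

import { getEncoding } from 'js-tiktoken';

// Characters per token: higher = denser tokenization.
function charsPerToken(text, encodingName) {
  return text.length / getEncoding(encodingName).encode(text).length;
}

const en = 'Retrieval-augmented generation grounds answers in documents.';
const ja = '検索拡張生成は回答を文書に基づかせる手法です。'; // illustrative CJK sample
for (const sample of [en, ja]) {
  console.log(charsPerToken(sample, 'o200k_base'), charsPerToken(sample, 'cl100k_base'));
}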

FAQ

Why does GPT-4o usually count fewer tokens than GPT-3.5?
GPT-4o uses o200k_base, which has roughly twice the vocabulary of cl100k_base (used by GPT-4 / GPT-3.5). Larger vocab packs more text per token, especially for multilingual content.
How accurate are the Claude / Gemini rows?
They're approximations using OpenAI's tokenizer as a proxy. In practice it's within ~10% for most English content; CJK can drift further. For accurate sizing, call Anthropic's count_tokens or Gemini's tokenization endpoint server-side.
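
A server-side sketch with the official SDKs (the model ids below are illustrative, and the exact SDK surface may vary by version):

import Anthropic from '@anthropic-ai/sdk';
import { GoogleGenerativeAI } from '@google/generative-ai';

const text = 'Your prompt here';

// Anthropic: count_tokens via the Messages API (reads ANTHROPIC_API_KEY).
const anthropic = new Anthropic();
const { input_tokens } = await anthropic.messages.countTokens({
  model: 'claude-3-7-sonnet-latest', // illustrative model id
  messages: [{ role: 'user', content: text }],
});

// Gemini: countTokens on a model instance.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: 'gemini-2.0-pro-exp' }); // illustrative
const { totalTokens } = await model.countTokens(text);
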
Why care about the difference?
Two reasons: cost (you pay per input + output token) and context fit (different models have different windows). A prompt that fits comfortably in GPT-4o's 128k window costs less to run on GPT-4o-mini (same tokenizer, lower per-token rate), but its token count can balloon on GPT-3.5, which tokenizes heavy CJK content far less densely.
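
A back-of-the-envelope input-cost estimate (a sketch: the per-million-token rates below are placeholders, so substitute your provider's current pricing):

import { encodingForModel } from 'js-tiktoken';

// Placeholder $ per 1M input tokens -- not live prices.
const RATES = { 'gpt-4o': 2.5, 'gpt-4o-mini': 0.15, 'gpt-3.5-turbo': 0.5 };

function estimateInputCost(text, model) {
  const tokens = encodingForModel(model).encode(text).length;
  return { model, tokens, usd: (tokens / 1_000_000) * RATES[model] };
}
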
Why doesn't the page show DeepSeek / Mistral / Cohere?
Their tokenizers aren't available as a JS package today. We could approximate, but the approximation drifts more than it does for the OpenAI-aligned families; we'd rather omit than mislead.

Common pitfalls

  • Estimating token cost on one model and shipping with another, then being surprised by the bill.
  • Trusting a "Claude tokenizer" npm package that's actually just tiktoken renamed.
  • Forgetting that tool / function definitions count as input tokens too (see the sketch after this list).
  • Comparing chars, not tokens. Char-based estimates miss the real cost driver by a wide margin.
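
On the tool-definition pitfall: a rough way to gauge that overhead is to tokenize the serialized schema (a sketch; the provider's internal rendering of tool definitions differs from raw JSON, so treat this as an estimate only):

import { encodingForModel } from 'js-tiktoken';

const tools = [{
  type: 'function',
  function: {
    name: 'get_weather', // hypothetical tool
    description: 'Get the current weather for a city',
    parameters: {
      type: 'object',
      properties: { city: { type: 'string' } },
      required: ['city'],
    },
  },
}];

// Rough estimate: tokenize the JSON payload you would send.
const enc = encodingForModel('gpt-4o');
console.log(enc.encode(JSON.stringify(tools)).length, 'tokens (approx)');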

In your code

js-tiktoken (npm)
npm i js-tiktoken
import { encodingForModel } from 'js-tiktoken';

// encodingForModel maps a model name to its BPE encoding
// (gpt-4o -> o200k_base; gpt-4-turbo / gpt-3.5-turbo -> cl100k_base).
function compareModels(text) {
  return ['gpt-4o', 'gpt-4-turbo', 'gpt-3.5-turbo'].map(m => ({
    model: m,
    tokens: encodingForModel(m).encode(text).length,
  }));
}
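
Example usage (counts are illustrative and depend on the input):

console.log(compareModels('The quick brown fox jumps over the lazy dog.'));
// e.g. [ { model: 'gpt-4o', tokens: ... }, { model: 'gpt-4-turbo', tokens: ... }, ... ]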
