How rate limits work
Rate limits are typically enforced over a one-minute sliding window. When you exceed a limit, provider APIs return 429 Too Many Requests with a retry-after header indicating how long to wait. Two limits usually apply concurrently: requests per minute (RPM) and tokens per minute (TPM); you're throttled by whichever you hit first.
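As a concrete illustration, here is a minimal sketch of honoring that header with a plain HTTP client. API_URL and the payload shape are hypothetical placeholders, not any specific provider's API:

```python
import time

import requests

API_URL = "https://api.example.com/v1/chat"  # hypothetical endpoint, not a real provider URL

def call_with_retry_after(payload, max_attempts=5):
    """POST once per attempt, honoring the retry-after header on 429s."""
    for attempt in range(max_attempts):
        resp = requests.post(API_URL, json=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Providers generally send retry-after as a number of seconds;
        # fall back to a short pause if the header is missing.
        time.sleep(float(resp.headers.get("retry-after", 1)))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```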
New accounts and newly added models start with conservative default limits, which scale up with cumulative usage and account age. For predictable production loads, file a limit-increase request before launch; responses usually arrive within a day.
FAQ
- Why are limits structured as both RPM and TPM?
- Either can throttle you. RPM caps the number of calls per minute regardless of size; TPM caps the total tokens flowing through. A workload with many small calls hits RPM first; one with few large calls hits TPM first. For example, at 500 RPM and 100,000 TPM, requests averaging 150 tokens hit the RPM cap first (500 × 150 = 75,000 TPM), while requests averaging 4,000 tokens hit the TPM cap at just 25 requests.
- How do I get a tier upgrade?
- OpenAI and Anthropic both upgrade tiers automatically based on accumulated spend and account age, and usually publish the tier ladder in the account dashboard. For larger jumps, contact sales.
- What happens if I exceed the limit?
- You get a 429 response. Most SDKs retry with exponential backoff; if yours doesn't, a sketch follows this FAQ. Persistent breaches risk temporary suspension; the dashboard surfaces breach counts.
- Why are the numbers in this tool so different from what I see in my dashboard?
- These are public defaults. Your actual limits depend on tier, region, and any custom limits the provider has granted; always verify against your own dashboard.
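If your SDK doesn't retry for you, the standard fix is capped exponential backoff with jitter. A minimal sketch; RateLimitError here is a stand-in for whatever exception your client raises on a 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception type."""

def backoff_retry(call, max_attempts=6, base=1.0, cap=60.0):
    """Retry `call` on rate-limit errors with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; let the caller handle it
            # Full jitter: sleep a random amount up to the capped exponential
            # delay, so many clients don't retry in lockstep and re-collide.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter is the point: if every client slept exactly 1, 2, 4, 8 seconds, they would all return at the same instant and trigger the next 429 together.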
Common pitfalls
- Sizing infrastructure for a startup-tier limit when you should be at a higher tier already (check the dashboard).
- Forgetting parallelism: 10 workers each making 50 RPM is 500 RPM in aggregate. A shared limiter helps (see the sketch after this list).
- Not implementing exponential backoff. A naive retry loop on 429 makes things worse.
- Treating short bursts as average traffic. Limits are per-minute, but bursty traffic can briefly push 10× your average rate.
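The parallelism pitfall has a structural fix: give every worker one shared request budget instead of a per-worker one. A minimal sliding-window sketch for a single process; multi-process fleets usually build the same idea on a central store such as Redis:

```python
import threading
import time

class MinuteRateLimiter:
    """Allow at most `rpm` acquires per rolling 60-second window,
    shared by every thread in the process."""

    def __init__(self, rpm):
        self.rpm = rpm
        self.lock = threading.Lock()
        self.stamps = []  # acquire times inside the current window

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop acquires that have aged out of the window.
                self.stamps = [t for t in self.stamps if now - t < 60]
                if len(self.stamps) < self.rpm:
                    self.stamps.append(now)
                    return
                wait = 60 - (now - self.stamps[0])  # until the oldest expires
            time.sleep(wait)

limiter = MinuteRateLimiter(rpm=500)  # the aggregate budget, not per worker

def worker(payload):
    limiter.acquire()  # every worker draws from the same budget
    ...  # make the API call here
```

Here rpm=500 just matches the arithmetic in the pitfall above; set it to your actual aggregate limit.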
Related tools
- LLM Pricing Comparison
List-price reference for chat-model APIs across major providers, sortable and reviewed monthly.
- LLM Capability Matrix
Which frontier models support multimodal, vision, audio, JSON mode, and tool calling.
- Per-call LLM Cost Calculator
Pick a model, paste your prompt, set expected output length, see the per-call cost.
- Monthly LLM Cost Estimator
Plug in daily request volume and average token sizes to see monthly spend per model, side by side.