How rate limits work
Rate limits are typically enforced over a one-minute sliding window. When you exceed a limit, provider APIs return 429 Too Many Requests with a retry-after header indicating how long to wait. Two limits usually apply concurrently: requests per minute (RPM) and tokens per minute (TPM); you're throttled by whichever you hit first.
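As a concrete illustration, here is a minimal sketch of honoring that header with a plain HTTP client. API_URL and the payload shape are hypothetical placeholders, not any specific provider's API:

```python
import time

import requests

API_URL = "https://api.example.com/v1/chat"  # hypothetical endpoint, not a real provider URL

def call_with_retry_after(payload, max_attempts=5):
    """POST once per attempt, honoring the retry-after header on 429s."""
    for attempt in range(max_attempts):
        resp = requests.post(API_URL, json=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Providers generally send retry-after as a number of seconds;
        # fall back to a short pause if the header is missing.
        time.sleep(float(resp.headers.get("retry-after", 1)))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```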
New accounts and newly added models start with conservative default limits, which scale up with cumulative usage and account age. For predictable production loads, file a limit-increase request before launch; responses usually arrive within a day.
FAQ
- Why are limits structured as both RPM and TPM?
- Either can throttle you. RPM caps the number of calls per minute regardless of size; TPM caps the total tokens flowing through. A workload with many small calls hits RPM first; one with few large calls hits TPM first. For example, at 500 RPM and 100,000 TPM, requests averaging 150 tokens hit the RPM cap first (500 × 150 = 75,000 TPM), while requests averaging 4,000 tokens hit the TPM cap at just 25 requests.
- How do I get a tier upgrade?
- OpenAI and Anthropic both upgrade tiers automatically based on accumulated spend and account age, and usually publish the tier ladder in the account dashboard. For larger jumps, contact sales.
- What happens if I exceed the limit?
- You get a 429 response. Most SDKs retry with exponential backoff; if yours doesn't, a sketch follows this FAQ. Persistent breaches risk temporary suspension; the dashboard surfaces breach counts.
- Why are the numbers in this tool so different from what I see in my dashboard?
- These are public defaults. Your actual limits depend on tier, region, and any custom limits the provider has granted; always verify against your own dashboard.
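If your SDK doesn't retry for you, the standard fix is capped exponential backoff with jitter. A minimal sketch; RateLimitError here is a stand-in for whatever exception your client raises on a 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception type."""

def backoff_retry(call, max_attempts=6, base=1.0, cap=60.0):
    """Retry `call` on rate-limit errors with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; let the caller handle it
            # Full jitter: sleep a random amount up to the capped exponential
            # delay, so many clients don't retry in lockstep and re-collide.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter is the point: if every client slept exactly 1, 2, 4, 8 seconds, they would all return at the same instant and trigger the next 429 together.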
Common pitfalls
- Sizing infrastructure for a startup-tier limit when you should be at a higher tier already (check the dashboard).
- Forgetting parallelism: 10 workers each making 50 RPM is 500 RPM in aggregate. A shared limiter helps (see the sketch after this list).
- Not implementing exponential backoff. A naive retry loop on 429 makes things worse.
- Treating short bursts as average traffic. Limits are per-minute, but bursty traffic can briefly push 10× your average rate.
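The parallelism pitfall has a structural fix: give every worker one shared request budget instead of a per-worker one. A minimal sliding-window sketch for a single process; multi-process fleets usually build the same idea on a central store such as Redis:

```python
import threading
import time

class MinuteRateLimiter:
    """Allow at most `rpm` acquires per rolling 60-second window,
    shared by every thread in the process."""

    def __init__(self, rpm):
        self.rpm = rpm
        self.lock = threading.Lock()
        self.stamps = []  # acquire times inside the current window

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop acquires that have aged out of the window.
                self.stamps = [t for t in self.stamps if now - t < 60]
                if len(self.stamps) < self.rpm:
                    self.stamps.append(now)
                    return
                wait = 60 - (now - self.stamps[0])  # until the oldest expires
            time.sleep(wait)

limiter = MinuteRateLimiter(rpm=500)  # the aggregate budget, not per worker

def worker(payload):
    limiter.acquire()  # every worker draws from the same budget
    ...  # make the API call here
```

Here rpm=500 just matches the arithmetic in the pitfall above; set it to your actual aggregate limit.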
Related tools
- LLM Pricing Comparison
List-price reference for chat-model APIs across major providers, sortable and reviewed monthly.
- LLM Capability Matrix
Which frontier models support multimodal, vision, audio, JSON mode, and tool calling.
- Per-call LLM Cost Calculator
Pick a model, paste your prompt, set expected output length, see the per-call cost.
- Monthly LLM Cost Estimator
Plug in daily request volume and average token sizes to see monthly spend per model, side by side.