LLM API Rate Limiting Best Practices: Avoid 429 Errors and Save 40% on Costs
Every production LLM application eventually hits the same wall: a `429 Too Many Requests` error at 3 AM, right when traffic spikes. Whether you're calling Claude, GPT-4, or running multi-agent workflows, rate limiting isn't optional — it's the difference between a reliable product and a flaky one. This guide covers the patterns that actually hold up under load, with code you can ship today.
Why LLM Rate Limits Are Different
Traditional API rate limits count requests per second. LLM providers add a second dimension: tokens per minute (TPM). Anthropic's rate limit documentation shows tier-based quotas where Tier 1 caps you at 50 requests/minute and 40,000 input tokens/minute for Claude Opus. OpenAI's usage tier system works similarly.
The asymmetry matters. A single agent loop calling Claude with a 50K-token context exceeds a 40K TPM budget on its very first call, even though it has used only 1 of your 50 allowed requests per minute. Most teams blow through limits not because they send too many requests, but because their context windows balloon over time.
A real example from production: an agent built with LangGraph accumulates conversation history across iterations. By turn 8, each request carries 80K tokens. Three concurrent users → instant 429. The fix isn't "more retries" — it's measuring tokens before you send.
Cost dimension you're probably ignoring
Claude Sonnet 4.6 costs $0.003 per 1K input tokens and $0.015 per 1K output tokens. If your retry logic naively replays a failed 80K-token request three times, you've burned $0.72 in input tokens to deliver one response. Multiply by 10,000 daily users and rate limit handling becomes a six-figure cost decision. Aggressive retries without circuit breakers are the most expensive bug in modern LLM apps.
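The retry arithmetic is easy to check in a few lines. This is a minimal sketch that uses the per-1K input price quoted in this section and counts input tokens only; the helper name `retry_input_cost` is made up for illustration.

```python
def retry_input_cost(input_tokens: int, sends: int, usd_per_1k_input: float) -> float:
    """Input-token spend for sending the same request `sends` times."""
    return input_tokens / 1000 * usd_per_1k_input * sends


# An 80K-token request sent three times at $0.003 per 1K input tokens.
cost = retry_input_cost(80_000, sends=3, usd_per_1k_input=0.003)
print(f"${cost:.2f}")
```

Plugging in your own average context size and retry count shows quickly whether retries are a rounding error or a line item.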
Pattern 1: Token Bucket With Dual Counters
Forget the textbook token bucket — you need two buckets running in parallel: one for requests, one for tokens. Here's a minimal implementation in Python:
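```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class DualBucket:
    rpm_limit: int
    tpm_limit: int
    request_tokens: float  # remaining request budget
    token_tokens: float    # remaining token budget
    last_refill: float     # time.monotonic() at the last refill

    async def acquire(self, estimated_tokens: int):
        # A request larger than the whole token bucket could never be admitted.
        if estimated_tokens > self.tpm_limit:
            raise ValueError("estimated_tokens exceeds tpm_limit")
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            # Refill both buckets in proportion to elapsed time, up to their caps.
            self.request_tokens = min(
                self.rpm_limit,
                self.request_tokens + elapsed * (self.rpm_limit / 60)
            )
            self.token_tokens = min(
                self.tpm_limit,
                self.token_tokens + elapsed * (self.tpm_limit / 60)
            )
            self.last_refill = now
            # Admit only when both the request and the token budget allow it.
            if self.request_tokens >= 1 and self.token_tokens >= estimated_tokens:
                self.request_tokens -= 1
                self.token_tokens -= estimated_tokens
                return
            await asyncio.sleep(0.1)
```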
The trick: estimate tokens before calling the API, using the Anthropic SDK's `client.messages.count_tokens()` or tiktoken for OpenAI. Pre-flight measurement prevents the worst failure mode, where you queue 500 requests, all fit RPM, none fit TPM, and your queue deadlocks.
Pattern 2: Exponential Backoff Done Right
Most "exponential backoff" implementations are wrong because they ignore the `retry-after` header. Anthropic and OpenAI both return this header on 429 responses with the exact wait time. Use it.
```javascript
async function callWithBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Only rate-limit (429) and overloaded (529) responses are retried here.
      if (err.status !== 429 && err.status !== 529) throw err;
      const retryAfter = parseInt(err.headers?.['retry-after'], 10) || null;
      const jitter = Math.random() * 1000;
      // Prefer the server's hint; otherwise back off exponentially, capped at 30s.
      const wait = retryAfter
        ? retryAfter * 1000 + jitter
        : Math.min(2 ** attempt * 1000 + jitter, 30000);
      console.warn(`Rate limited. Retrying in ${Math.round(wait)}ms`);
      await new Promise(r => setTimeout(r, wait));
    }
  }
  throw new Error('Max retries exceeded');
}
```
Three non-obvious things this gets right:
1. Handles 529 (overloaded) — Anthropic returns this during capacity events; treat it like 429.
2. Adds jitter — without it, all your servers retry at exactly the same millisecond and recreate the spike.
3. Caps wait at 30s — beyond that, fail fast and let the user retry manually.
The official Anthropic SDK does this internally with `max_retries`, but if you're calling via raw `fetch` or curl, you need it explicit.
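For Python services, the delay calculation can be isolated as a pure function, which makes it easy to unit-test. This is a sketch, not the SDK's internal logic: `backoff_delay` and its injectable `rng` parameter are hypothetical names, and the 30-second cap applies only to the computed backoff, mirroring the JavaScript version.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    jitter = rng()  # uniform in [0, 1) seconds
    if retry_after is not None:
        # The server told us how long to wait; add jitter and honor it.
        return retry_after + jitter
    # No hint: 1s, 2s, 4s, ... plus jitter, never more than `cap`.
    return min(2 ** attempt + jitter, cap)


# With jitter disabled, the raw schedule is visible.
print([backoff_delay(a, rng=lambda: 0.0) for a in range(6)])
```

Injecting `rng` keeps production behavior random while letting tests pin the jitter to a known value.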
Pattern 3: Request Coalescing for Agents
If you're running multi-agent systems (think: 5 Claude agents each making decisions in parallel), naive parallelism guarantees rate limit hell. Coalesce identical requests:
```python
import asyncio
import hashlib


class CoalescingClient:
    def __init__(self, client):
        self.client = client
        self.in_flight = {}  # request key -> Task for the call in progress

    async def complete(self, messages, **kwargs):
        # Fingerprint the request so identical calls map to the same key.
        key = hashlib.md5(
            str(messages).encode() + str(kwargs).encode()
        ).hexdigest()
        if key in self.in_flight:
            # An identical request is already running: share its result.
            return await self.in_flight[key]
        future = asyncio.create_task(
            self.client.messages.create(messages=messages, **kwargs)
        )
        self.in_flight[key] = future
        try:
            return await future
        finally:
            # Always clear the entry so later calls hit the API again.
            self.in_flight.pop(key, None)
```
In production agent swarms, we've seen this cut request volume by 35-60% because agents frequently make redundant context-checking calls. Combine with Anthropic's prompt caching for an additional 90% cost reduction on cached portions.
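Coalescing only works if identical requests produce identical keys. One thing worth hedging against: a key built from `str(kwargs)` depends on keyword order, so two agents passing the same parameters in a different order would miss each other. A possible fix, sketched here with a hypothetical helper named `request_key`, is to serialize to canonical JSON first.

```python
import hashlib
import json


def request_key(messages: list, **kwargs) -> str:
    """Stable fingerprint for a request, independent of keyword order."""
    canonical = json.dumps(
        {"messages": messages, "params": kwargs},
        sort_keys=True,   # order-insensitive for dict keys
        default=str,      # fall back to str() for values JSON cannot encode
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


msgs = [{"role": "user", "content": "ping"}]
a = request_key(msgs, model="m", max_tokens=16)
b = request_key(msgs, max_tokens=16, model="m")
print(a == b)
```

Any parameter that changes the response (model, temperature, tools) should stay in the key; leaving one out would hand a caller someone else's answer.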
Pattern 4: Graceful Degradation Across Models
When you're rate-limited on Claude Opus, falling back to Sonnet beats failing. Build a model ladder:
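```python
from anthropic import AsyncAnthropic, RateLimitError

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ordered from most to least capable; later entries are fallbacks.
MODEL_LADDER = [
    "claude-opus-4-7",
    "claude-sonnet-4-6",
    "claude-haiku-4-5"
]


async def complete_with_fallback(messages):
    for model in MODEL_LADDER:
        try:
            return await client.messages.create(
                model=model, messages=messages, max_tokens=1024
            )
        except RateLimitError:
            continue  # this rung is rate-limited; try the next model
    raise Exception("All models rate-limited")
```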
Haiku is ~12x cheaper than Opus and has separate quota pools. For non-critical paths (intent classification, routing, summarization), graceful degradation costs you almost nothing in quality and saves you from total outage.
Monitoring: You Can't Fix What You Don't Measure
Every pattern above assumes you have visibility into what's happening. Without per-request tracking of latency, token usage, retry counts, and 429 frequency, you're guessing. This is exactly where observability matters — and it's why we built ClawPulse to give LLM teams real-time dashboards on rate limit headroom, retry storms, and per-model cost burn.
The alternatives all have tradeoffs:
- Langfuse — strong on tracing, weaker on rate-limit-specific alerting.
- Helicone — proxy-based, adds ~30ms latency per request, great for cost tracking.
- ClawPulse — agent-first, async ingestion (zero added latency), built-in 429 anomaly detection.
If you're hitting rate limits regularly, the first signal you need is time-to-quota-exhaustion — how many seconds until your current burn rate trips a 429. See our breakdown of LLM observability metrics that matter for the full list.
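Time-to-quota-exhaustion is a one-line division once you track remaining budget and burn rate. A minimal sketch, assuming you already collect both numbers; the function name and the sample figures are illustrative only.

```python
def seconds_to_exhaustion(tokens_remaining: int, tokens_per_second: float) -> float:
    """Seconds until the token budget hits zero at the current burn rate."""
    if tokens_per_second <= 0:
        return float("inf")  # nothing is being consumed right now
    return tokens_remaining / tokens_per_second


# 12,000 tokens left in the window, burning 800 tokens per second.
print(seconds_to_exhaustion(12_000, 800.0))
```

Alerting when this value drops below a threshold gives you warning before the first 429 rather than after it.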
Anti-Patterns to Avoid
- Retrying on 400 errors. A 400 is a malformed request — retrying just wastes quota. Only retry 429, 500, 502, 503, 504, 529.
- Synchronous retry loops in serverless. A Lambda waiting 30s for a retry burns 30s of compute billing. Use queues (SQS, BullMQ) instead.
- Sharing one API key across all environments. Dev/staging traffic eats prod's quota. Each environment should have its own key.
- Ignoring streaming. Streaming responses count differently against TPM and let you cancel mid-stream if the user navigates away. Use it.
Putting It All Together
A production-grade LLM client should layer all four patterns: pre-flight token estimation → dual bucket → coalescing → fallback ladder → exponential backoff with jitter. That's 4-5 hours of work that will save you from 90% of rate limit incidents.
For the remaining 10% (provider outages, surprise quota changes, viral traffic), you need monitoring that pages you before users notice. Try ClawPulse on a free demo to see how rate-limit-aware observability looks in practice — we'll show you your real burn rate and headroom in under 5 minutes.
FAQ
What's the difference between RPM and TPM rate limits?
RPM (requests per minute) caps how many API calls you can make. TPM (tokens per minute) caps the total token volume across those calls. You can be well under your RPM but blow through TPM with a few large-context requests, especially with agents that accumulate conversation history.
Should I use the official SDK's built-in retries or write my own?
Use the SDK's retries for simple cases (`max_retries=2` is the default for Anthropic's SDK). Write your own when you need cross-model fallback, request coalescing, or custom logging, because the SDK retries are opaque and don't integrate with your monitoring.
How do I estimate tokens before sending a request?
Use the Anthropic SDK's `client.messages.count_tokens()` for Claude or tiktoken (`cl100k_base` encoding) for GPT-4. For multi-turn conversations, count the entire message history, not just the new user message. Add ~10% buffer for system prompt overhead and tool schemas.
Does prompt caching help with rate limits?
Yes — cached input tokens count at a reduced rate against TPM (often 10x less). Combining caching with request coalescing can cut effective token usage by 70-80% for agent workflows that re-read the same documents repeatedly. See Anthropic's caching docs for the full pricing breakdown.
When should I escalate to a higher API tier?
When your p95 wait time from rate-limit backoff exceeds 2 seconds, or when 429s account for >0.5% of your requests over a 24h window. Below those thresholds, optimization (caching, coalescing, model ladder) is cheaper than upgrading. Above them, the tier upgrade pays for itself in conversion lift from faster responses. Check your current burn rate on the ClawPulse pricing page to see which tier matches your traffic profile.
Pattern 5: Distributed Rate Limiting Across a Fleet
Single-process buckets break the moment you scale to multiple workers. Each replica thinks it has full TPM headroom, and your aggregate request rate triples while every individual instance reports "well under quota." The only durable fix is a shared counter in Redis with atomic Lua scripts so increment + check happens in a single round trip:
```python
import time

import redis.asyncio as redis

# Check both counters and increment them atomically inside Redis.
LUA_ACQUIRE = """
local key_req = KEYS[1]
local key_tok = KEYS[2]
local rpm = tonumber(ARGV[1])
local tpm = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local req = tonumber(redis.call('GET', key_req) or '0')
local tok = tonumber(redis.call('GET', key_tok) or '0')
if req >= rpm or tok + cost > tpm then
  return {0, req, tok}
end
redis.call('INCR', key_req)
redis.call('EXPIRE', key_req, 60)
redis.call('INCRBY', key_tok, cost)
redis.call('EXPIRE', key_tok, 60)
return {1, req + 1, tok + cost}
"""


class FleetLimiter:
    def __init__(self, r, rpm, tpm):
        self.r = r
        self.rpm = rpm
        self.tpm = tpm
        self.script = r.register_script(LUA_ACQUIRE)

    async def acquire(self, model: str, est_tokens: int):
        # One pair of counters per model per wall-clock minute.
        bucket = int(time.time() // 60)
        ok, req, tok = await self.script(
            keys=[f"rl:{model}:r:{bucket}", f"rl:{model}:t:{bucket}"],
            args=[self.rpm, self.tpm, est_tokens, time.time()]
        )
        return bool(ok), req, tok
```
Two non-obvious gotchas teams hit in production:
1. Clock skew between workers. If one container's clock is 2s ahead, its bucket transitions early and you'll see phantom 429 spikes at minute boundaries. Always derive the bucket key from a single source of truth — either Redis time (`TIME` command) or a centralized clock service.
2. Hot-key contention on Redis Cluster. A single rate-limit key for a popular model becomes a hot shard. Shard by tenant or model+region: `rl:claude-sonnet:eu-west:r:{bucket}`.
For Kubernetes deployments, Envoy's global rate limit filter and Istio rate limiting both implement this pattern at the proxy layer, so app code stays clean. If you're on AWS, API Gateway usage plans can act as a coarse first line of defense before requests even reach your service.
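The two gotchas above both come down to how the bucket key is built. A sketch of a key builder that takes the timestamp as an argument, so every worker can pass in the same shared clock instead of its own; `bucket_key` and its argument names are hypothetical, and the key layout follows the `rl:claude-sonnet:eu-west:r:{bucket}` example.

```python
def bucket_key(model: str, region: str, kind: str, epoch_seconds: int) -> str:
    """Rate-limit key for the one-minute window containing `epoch_seconds`."""
    window = epoch_seconds // 60
    return f"rl:{model}:{region}:{kind}:{window}"


# With redis-py, the shared clock could come from `seconds, _ = await r.time()`.
print(bucket_key("claude-sonnet", "eu-west", "r", 1_700_000_045))
```

Because the function never reads the local clock, two workers with skewed clocks still agree on the window as long as they read the time from the same place.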
Provider-Specific Rate Limit Cheat Sheet
Different providers, different gotchas. The numbers below are starting tiers as of 2026 — your actual quotas depend on your account history and spend:
| Provider | Headers Returned | Burst Tolerance | Notable Quirk |
|----------|------------------|-----------------|---------------|
| Anthropic Claude | `retry-after`, `anthropic-ratelimit-*` | ~10s burst over RPM | Tier 1: 50 RPM Opus, 40K input TPM. 529 = overload, retry. |
| OpenAI | `retry-after-ms`, `x-ratelimit-*` | ~5s rolling window | Embeddings have separate quota — `text-embedding-3-large` 429s independently. |
| Azure OpenAI | `retry-after-ms` | Per-deployment | TPM applies to deployment not subscription — over-deploying = self-DoS. |
| Databricks Foundation Models | `retry-after` | Per-workspace | Pay-per-token endpoints share a workspace-wide TPM pool across all models. |
| Google Vertex AI | `retry-info` (gRPC) | Quota varies by region | Quotas reset on minute boundaries, not rolling windows. |
| AWS Bedrock | `Retry-After` | Per-account-per-region | InvokeModel and InvokeModelWithResponseStream count separately. |
Three subtle things this table hides:
- Azure OpenAI's per-deployment TPM trips most teams scaling out. If you create three deployments of GPT-4o each at 30K TPM thinking you'll get 90K, you'll discover quota lives at the deployment level and at the subscription level — and the subscription cap can be lower than the sum.
- OpenAI embeddings 429 is a silent killer for RAG pipelines. Your chat endpoint shows green; ingestion silently grinds to a halt. Track embeddings TPM as a separate signal.
- Bedrock streaming counts twice. The same model called via `InvokeModelWithResponseStream` consumes a different quota bucket than `InvokeModel` — split your retries by call type.
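Since the table shows providers disagreeing on header names and units, it can help to normalize the retry hint in one place. A minimal sketch, assuming the headers arrive as a plain string mapping; `retry_delay_seconds` is an illustrative name, and only the two HTTP header variants from the table are handled.

```python
from typing import Mapping, Optional


def retry_delay_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Read the server's retry hint, preferring millisecond precision."""
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        if "retry-after-ms" in lowered:
            return float(lowered["retry-after-ms"]) / 1000.0
        if "retry-after" in lowered:
            return float(lowered["retry-after"])
    except ValueError:
        pass  # e.g. an HTTP-date value; let the caller fall back to backoff
    return None


print(retry_delay_seconds({"Retry-After": "3"}))
print(retry_delay_seconds({"retry-after-ms": "1500", "retry-after": "2"}))
```

Returning `None` rather than a default keeps the decision about fallback backoff with the caller.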
LangChain & LangGraph: Where Rate Limits Bite Hardest
If you're using LangChain or LangGraph for agentic workflows, you have a specific problem: every tool-call iteration adds 1-2 LLM round-trips, and the agent's memory grows monotonically. A 5-step ReAct loop sending its full scratchpad on every step easily 5x's your token cost compared to a flat single-shot prompt.
The fixes that actually work in production:
1. Truncate tool outputs aggressively before they enter the agent's context. Anything > 2K tokens gets summarized by Haiku before Opus sees it. Saves 60-80% of tokens with negligible quality loss for most tasks.
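2. Use LangChain's `InMemoryRateLimiter` (from `langchain_core.rate_limiters`) with the rate set 20% below your provider tier. It throttles requests rather than tokens, so pair it with a token budget of your own. Hitting your own soft limit triggers backoff before the provider's 429.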
3. Stream and short-circuit. If the agent's first 200 tokens of a response indicate failure, cancel the stream — you save the rest of the output cost and free your TPM budget for a retry.
4. Pin context window size. Don't let history grow unbounded; cap at N most-recent turns + a running summary.
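The last fix, capping history at the N most recent turns plus a running summary, can be sketched independently of any framework. This assumes messages are plain `{"role", "content"}` dicts and that the summary text is produced elsewhere (for example by a cheaper model); `trim_history` is a hypothetical helper, not a LangChain API.

```python
def trim_history(messages: list[dict], max_turns: int, summary: str = "") -> list[dict]:
    """Keep the last `max_turns` messages, prefixed by a summary of the rest."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    if len(messages) <= max_turns:
        return list(messages)  # nothing to drop
    recent = messages[-max_turns:]
    if not summary:
        return recent
    note = {"role": "user", "content": f"Summary of earlier turns: {summary}"}
    return [note] + recent


history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
trimmed = trim_history(history, max_turns=4, summary="user asked about quotas")
print(len(trimmed), trimmed[0]["content"])
```

The request size now has an upper bound that depends on `max_turns` and the summary length instead of on how long the agent has been running.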
Our deeper guide on monitoring LangChain agents in production covers the full instrumentation stack we recommend, including how to track per-node token cost in LangGraph state machines.
Case Study: A Real 429 Storm
Here's an anonymized incident from a customer running ~50 production agents. At 11:47 UTC on a Tuesday, deployment of a new prompt template caused average input tokens to jump from 6K to 14K. Within 4 minutes:
- 429 rate climbed from baseline 0.02% to 18% of all requests
- Naive retry logic 3x'd the effective request volume
- Latency p95 went from 1.2s to 47s as backoff stacked
- Incident cost: roughly $2,800 in wasted API spend over 22 minutes before the deploy was rolled back
Three signals would have caught this in under 60 seconds:
1. Token-per-request anomaly detection (a 2x jump in average input tokens is almost always a deploy regression)
2. Retry storm alert — when retries-per-success exceeds 0.5, something is structurally broken, not just transiently slow
3. Quota headroom dashboard showing time-to-exhaustion crossing 30s
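The retry-storm signal is simple enough to compute inline from two counters. A sketch under the assumption that you count retries and successes over the same window; the function name is illustrative, and the 0.5 default is the threshold from the list above.

```python
def is_retry_storm(retries: int, successes: int, threshold: float = 0.5) -> bool:
    """True when retries per successful request exceed `threshold`."""
    if successes == 0:
        # No successes at all: any retry activity counts as a storm.
        return retries > 0
    return retries / successes > threshold


print(is_retry_storm(retries=30, successes=100))
print(is_retry_storm(retries=80, successes=100))
```

In the incident above, retry logic tripling the request volume would put this ratio far above the threshold within the first minute.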
This is exactly the alerting ClawPulse ships out of the box for AI agent fleets — see how we compare against tracing-only platforms in our ClawPulse vs Braintrust breakdown or the LangChain vs CrewAI vs AutoGPT comparison if you're still picking your agent framework.
Production Rollout Checklist
Before you ship anything that calls an LLM API in a hot path, walk this list. Skip a step at your peril:
- [ ] Pre-flight token estimation on every request (no blind sends)
- [ ] Dual bucket (RPM + TPM) limiter, distributed if multi-replica
- [ ] Exponential backoff with `retry-after` honored and capped at 30s
- [ ] Jitter on every retry (uniform random, up to 1000 ms)
- [ ] Model-fallback ladder for non-critical paths (Opus → Sonnet → Haiku)
- [ ] Request coalescing for read-heavy agent workflows
- [ ] Streaming for any response > 500 tokens to free TPM faster
- [ ] Separate API keys per environment (dev never eats prod quota)
- [ ] Embeddings tracked as an independent quota signal (RAG-specific)
- [ ] Anomaly alerts on tokens-per-request and retries-per-success
- [ ] Time-to-quota-exhaustion dashboard visible to oncall
- [ ] Per-model burn rate tracking with cost projections (see our OpenAI cost-per-token guide)
- [ ] Runbook documenting which models to disable first under sustained 429 pressure
If you're not at 12+ checkboxes, your first incident will write the rest of the list for you — usually expensively.
Where to Go Next
This guide pairs with our broader work on monitoring AI agent costs in 2026, the practical guide to monitoring OpenAI usage, and our OpenClaw observability platform overview. Together they form the playbook we wish someone had handed us before our first 3 AM page.
When you're ready to stop guessing about your rate-limit headroom, book a 5-minute ClawPulse demo — we'll plug your fleet in and show you which agent is closest to its 429 wall right now.
Extended FAQ
Why am I getting 429s when my dashboard says I'm only at 30% of RPM?
Because RPM is half the story. You're almost certainly hitting the TPM (tokens-per-minute) ceiling. Pre-flight your token estimates and watch both counters in tandem. For Anthropic, the response headers `anthropic-ratelimit-tokens-remaining` and `anthropic-ratelimit-requests-remaining` give you both signals — log them on every successful response.
How do Anthropic's rate limit tiers work?
Anthropic uses spend-based usage tiers: Tier 1 (no minimum) gets the lowest RPM/TPM, and tiers escalate as you accumulate spend. Hitting Tier 4 typically requires sustained usage at Tier 3 plus a manual review. For high-traffic apps, file a quota increase request before you hit the wall — approval can take 3-5 business days.
What's the difference between 429 and 529 errors?
429 means you exceeded your quota; the fix is your rate limiter. 529 means the provider is overloaded across all customers; retry with backoff but expect intermittent failure. During major Anthropic incidents (the status page is the source of truth), 529s can persist for 30+ minutes. Have a fallback path.
Should I use a proxy like Helicone or LiteLLM for rate limiting?
Proxies (Helicone, LiteLLM, Portkey) make rate limiting easier to centralize but add 20-50ms latency per request and become a single point of failure. They're great for dev/staging and small teams; production fleets at scale typically want async ingestion (which is why ClawPulse adds zero latency by design: the agent reports out of band).
How does Azure OpenAI rate limiting differ from OpenAI direct?
Azure scopes TPM at the deployment level, not the account. You can have multiple deployments of the same model with independent quotas — useful for multi-tenant isolation, painful when you forget the per-deployment cap is what's actually binding. Always check the Azure OpenAI quotas docs before scaling out.
Are embeddings rate-limited separately?
Yes, on every major provider. OpenAI's `text-embedding-3-large` has its own RPM/TPM independent of chat models. Azure splits embeddings into a separate deployment. Anthropic doesn't offer embeddings (use Voyage AI instead), and Voyage has its own quota stack. RAG ingestion outages almost always come from this — instrument it independently.
Can I just buy my way out by upgrading tiers?
Sometimes. If your p95 backoff wait exceeds 2s and 429s account for >0.5% of requests over 24h, the math usually favors the upgrade. Below that threshold, optimization (caching, coalescing, model ladder) is materially cheaper. The honest answer is: instrument first, decide second. We wrote a pricing math walkthrough that helps frame the tradeoff.
Pattern 6: Adaptive Concurrency (AIMD) for Unknown Quotas
Static buckets work when you know your quota. In practice, you often don't — quotas drift between regions, Anthropic silently expands burst tolerance during off-peak hours, and Azure deployments inherit limits you never set. The robust answer is AIMD (Additive Increase, Multiplicative Decrease) — the same algorithm TCP uses to find available bandwidth without being told what it is.
```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class AIMDLimiter:
    concurrency: int = 4
    min_concurrency: int = 1
    max_concurrency: int = 64
    decrease_factor: float = 0.5
    increase_step: int = 1
    _cooldown: float = 5.0  # seconds between additive increases
    sem: asyncio.Semaphore = field(init=False)
    _last_change: float = field(default=0.0, init=False)
    _debt: int = field(default=0, init=False)  # permits to retire after a shrink

    def __post_init__(self):
        self.sem = asyncio.Semaphore(self.concurrency)

    def _retune(self, new_target: int):
        new_target = max(self.min_concurrency, min(self.max_concurrency, new_target))
        delta = new_target - self.concurrency
        if delta == 0:
            return
        if delta > 0:
            # Growing: cancel outstanding debt first, then add fresh permits.
            forgiven = min(self._debt, delta)
            self._debt -= forgiven
            for _ in range(delta - forgiven):
                self.sem.release()
        else:
            # Shrinking: retire permits as in-flight calls finish.
            self._debt += -delta
        self.concurrency = new_target
        self._last_change = time.monotonic()

    async def call(self, fn):
        await self.sem.acquire()
        try:
            result = await fn()
            # Success: additive increase, gated by the cooldown.
            if time.monotonic() - self._last_change > self._cooldown:
                self._retune(self.concurrency + self.increase_step)
            return result
        except Exception as e:
            # Some SDKs expose `status`, others `status_code`.
            status = getattr(e, "status", None) or getattr(e, "status_code", None)
            if status in (429, 529):
                self._retune(int(self.concurrency * self.decrease_factor))
            raise
        finally:
            if self._debt > 0:
                self._debt -= 1  # swallow this permit instead of returning it
            else:
                self.sem.release()
```
Why AIMD beats static buckets in messy environments:
- Discovers true headroom — it ramps up until the provider pushes back, so you can't be conservative-by-accident on a tier you've already paid for.
- Self-heals after quota cuts — if Anthropic temporarily reduces your effective TPM during a region incident, AIMD shrinks within seconds without a human noticing.
- Pairs naturally with token bucketing — use AIMD for concurrency control and the dual bucket for known RPM/TPM caps; the two layers compose without redundancy.
The non-obvious failure mode: AIMD oscillates if your `decrease_factor` is too aggressive. Start at 0.5 (halving on every 429); only raise it to 0.7-0.8, a gentler cut, if your traffic is bursty enough that halving causes throughput collapse. The Netflix concurrency-limits library has battle-tested implementations; port the Gradient2 algorithm if you want auto-tuning of the decrease factor itself.
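To see the sawtooth that AIMD produces, it helps to trace the concurrency target over a sequence of outcomes. This is a simplified sketch that ignores the cooldown and the semaphore entirely; `aimd_trace` is a hypothetical helper whose defaults match the limiter's settings in this section.

```python
def aimd_trace(outcomes, start=4, lo=1, hi=64, step=1, factor=0.5):
    """Concurrency target after each outcome (True = success, False = 429/529)."""
    level, trace = start, []
    for ok in outcomes:
        # Additive increase on success, multiplicative decrease on pushback.
        level = level + step if ok else int(level * factor)
        level = max(lo, min(hi, level))
        trace.append(level)
    return trace


print(aimd_trace([True, True, True, True, False, True, True]))
print(aimd_trace([True, True, True, True, False, True, True], factor=0.75))
```

Comparing the two traces shows the tradeoff: a factor of 0.5 gives up more throughput on each 429 but backs away from the limit faster than 0.75 does.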
Pattern 7: Multi-Tenant Fair Queueing
If your product fronts a shared API key pool — typical for B2B SaaS, agent platforms, or any LLM-as-a-feature stack — a single noisy customer can starve everyone else's TPM. Static per-tenant caps work but waste headroom; one tenant idling at 5% leaves 95% locked away from a tenant who needs it.
The pattern that holds up: weighted fair queueing on virtual time, where each tenant gets a fair share of the bucket but unused share spills to the next requester. Implementation sketch:
```python
import asyncio
import heapq
import itertools
import time
from collections import defaultdict


class FairLimiter:
    def __init__(self, weights: dict[str, float]):
        # weights: tenant -> share (e.g. {"acme": 1.0, "globex": 3.0})
        self.weights = weights
        self.virtual_time = defaultdict(float)
        self.queue = []  # heap of (vtime, seq, tenant, request)
        self._seq = itertools.count()  # tie-breaker so requests are never compared

    def submit(self, tenant: str, est_tokens: int, request):
        weight = self.weights.get(tenant, 1.0)
        # Cost in virtual time = tokens / weight. Heavier weight = cheaper.
        vtime = max(self.virtual_time[tenant], time.monotonic()) + (est_tokens / weight)
        self.virtual_time[tenant] = vtime
        heapq.heappush(self.queue, (vtime, next(self._seq), tenant, request))

    async def drain(self, bucket):
        # `bucket.acquire` is expected to return a tuple whose first item is truthy on success.
        while self.queue:
            entry = heapq.heappop(self.queue)
            _, _, tenant, request = entry
            ok, *_ = await bucket.acquire(request.estimated_tokens)
            if ok:
                yield tenant, request
            else:
                # No budget yet: put the request back and wait briefly.
                heapq.heappush(self.queue, entry)
                await asyncio.sleep(0.05)
```
Three tuning knobs that matter in production:
1. Per-tenant max in-flight cap. Without this, a single tenant submitting 10,000 requests at once will dominate the heap until their virtual time catches up to everyone else's. Cap individual tenant inflight at `total_concurrency / sqrt(num_active_tenants)` and queue the rest.
2. Decay of `virtual_time`. A tenant who hasn't sent traffic in 10 minutes should re-enter at parity, not be punished for last week's spike. After an idle gap, reset `virtual_time[tenant]` to the virtual time at the head of the queue (the smallest value among active tenants) rather than keeping its stale, inflated value.
3. Priority lanes. Carve 10-20% of the bucket for synchronous user-facing requests (chat) so background batch jobs from the same tenant can never starve the interactive path. This is the same idea as priority classes in an operating-system scheduler, applied to LLM calls.
For very high tenant counts (>500), consider HAProxy's stick tables at the edge — they implement leaky-bucket fairness in C with O(1) lookups and let your application code stay focused on business logic.
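The per-tenant in-flight cap from the first tuning knob is a one-line formula. A sketch, with the floor of one slot per tenant added as an assumption so that no tenant is locked out entirely; `tenant_inflight_cap` is an illustrative name.

```python
import math


def tenant_inflight_cap(total_concurrency: int, active_tenants: int) -> int:
    """Max concurrent requests per tenant: total / sqrt(active tenants)."""
    if active_tenants < 1:
        raise ValueError("active_tenants must be at least 1")
    cap = int(total_concurrency / math.sqrt(active_tenants))
    return max(1, cap)  # assumption: every tenant keeps at least one slot


for n in (1, 4, 16, 64):
    print(n, tenant_inflight_cap(64, n))
```

The square root is what makes the caps deliberately oversubscribed: with 16 tenants each capped at 16 of 64 slots, a few busy tenants can still use most of the pool while idle ones hold nothing.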
Pattern 8: Token Estimation Drift
Pre-flight token counting is essential, but `count_tokens()` is not free — and on multi-turn agent loops, the cost of miscounting compounds. The drift sources you'll hit:
- System prompt overhead. Every Claude request wraps the system prompt in framing that tokenizes differently than user content. Anthropic's token counting accounts for this; if you're using `tiktoken` against Claude content as a shortcut, you're under-counting by 3-8%.
- Tool schema bloat. Each tool definition occupies ~150-400 tokens depending on parameter complexity. A 12-tool agent can spend 4K tokens just listing its tools on every step. Cache tool-schema token counts at startup; they don't change at runtime.
- Message boundary tokens. OpenAI's chat completions API adds a few framing tokens per message (3-4 depending on the model, for the `<|im_start|>role\ncontent<|im_end|>` wrapper). Forgetting this on a 30-turn conversation under-counts by roughly 90-120 tokens: small alone, but lethal across 10K daily requests.
- Multimodal content. Images count as a function of resolution: Claude charges roughly `(width * height) / 750` tokens per image, and images with a long edge above 1568 px are downscaled first. By that formula a 1080p screenshot costs up to ~3K tokens; a 50-image dossier blows your TPM in one call.
Drift-resistant counter:
```python
# Per-tool token counts, measured once at startup (tool name -> tokens).
TOOL_TOKEN_CACHE: dict[str, int] = {}


def estimate_tokens(messages, tools=None, model="claude-sonnet-4-6"):
    # Rough fallback: ~4 characters per token, string content only.
    base = sum(
        len(m["content"]) // 4
        for m in messages
        if isinstance(m.get("content"), str)
    )
    boundary = 4 * len(messages)
    tool_overhead = 0
    if tools:
        # Cache this per-tool at startup; never recompute per-request.
        tool_overhead = sum(TOOL_TOKEN_CACHE.get(t["name"], 200) for t in tools)
    image_tokens = 0
    for m in messages:
        for part in (m.get("content_parts") or []):
            if part.get("type") == "image":
                w, h = part["width"], part["height"]
                image_tokens += (w * h) // 750
    # 12% safety margin absorbs system overhead + tokenizer drift.
    return int((base + boundary + tool_overhead + image_tokens) * 1.12)
```
The 12% margin is empirical: in production tracing across ~20M requests, the gap between this estimator and true post-call token counts is median 0.4%, p95 9.7%, p99 14.1%. A 12% buffer keeps p95 inside the bucket without over-throttling the median case; covering p99 would take a margin above 14%. Recalibrate quarterly; tokenizers change with model upgrades and the constants drift.
Pattern 9: Telemetry Schema for 429 Events
You can only adapt to rate limits if you can see them. Most teams log "429 occurred" and stop there — that's table-scan-only data. The schema below gives you SLO-grade observability with Prometheus + a SQL retention store.
OpenTelemetry span attributes (per LLM call):
| Attribute | Type | Why it matters |
|-----------|------|----------------|
| `gen_ai.request.model` | string | Slice 429 rate by model; Opus and Haiku have different headroom. |
| `gen_ai.request.tokens.input` | int | Plot vs `*.tokens.remaining` to see drift. |
| `gen_ai.response.status_code` | int | 200 / 429 / 529 / 5xx — separate dashboards. |
| `gen_ai.response.retry_attempt` | int | Counts each retry as its own span; required for retry-storm detection. |
| `gen_ai.ratelimit.tokens.remaining` | int | From `anthropic-ratelimit-tokens-remaining` header. |
| `gen_ai.ratelimit.requests.remaining` | int | From `anthropic-ratelimit-requests-remaining` header. |
| `gen_ai.ratelimit.reset_seconds` | float | Time-to-reset; basis for `time_to_exhaustion` alerts. |
| `tenant.id` | string | Per-tenant slicing for SaaS. |
| `route.name` | string | E.g. `chat` vs `summarize` vs `classify` — distinct SLOs. |
Prometheus metric set:
```
# Counters
llm_requests_total{model,status,tenant,route}
llm_retries_total{model,reason="429|529|5xx",tenant}
# Gauges
llm_ratelimit_tokens_remaining{model,tenant}
llm_ratelimit_requests_remaining{model,tenant}
llm_concurrency_in_flight{model}
# Histograms
llm_request_duration_seconds{model,status}
llm_request_input_tokens{model,route}
```
Derived alerts (PromQL):
```
# Retry storm: more than one retry per two successes over 5m
sum by (model) (rate(llm_retries_total[5m]))
  / sum by (model) (rate(llm_requests_total{status="200"}[5m])) > 0.5
# Time-to-exhaustion < 30s (remaining tokens / input tokens consumed per second)
min by (model) (llm_ratelimit_tokens_remaining)
  / sum by (model) (rate(llm_request_input_tokens_sum[1m])) < 30
# 429 burst detection (z-score)
(
  sum by (tenant) (rate(llm_requests_total{status="429"}[5m]))
  - avg_over_time((sum by (tenant) (rate(llm_requests_total{status="429"}[5m])))[1h:5m])
) /
stddev_over_time((sum by (tenant) (rate(llm_requests_total{status="429"}[5m])))[1h:5m]) > 3
```
Long-term retention: Prometheus is wrong for >7d historical analysis. Drop high-cardinality events into a columnar store or a partitioned SQL table. Minimal DDL (MySQL syntax):
```sql
CREATE TABLE llm_request_log (
  id BIGINT AUTO_INCREMENT,
  ts TIMESTAMP(3) NOT NULL,
  tenant_id VARCHAR(64) NOT NULL,
  route VARCHAR(64) NOT NULL,
  model VARCHAR(64) NOT NULL,
  status SMALLINT NOT NULL,
  retry_attempt SMALLINT NOT NULL DEFAULT 0,
  input_tokens INT NOT NULL,
  output_tokens INT NOT NULL,
  duration_ms INT NOT NULL,
  ratelimit_tokens_remaining INT NULL,
  ratelimit_reset_seconds FLOAT NULL,
  request_id VARCHAR(128) NULL,
  -- MySQL requires the partitioning column in every unique key.
  PRIMARY KEY (id, ts),
  INDEX idx_tenant_ts (tenant_id, ts DESC),
  INDEX idx_status_ts (status, ts DESC),
  INDEX idx_model_ts (model, ts DESC)
) PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
  PARTITION p_2026_05 VALUES LESS THAN (UNIX_TIMESTAMP('2026-06-01')),
  PARTITION p_2026_06 VALUES LESS THAN (UNIX_TIMESTAMP('2026-07-01'))
);
```
Partition by month, drop after 90 days unless you need quarterly capacity-planning queries. ClawPulse agent telemetry wires this schema in for you out of the box, with retention and aggregation policies you don't have to write yourself.
Pattern 10: Prompt Cache as Throttle Relief
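Anthropic's prompt caching is usually pitched as a cost optimization, but its second-order effect on rate limits is bigger. Cache reads are billed at 10% of the base input price, and whether they count against input TPM depends on the model (for many current models they are excluded entirely), so check the rate-limit documentation for yours. The arithmetic below conservatively assumes a 10% weight: caching a 30K-token system prompt across 100 calls turns 3M tokens of input into the equivalent of 300K. You just gave yourself a 10x effective TPM increase on that traffic.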
The math is worth working through. Suppose you have:
- A 25K-token system prompt (knowledge base or extensive instructions)
- 5K-token user inputs averaged across calls
- 1K-token responses
- Tier 2 Anthropic limits: 100K input TPM, 80 RPM
Without caching:
- Per-call input cost: 30K tokens
- TPM ceiling: 100K / 30K = ~3.3 calls/min
- You're TPM-bound at well below your RPM ceiling.
With caching:
- Per-call effective input: 25K * 0.1 + 5K = 7.5K tokens
- TPM ceiling: 100K / 7.5K = ~13.3 calls/min
- You're still TPM-bound (13.3 calls/min is well under the 80 RPM ceiling), but that is a 4x throughput gain on the same tier.
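The arithmetic above is worth keeping as a small script so you can plug in your own prompt sizes. This is a sketch under the same assumptions as the example: the 10% cache weight and the 100K TPM / 80 RPM limits are the illustrative figures from this section, not values read from any API.

```python
def calls_per_minute(tpm_limit: int, prefix_tokens: int, user_tokens: int,
                     cache_weight: float = 1.0) -> float:
    """Calls per minute allowed by the TPM limit alone.

    cache_weight is the fraction of the cached prefix that counts toward
    TPM (1.0 = no caching, 0.1 = the assumption used in this section).
    """
    effective_input = prefix_tokens * cache_weight + user_tokens
    return tpm_limit / effective_input


def binding_limit(tpm_bound_calls: float, rpm_limit: int) -> str:
    """Which limit you hit first at this call rate."""
    return "TPM" if tpm_bound_calls < rpm_limit else "RPM"


TPM_LIMIT, RPM_LIMIT = 100_000, 80
uncached = calls_per_minute(TPM_LIMIT, 25_000, 5_000)
cached = calls_per_minute(TPM_LIMIT, 25_000, 5_000, cache_weight=0.1)

print(f"uncached: {uncached:.1f} calls/min ({binding_limit(uncached, RPM_LIMIT)}-bound)")
print(f"cached:   {cached:.1f} calls/min ({binding_limit(cached, RPM_LIMIT)}-bound)")
print(f"gain:     {cached / uncached:.1f}x")
```

With these inputs both configurations come out TPM-bound, at about 3.3 and 13.3 calls/min. The prefix would have to shrink much further, or the cache weight drop to zero, before RPM becomes the binding limit.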
Caching has a 5-minute TTL by default (1-hour optional with extra cost). Two patterns to maximize the relief:
1. Pin warming requests every 4 minutes — a tiny noop call against the cache key keeps the cache hot without paying the full cache-write cost on each user request.
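2. Cache-key-aware request construction: the cache lives on the provider side and is keyed on the exact prompt prefix, so if you're running multiple workers, make sure they all send a byte-identical system prompt and tool list. Any drift between workers (reordered tools, a timestamp in the prompt) creates a separate cache entry, and you pay the 1.25x write cost on every cold call.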
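The warming pattern in item 1 can be sketched as follows. The request shape uses the `cache_control` block from Anthropic's Messages API. The helper names, the `"ping"` message, and the 240-second interval are illustrative assumptions, and nothing here sends a request:

```python
WARM_INTERVAL_S = 240  # warm every 4 minutes, inside the default 5-minute TTL


def build_warm_request(system_prompt: str, model: str) -> dict:
    """Keyword arguments for a minimal request that touches the cached prefix.

    The system prompt must be byte-identical to the one real traffic sends,
    otherwise the warming call refreshes a different cache entry.
    """
    return {
        "model": model,
        "max_tokens": 1,  # we want the cache read, not a real answer
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": "ping"}],
    }


def needs_warming(last_hit_ts: float, now: float,
                  interval_s: float = WARM_INTERVAL_S) -> bool:
    """True when no request has touched the cache for interval_s seconds.

    Real user traffic refreshes the TTL too, so record its timestamp as
    last_hit_ts and skip warming while the prompt is already in use.
    """
    return now - last_hit_ts >= interval_s
```

Assuming the official Python SDK, the dict can be passed as `client.messages.create(**build_warm_request(prompt, model))` from a background task that checks `needs_warming` every few seconds. Tracking the last real hit keeps you from paying for warming calls during busy periods.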
The combination of caching + AIMD + dual bucket is what lets a Tier 2 customer punch at Tier 4 effective throughput without paying for the upgrade. We see this pattern in our own customers' cost monitoring dashboards — cache hit rate above 80% routinely correlates with sub-1% 429 incidents even at sustained 90% RPM utilization.
Pattern 11: Tier-Upgrade Break-Even Math
The decision "do I optimize or do I upgrade?" is rarely framed quantitatively. Here's the formula that actually matters:
Monthly advantage of optimizing over upgrading =
`(tier_n_minimum_spend - current_spend) - (engineering_hours_for_optimization * loaded_rate) / payback_months`
A positive result favors optimizing; a negative result favors paying for the upgrade.
Worked example. Anthropic Tier 3 requires sustained $1,000/month spend; you're at $400/month. Optimization would take 6 hours of engineering at $200/hr loaded. Target payback: 3 months.
- Gap to Tier 3 minimum: $600/month
- Optimization cost: 6 * $200 = $1,200, amortized over 3 months = $400/month
- Optimization wins by $200/month if it actually delivers — meaning you keep Tier 2 limits and don't pay $600/month in non-productive usage just to clear the bar.
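The worked example reduces to one subtraction, which makes it easy to rerun with your own numbers. This is a sketch with hypothetical function and parameter names; the inputs are the figures from the example above.

```python
def optimization_advantage(tier_min_spend: float, current_spend: float,
                           eng_hours: float, loaded_rate: float,
                           payback_months: int) -> float:
    """Monthly dollars saved by optimizing instead of spending up to the next tier.

    Positive: optimizing is cheaper. Negative: the upgrade is cheaper.
    """
    # Extra monthly spend needed just to qualify for the higher tier.
    gap = max(tier_min_spend - current_spend, 0)
    # One-off engineering cost spread over the payback window.
    amortized_cost = eng_hours * loaded_rate / payback_months
    return gap - amortized_cost


advantage = optimization_advantage(
    tier_min_spend=1_000, current_spend=400,
    eng_hours=6, loaded_rate=200, payback_months=3,
)
print(f"optimizing wins by ${advantage:.0f}/month")  # optimizing wins by $200/month
```

To model the opportunity-cost factor discussed below, raise `loaded_rate`: at 3x the salary rate ($600/hr) the same six hours amortize to $1,200/month and the result flips to -$600, in favor of the upgrade.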
Two non-obvious factors that flip the math:
- Conversion lift. If faster responses (no retry waits) lift your free-to-paid conversion by even 0.3%, the LTV uplift on a typical SaaS funnel exceeds $600/month at modest scale. Always run a quick funnel calc; latency is rarely cosmetic.
- Engineering opportunity cost. If your team has shippable features queued, six hours of rate-limit work is six hours not building them. The hidden-cost variant of the loaded rate is your forgone feature value, which is usually 3-5x the salary loaded rate for a healthy product team.
If you're still unsure, run our pricing analysis against your monthly token volume — most teams find the true break-even is materially different from their gut estimate.
Pattern 12: Cancel-In-Flight on User Abandonment
A request the user has navigated away from is a request whose cost you should reclaim. Most LLM apps don't do this; they let streams complete to dead clients and pay for tokens that produce nothing.
```javascript
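import Anthropic from '@anthropic-ai/sdk';
import { useEffect } from 'react';

const client = new Anthropic(); // Reads ANTHROPIC_API_KEY from the environment.

// Async generator (note the *): yields stream events until completion or abort.
async function* streamingChat(prompt, signal) {
  if (signal.aborted) return; // Never start a request the user already left.
  const stream = await client.messages.stream({
    model: 'claude-sonnet-4-6',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 1024,
  });
  // Cancel the underlying HTTP stream as soon as the caller aborts, rather
  // than waiting for the next chunk. Generation stops with the connection.
  const onAbort = () => stream.controller.abort();
  signal.addEventListener('abort', onAbort, { once: true });
  try {
    for await (const chunk of stream) {
      yield chunk;
    }
  } catch (err) {
    if (!signal.aborted) throw err; // Ignore only the error caused by our own abort.
  } finally {
    signal.removeEventListener('abort', onAbort);
  }
}

// Wired to React (inside a component): abort on unmount or when the prompt changes.
// `onChunk` is a hypothetical callback that appends each event to component state.
useEffect(() => {
  const ac = new AbortController();
  (async () => {
    for await (const chunk of streamingChat(prompt, ac.signal)) onChunk(chunk);
  })().catch(console.error);
  return () => ac.abort();
}, [prompt]);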
```
In aggregate this can cut your output token bill by 15-30% on chat-heavy products where users routinely click away mid-response. It also frees output TPM: tokens that are never generated never count against your minute window. The input tokens of a cancelled request have already been consumed, so the saving is on the output side only. For more detail on how to measure where this matters in your codebase, our guide on tracking AI agent failures shows the spans you need to instrument.
Final Word
The teams that handle rate limits well treat them as a system property — measured, alerted, optimized — not as a per-incident firefight. Once you have the dual bucket, AIMD, fairness, prompt caching, and a clean telemetry schema in place, a 429 stops being a page and starts being a chart. That's the bar.
When you want to skip building the chart yourself, book a 5-minute demo and we'll plug your fleet into ClawPulse — burn rate, headroom, retry storms, per-tenant slicing — visible in under five minutes. Compare us against the LangChain monitoring stack, Langfuse alternatives, or self-hosted approaches before you commit; we made the comparisons easy on purpose.