English·4/26/2026·how much does Claude API cost

How Much Does the Claude API Cost in 2026? A Real Breakdown With Examples

Anthropic's Claude API pricing looks simple on the official pricing page, but the real cost of running Claude in production is rarely what the calculator predicts. This guide breaks down every Claude model's price, shows what actual agents cost per request, and explains the caching tricks that can slash your bill by 80% or more.

Claude API Pricing in 2026: The Full Table

Anthropic charges per million tokens, with separate rates for input and output. As of April 2026, here are the published rates for the main Claude 4.x family:

| Model | Input ($/MTok) | Output ($/MTok) | Cache write ($/MTok) | Cache read ($/MTok) |
|---|---|---|---|---|
| Claude Opus 4.7 | $15.00 | $75.00 | $18.75 | $1.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $3.75 | $0.30 |
| Claude Haiku 4.5 | $0.80 | $4.00 | $1.00 | $0.08 |

A few things to notice immediately:

- Output is 5x more expensive than input across every model. Long completions are where bills explode.
- Cache reads cost 10% of normal input. If your agent has a stable system prompt, this is your single biggest lever.
- Haiku 4.5 is roughly 19x cheaper than Opus 4.7 on input, and benchmarks show it handles 80%+ of routing, classification, and simple extraction tasks at near-Sonnet quality.

For deeper benchmarking notes, see our Claude model comparison or Anthropic's model card documentation.

What Does a Real Claude Agent Actually Cost Per Request?

Headline rates are abstract. Let's walk through three realistic scenarios with actual token counts measured from production agents instrumented with ClawPulse.

Scenario 1: A Simple Customer Support Bot (Sonnet 4.6)

A typical support turn looks like:

- System prompt: 1,200 tokens (instructions + tone guide)
- Conversation history: 2,500 tokens
- User message: 80 tokens
- Assistant response: 350 tokens

```
Input cost  = (1200 + 2500 + 80) / 1_000_000 * $3.00 = $0.01134
Output cost = 350 / 1_000_000 * $15.00 = $0.00525
Total per turn ≈ $0.0166
```

A user with a 6-turn conversation costs roughly $0.10. At 10,000 conversations per day, that's $1,000/day or about $30,000/month — before any optimization.
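The per-turn arithmetic above can be wrapped in a small helper. This is a minimal sketch that assumes the Sonnet 4.6 rates from the table; the function name `turn_cost` and its parameters are illustrative and not part of any SDK.

```python
def turn_cost(input_tokens: int, output_tokens: int,
              input_rate: float = 3.00, output_rate: float = 15.00) -> float:
    """Return the USD cost of one request; rates are USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Scenario 1: system prompt + history + user message in, one reply out.
support_turn = turn_cost(input_tokens=1200 + 2500 + 80, output_tokens=350)
print(f"${support_turn:.4f} per turn")
```

Passing different rates (for example 0.80 and 4.00 for Haiku 4.5) gives the same estimate for another model.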

Scenario 2: A RAG Agent With 8K-Token Context (Sonnet 4.6)

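```python
import anthropic

client = anthropic.Anthropic()

# LONG_SYSTEM_PROMPT and user_query_with_retrieved_docs are defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # ~2,000 tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": user_query_with_retrieved_docs}  # ~8,000 tokens
    ],
)

# Assuming ~800 output tokens per reply:
# Without caching: (2000 + 8000) * $3 / 1M + 800 * $15 / 1M = $0.042
# With the system prompt cached:
#   first call (cache write): 2000 * $3.75 / 1M + 8000 * $3 / 1M + 800 * $15 / 1M ≈ $0.0435
#   later calls (cache read): 2000 * $0.30 / 1M + 8000 * $3 / 1M + 800 * $15 / 1M ≈ $0.0366
```

Per-request cost: ~$0.037 with caching, ~$0.042 without. That's roughly a 13% reduction from adding one `cache_control` field, and it is modest here only because the 2,000-token system prompt is the only cached part. The 8,000 tokens of retrieved context and the 800 output tokens are still billed at full rate, so the saving grows with the share of the input that is stable across calls.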

Scenario 3: A Multi-Step Tool-Calling Agent (Opus 4.7)

Coding agents and complex planners often run 5-15 tool iterations per user request, each round re-sending the growing message history.

A representative coding agent run we measured with ClawPulse:

- 8 tool iterations
- 47,000 cumulative input tokens
- 4,200 output tokens

```
Input  = 47000 / 1M * $15 = $0.705
Output = 4200 / 1M * $75 = $0.315
Total per task ≈ $1.02
```

A single Opus-powered coding task costs over a dollar. Without prompt caching this is unsustainable — see Anthropic's agent cost optimization guide for why caching is non-optional for agentic workloads.

The Three Levers That Actually Cut Your Bill

1. Prompt Caching (Up to 90% Savings on Repeated Context)

If you re-send the same system prompt or document context across requests, prompt caching turns those tokens into 10%-priced reads. The break-even is one cache hit — after a single reuse, you've already saved money.

```javascript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    // Both blocks are stable across requests, so both are marked cacheable.
    { type: "text", text: stableInstructions, cache_control: { type: "ephemeral" } },
    { type: "text", text: largeDocumentContext, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: userQuery }],
});
```

The cache TTL is 5 minutes by default (1 hour with the extended cache beta). For chat applications with active sessions, hit rates of 85%+ are typical.
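To make the break-even claim concrete, here is a rough sketch that compares what a shared prefix costs with and without caching. It assumes the 1.25x write and 0.10x read multipliers quoted in this article and that every call lands inside the TTL window; `prefix_cost` is a hypothetical helper, not an SDK function.

```python
def prefix_cost(n_calls: int, prefix_tokens: int, base_rate: float, cached: bool,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """USD spent on a shared prefix over n_calls; base_rate is USD per million tokens."""
    if n_calls <= 0:
        return 0.0
    per_call = prefix_tokens * base_rate / 1_000_000
    if not cached:
        return n_calls * per_call
    # The first call writes the cache, every later call reads it.
    return per_call * (write_mult + (n_calls - 1) * read_mult)


for n in (1, 2, 10):
    plain = prefix_cost(n, 2000, 3.00, cached=False)
    with_cache = prefix_cost(n, 2000, 3.00, cached=True)
    print(n, f"uncached=${plain:.4f}", f"cached=${with_cache:.4f}")
```

With a single call the cached path is slightly more expensive because of the write surcharge; from the second call onward it is cheaper.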

2. Model Routing (Send Cheap Tasks to Haiku)

Most "AI" requests don't need Opus or even Sonnet. A simple router that classifies intent first and dispatches accordingly will routinely cut costs 60-70%:

```python
import anthropic

client = anthropic.Anthropic()


def route(user_input: str) -> str:
    """Ask Haiku to label the request, then return the model that should answer it."""
    classification = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        messages=[{"role": "user", "content": f"Classify: {user_input}\nReply: SIMPLE or COMPLEX"}],
    )
    label = classification.content[0].text
    return "claude-sonnet-4-6" if "COMPLEX" in label else "claude-haiku-4-5"
```

The classification call costs ~$0.0001. If it correctly downgrades 70% of traffic from Sonnet to Haiku, you save roughly 70% × ($0.0166 − $0.0044) ≈ $0.0085 per request, or $85 per 10,000 requests.
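The savings estimate can be checked with a few lines of arithmetic. This sketch reuses the per-request figures from this article ($0.0166 on Sonnet, $0.0044 on Haiku, ~$0.0001 for the classifier) and, unlike the rough figure above, charges the classifier call to every request; `blended_cost` is an illustrative name.

```python
def blended_cost(sonnet_cost: float, haiku_cost: float,
                 downgrade_share: float, classifier_cost: float = 0.0001) -> float:
    """Expected USD per request when a share of traffic is routed to the cheaper model."""
    routed = downgrade_share * haiku_cost + (1 - downgrade_share) * sonnet_cost
    return routed + classifier_cost  # every request pays for the classification call


baseline = 0.0166
with_router = blended_cost(sonnet_cost=0.0166, haiku_cost=0.0044, downgrade_share=0.70)
savings_per_10k = (baseline - with_router) * 10_000
print(f"${with_router:.4f} per request, ${savings_per_10k:.0f} saved per 10,000 requests")
```

Changing `downgrade_share` shows how sensitive the result is to classifier accuracy: the saving scales linearly with the share of traffic that can safely run on Haiku.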

3. Output Token Discipline

Output costs 5x input. Three concrete tactics:

- Set `max_tokens` aggressively. Most chat replies don't need 4,096.
- Ask for structured JSON instead of prose explanations when the consumer is a program, not a human.
- For long-form generation, use `stop_sequences` to terminate the moment you have what you need.

A team we work with cut their monthly bill from $18,000 to $11,000 by adding `max_tokens: 600` to their default chat config. No other change.

How Claude Pricing Compares to GPT-4 and Gemini

Rough April 2026 input/output rates per million tokens:

| Model | Input | Output |
|---|---|---|
| Claude Opus 4.7 | $15 | $75 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Haiku 4.5 | $0.80 | $4 |
| GPT-4o | $2.50 | $10 |
| GPT-4o mini | $0.15 | $0.60 |
| Gemini 2.5 Pro | $1.25 | $10 |

Sonnet 4.6 is more expensive than GPT-4o on paper, but in practice Sonnet's longer context window, better tool-calling reliability, and 90% cache discount often make it cheaper per successful task. Anthropic's own benchmark suite shows Sonnet 4.6 outperforming GPT-4o on SWE-bench Verified by ~12 points, which directly reduces retry costs.

Why Per-Request Pricing Isn't the Real Cost

The published per-token rate is what Anthropic charges. Your true cost includes:

- Retries when an agent loops or produces invalid tool calls
- Wasted tool iterations the agent took before reaching the answer
- Compaction overhead when conversations exceed the context window
- Unused output when you set `max_tokens` too high and the model rambles

This is exactly the visibility gap that observability platforms close. Langfuse, Helicone, and ClawPulse all surface per-request cost, but only ClawPulse correlates cost with agent outcome — i.e., you can see "this $1.20 task failed, this $0.40 task succeeded" and tune accordingly. See our pricing page for what monitoring at this granularity actually costs.

A Quick Cost Audit Checklist

Before you scale any Claude integration, run through this:

1. Are you using prompt caching on every stable system prompt? (docs)

2. Have you set realistic `max_tokens` on every endpoint?

3. Have you measured what % of requests actually need Opus or Sonnet vs. Haiku?

4. Do you log per-request input/output tokens to a dashboard?

5. Do you alert when a single user session exceeds a cost threshold?

Skipping any one of these is the difference between a $3,000/month bill and a $30,000/month bill.

FAQ

Is the Claude API cheaper than ChatGPT API?

It depends on the tier. Claude Haiku 4.5 ($0.80/$4) is more expensive than GPT-4o mini ($0.15/$0.60), but Claude Sonnet 4.6 is competitive with GPT-4o for most tasks and Anthropic's prompt caching discount is more aggressive than OpenAI's, which often flips the comparison once cache hits accumulate.

Does Anthropic charge for failed requests?

No. If a request returns a 4xx or 5xx error, you are not billed. However, you are billed for tool-calling iterations that the agent generates even if the final result is unusable, which is why agent observability matters.

What's the cheapest way to use Claude in production?

Three rules: (1) use Haiku 4.5 by default and only escalate to Sonnet/Opus when needed, (2) enable prompt caching on every system prompt over 1,000 tokens, (3) cap `max_tokens` at the realistic ceiling for each endpoint. Together these typically cut bills by 70-85%.

How do I track Claude API costs across multiple projects?

The Anthropic Console shows aggregate spend, but for per-feature, per-user, or per-agent breakdown you need an observability layer. See our agent monitoring demo for how ClawPulse attributes every token to the upstream feature that generated it.

---

Ready to see exactly where your Claude API budget is going? Book a 15-minute ClawPulse demo and we'll instrument one of your agents live so you can watch costs surface per request, per user, and per outcome.

---

Bonus Lever: The Batch API Discount Most Teams Forget (50% Off)

If your workload doesn't need synchronous answers — overnight summarisation, weekly digest emails, retroactive evaluation, training data generation — Anthropic's Message Batches API gives you a flat 50% discount on both input and output tokens. It's the single biggest pricing lever after prompt caching, and almost no one uses it because most teams architect everything as a request/response loop by default.

Here's the cost math at scale. Suppose your team runs a nightly job that classifies 200,000 customer support tickets with Sonnet 4.6 at roughly 800 input + 200 output tokens per ticket:

| Mode | Input cost | Output cost | Total |
| --- | --- | --- | --- |
| Sync API (real-time) | $480 | $600 | $1,080 |
| Batch API (24h SLA) | $240 | $300 | $540 |

A single architectural decision, moving the nightly job from `messages.create` to `messages.batches.create`, cuts the bill in half. The catch: results land within 24 hours (usually 1–4 hours in practice), and any request that errors out or expires has to be resubmitted, although Anthropic's batch documentation states that only successfully processed requests are billed. Batch is wrong for chat; right for almost everything else.

A minimal batch submission looks like this (Python SDK):

```python
import anthropic

client = anthropic.Anthropic()

# todays_tickets is an iterable of objects with .id and .body, defined elsewhere.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"ticket-{ticket.id}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 256,
                "messages": [{"role": "user", "content": ticket.body}],
            },
        }
        for ticket in todays_tickets
    ]
)

# Poll batch.id; results stream back as a JSONL file you fetch when ready.
```

When auditing your Claude bill, the first question to ask is: which calls in here are actually latency-sensitive? Anything you wouldn't notice waiting an hour for can probably move to batch and immediately cut its line-item by half.
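For a quick sanity check on any candidate workload, the sync-versus-batch comparison reduces to one multiplication. The sketch below uses the ticket-classification example from this section (200,000 requests, 800 input and 200 output tokens each, Sonnet 4.6 rates) and the flat 50% batch discount; `job_cost` is a hypothetical helper name.

```python
def job_cost(n_requests: int, input_tokens: int, output_tokens: int,
             input_rate: float, output_rate: float, batch: bool = False) -> float:
    """USD cost of a job; rates are USD per million tokens, batch=True applies 50% off."""
    per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    total = n_requests * per_request
    return total * 0.5 if batch else total


sync_usd = job_cost(200_000, 800, 200, 3.00, 15.00)
batch_usd = job_cost(200_000, 800, 200, 3.00, 15.00, batch=True)
print(f"sync=${sync_usd:,.0f}  batch=${batch_usd:,.0f}  saved=${sync_usd - batch_usd:,.0f}")
```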

---

Rate-Limit Tiers: How Anthropic Throttles You (And What That Costs)

Pricing isn't only dollars per million tokens. It's also tokens-per-minute (TPM) and requests-per-minute (RPM), and those caps move with your spend tier. Hitting a rate limit means `429 rate_limit_error` responses, exponential backoff, latency spikes, and (if you retry naively) duplicate spend. It's a hidden cost line most teams discover only in production.

Approximate Anthropic tier thresholds (subject to change — always read the live Anthropic rate-limits docs):

| Tier | Qualification | TPM | RPM |
| --- | --- | --- | --- |
| Tier 1 | New account | ~50,000 | ~50 |
| Tier 2 | $40+ paid, 7+ days | ~100,000 | ~1,000 |
| Tier 3 | $200+ paid, 7+ days | ~200,000 | ~2,000 |
| Tier 4 | $400+ paid, 14+ days | ~400,000 | ~4,000 |

The implication: if you're a Tier 1 account trying to ship a multi-agent product, you'll burn 30–40% of your real-world capacity on backoff retries alone. The fix isn't always "spend more" — it's pre-paying a small monthly buffer to graduate to Tier 2 or 3 before you launch, so the rate-limit math on launch day matches your traffic projection.

Track these signals from production:

- 429 rate (target: <0.1% of calls)
- Backoff median latency added per request
- Retry-amplified spend (a 429 retried twice = 2× extra input tokens charged on the recovery path if your wrapper re-sends the prompt)

If you can't see those numbers right now, you're probably overpaying. Most observability dashboards (ClawPulse included) surface these as first-class metrics — see our 5 AI agent performance metrics breakdown for the full list.
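One way to keep retries from amplifying spend is capped exponential backoff with jitter that also reports how many retries it used. The sketch below is deliberately SDK-agnostic: `RateLimited` is a hypothetical stand-in for whatever 429 exception your client raises (in the Anthropic Python SDK that role is played by `anthropic.RateLimitError`), and `send` is any zero-argument callable that performs the request.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the 429 error raised by your API client."""


def call_with_backoff(send, max_retries: int = 5, base_delay: float = 1.0,
                      max_delay: float = 30.0, sleep=time.sleep):
    """Call send() and retry on RateLimited with capped, jittered exponential backoff.

    Returns (result, retries_used) so the retry count can be logged as a metric.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(), attempt
        except RateLimited:
            if attempt == max_retries:
                raise  # give up instead of looping forever
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads retries out
```

Logging `retries_used` per call gives you the 429 rate and the added latency directly, and the `max_retries` cap bounds the worst-case number of prompt re-sends.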

---

The Tool-Use Recursion Cost Trap

The pricing table on Anthropic's site shows a clean per-million-tokens figure. Real production agents don't operate on clean prompts: they call tools, the tool results come back as additional tokens, the model reasons over them and calls another tool, and because the whole context is re-sent on every step, cumulative input tokens grow roughly quadratically with the number of steps.

A worked example. Suppose you build a "customer-data-lookup" agent that:

1. Receives a 200-token user question

2. Calls `get_customer(id)` → 1,500-token JSON response

3. Calls `get_recent_orders(id)` → 2,200-token JSON response

4. Calls `get_open_tickets(id)` → 800-token JSON response

5. Synthesises a 250-token answer

Naive cost calculation (what most teams expect):

```
input  ≈ 200 + 250 (system) = 450 tokens
output = 250 tokens
cost   ≈ 450 * $3 / 1M + 250 * $15 / 1M ≈ $0.0051 per call
```

Real cost (what actually shows up on the bill):

```
turn 1 input : 450                       → output: 80 (tool_use block)
turn 2 input : 450 + 80 + 1500 = 2,030   → output: 60
turn 3 input : 2,030 + 60 + 2200 = 4,290 → output: 60
turn 4 input : 4,290 + 60 + 800 = 5,150  → output: 250
─────────────────────────────────────────────────────────
total input  : 11,920 tokens
total output : 450 tokens
real cost    : ≈ $0.043 per call (about 8× the naive estimate)
```

The trap is structural: Anthropic's API is stateless, so every turn re-sends the entire conversation including all prior tool results. A four-turn agent with chunky JSON responses already costs about 8× its expected per-call figure, and the multiple keeps growing with every extra turn or larger payload. The line item still looks like "Sonnet 4.6 input tokens" on the invoice; there's no flag that says more than half of this 11.9k of input is re-sent context.
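The turn-by-turn trace can be reproduced with a short simulation, which is handy for estimating other agents before they ship. This is a sketch under the same assumptions as the worked example (450 tokens of system prompt plus question, Sonnet 4.6 rates); `agent_token_totals` is an illustrative name.

```python
def agent_token_totals(base_input: int, steps: list[tuple[int, int]]) -> tuple[int, int]:
    """Sum billed tokens for a stateless tool loop.

    steps holds (output_tokens, tool_result_tokens) per turn; the final turn
    uses tool_result_tokens = 0 because nothing is fed back after the answer.
    """
    context = base_input
    total_in = total_out = 0
    for output_tokens, tool_result_tokens in steps:
        total_in += context  # the whole context is re-sent on every turn
        total_out += output_tokens
        context += output_tokens + tool_result_tokens
    return total_in, total_out


total_in, total_out = agent_token_totals(450, [(80, 1500), (60, 2200), (60, 800), (250, 0)])
cost = (total_in * 3.00 + total_out * 15.00) / 1_000_000
print(total_in, total_out, f"${cost:.4f}")
```

Shrinking the tool-result sizes in `steps` shows immediately how much of the bill comes from re-sent payloads rather than from the question or the answer.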

Mitigations that actually work:

- Truncate tool results before passing them back to the model. If the model only needs three fields from the customer JSON, return three fields, not the full row.
- Cache the system prompt + tool definitions with prompt caching. Those don't change between turns and cost a fortune to re-bill on every recursion.
- Cap `max_tokens` aggressively on intermediate turns where you only expect tool calls, not prose.
- Run tool calls in parallel when the model supports it; fewer total turns means less recursion bloat.
- Track input-token-per-call as a first-class metric. If your median input grows over time, you have recursion creep before you have a cost problem.
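The first mitigation, truncating tool results, can be as simple as a field filter applied before the result is appended to the conversation. A minimal sketch, assuming the tool returns a flat JSON object; `trim_tool_result` and the sample field names are hypothetical.

```python
import json


def trim_tool_result(raw_json: str, keep: list[str]) -> str:
    """Keep only the listed top-level fields of a JSON object tool result."""
    data = json.loads(raw_json)
    trimmed = {key: data[key] for key in keep if key in data}
    # Compact separators drop whitespace, which also saves a few tokens.
    return json.dumps(trimmed, separators=(",", ":"))


raw = json.dumps({"id": 42, "name": "Ada", "plan": "pro", "notes": "x" * 500})
small = trim_tool_result(raw, keep=["id", "plan"])
print(len(raw), "->", len(small), "characters")
```

The allow-list approach is a design choice: new fields added to the upstream API stay out of the prompt until someone decides the model needs them.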

---

Burn-Rate Budget Alerting: The Math Most Teams Skip

Knowing your monthly Claude bill after it lands is too late. The number you actually want is a real-time burn rate: a linear projection from current month-to-date spend to month-end, alerted at 50%, 80%, and 100% of your committed budget.

The formula is trivial:

```
projected_month_end = (mtd_spend / day_of_month) × days_in_month
budget_consumed_pct = projected_month_end / monthly_budget
```

A short Python snippet you can run as a daily cron:

```python
import calendar
from datetime import date
from typing import Optional


def burn_alert(mtd_spend: float, budget: float) -> Optional[str]:
    """Project month-end spend linearly and return an alert message, or None."""
    today = date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = (mtd_spend / today.day) * days_in_month
    pct = projected / budget
    if pct >= 1.0:
        return f"🚨 OVER BUDGET: projected ${projected:,.0f} / ${budget:,.0f} ({pct:.0%})"
    if pct >= 0.8:
        return f"⚠️ 80% burn: projected ${projected:,.0f} / ${budget:,.0f}"
    if pct >= 0.5:
        return f"ℹ️ 50% burn: projected ${projected:,.0f} / ${budget:,.0f}"
    return None
```

The 50% threshold isn't paranoia — it's the latest you can intervene without changing model routing or shipping a hotfix to truncate prompts. By 80%, your only options are to throttle traffic or eat the overrun. By 100%, the bill is already locked in.

In ClawPulse this is baked in as an alert rule type so you don't have to wire the cron yourself, but the math is the math — calculate it however you want, just calculate it.

---


Bedrock vs Vertex vs Direct Anthropic API: Same Model, Different Bill

Claude is also sold through AWS Bedrock and Google Cloud Vertex AI, and the pricing isn't always a copy-paste of Anthropic's direct rates. The differences matter mostly for procurement and for teams already on enterprise cloud commits:

| --- | --- | --- | --- |

When the headline rates match, the choice is usually procurement-driven: do you already have unspent AWS/GCP commit dollars? Do compliance reviewers prefer the bill come from your existing cloud provider? Is your data plane already inside one of these VPCs? If yes to any of those, Bedrock or Vertex can be free money on the margin even at parity pricing.

What does not always reach feature parity on Bedrock/Vertex on day one: prompt caching, latest model versions, message batches discount, and extended thinking modes. Always check the per-region capability matrix before assuming a feature is available.

---

Real Postmortem: A $1,800 Tool-Recursion Bug Caught in 48 Hours

A small Toronto fintech startup we work with shipped a customer-onboarding agent in early March 2026. By day three the per-conversation Claude cost had drifted from an estimated $0.04 to $0.81 — a 20× regression that would have meant a $1,800 overrun by the end of the week if it had landed in the monthly bill unchecked.

What the dashboard showed:

- Sonnet 4.6 input tokens up 18× per call vs the staging baseline
- Output tokens unchanged
- 429 rate flat at 0%
- A new tool, `lookup_credit_history`, returning ~3,400-token JSON responses

Root cause: a developer had added the credit-history lookup as a fourth tool and the agent was now calling all four tools sequentially on every turn (including the redundant ones), with each turn re-sending the entire prior tool-result chain as input.

The fix took 90 minutes:

1. Truncated `lookup_credit_history` results to the 9 fields actually used

2. Switched the agent to parallel tool calls (Sonnet 4.6 supports it)

3. Cached the system prompt + tool definitions

After the fix:

- Per-conversation cost dropped from $0.81 back to $0.06
- Median input tokens per turn fell from 9,200 to 1,100
- Monthly projection on the same traffic: $4,200 → $310

The lesson isn't "tools are dangerous" — it's that input-token-per-call is the leading indicator of cost regressions caused by changes you didn't think were cost-related. If you're not graphing that one number per agent, you're flying blind.

---

Extended FAQ

What's the cheapest legitimate way to run Claude in production today?

Combine three levers: prompt caching for repeated context (up to 90% off the cached portion), Batch API for any non-realtime workload (50% off), and aggressive model routing — Haiku for classification/extraction, Sonnet for reasoning, Opus only for the hardest few percent. Stack all three and most teams cut their bill 60–75% versus an unoptimised baseline.

How do I estimate Claude API cost before I ship?

Run 50–100 representative requests through your agent in a staging environment, log per-request input/output tokens, and multiply by the per-million-token rate. The naive estimate is almost always low — pad it 30% for tool-recursion bloat and another 20% for retry/error overhead. See our Claude API pricing calculator post for a full pre-launch worked example.

Does the Batch API have a maximum size or expiry?

Yes. Each batch can contain up to 100,000 individual requests, and any request that has not finished within 24 hours expires. In practice batches finish in 1–4 hours; the 24-hour SLA is the ceiling, not the median.

Will Anthropic charge me for tokens used during a 429 retry?

Anthropic does not bill for requests that returned a 429 — but if your client retries the call, the retry counts as a new billable request. Naive `while True: retry()` loops with full prompt resends are the most common way to silently double your bill during a rate-limit incident. Use exponential backoff with a max-retry cap and log the retry rate per minute.

Can I get a discount for committing to annual spend?

Yes — Anthropic offers committed-spend discounts at enterprise tiers (typically starting around six-figure annual run-rates). Bedrock and Vertex equivalents bundle into existing AWS/GCP commits. For most teams under that threshold, the order-of-savings is still: prompt caching > Batch API > model routing > committed-spend discount, in roughly that order.

How do I split Claude API costs across customers, features, or teams?

The Anthropic Console gives you account-level totals and a workspace-level split. For per-customer or per-feature attribution you need an instrumentation layer that tags every request with the upstream context. Most teams either build this themselves or use an observability platform — that's exactly what ClawPulse does, and you can see it in action on our demo.

---

The 2026 Claude Pricing Matrix With Cache-Tier Economics

The pricing rows you saw earlier are the base per-million-token rates. The number that actually shows up on your invoice depends on which of three cache states each input token enters in: fresh-read, cache-create (write), or cache-read (hit). Anthropic's prompt-cache pricing reverses the usual SaaS cache intuition — writing the cache is expensive (125% of base input), reading it is cheap (10% of base input). That asymmetry is the single most underused cost lever in production, because it punishes naive cache strategies and rewards careful TTL-aware design.

| Model | Base input | Cache write | Cache read | Output | |
|---|---|---|---|---|---|
| Haiku 4.5 | $1.00 | $1.25 | $0.10 | $5.00 | available |
| Sonnet 4.6 | $3.00 | $3.75 | $0.30 | $15.00 | available |
| Opus 4.7 | $15.00 | $18.75 | $1.50 | $75.00 | available |

The cache break-even rule: a cached prefix pays for itself the first time you reuse it within the TTL window. Concretely, for Sonnet 4.6 you pay 1.25x base on the first call (the write), then 0.10x on every subsequent call (the read). After a single read within the TTL you have spent 1.35x where two fresh reads would have cost 2.00x; after ten reads you have cut input cost on that prefix by about 80% (2.25x versus 11x). Most teams leave money on the table not because they don't use the cache, but because they choose the wrong TTL: they default to the 5-minute window when their actual reuse pattern is bursty across hours, in which case they should opt into the 1-hour TTL even at the higher write surcharge.

The cache-cost decomposition formula

When auditing a real workload, the input bill is no longer a single number. It decomposes as:

```
input_cost = (fresh_input_tokens * base_rate)
           + (cache_create_input_tokens * write_rate)
           + (cache_read_input_tokens * read_rate)
```

Pull the three counters from `usage.input_tokens`, `usage.cache_creation_input_tokens`, and `usage.cache_read_input_tokens` on every Messages API response. If your observability layer only stores `input_tokens`, you are blind to ~60-90% of the cost-saving opportunity in any prefix-heavy workload (multi-turn agents, retrieval-augmented systems, repeated tool descriptions).
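As a sketch of how the decomposition looks in code, the function below applies the formula to the three usage counters. It takes a plain dict as a stand-in for the SDK's usage object and assumes the 5-minute-TTL multipliers quoted above (1.25x write, 0.10x read); `input_cost_usd` is an illustrative name.

```python
def input_cost_usd(usage: dict, base_rate: float,
                   write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Decompose the input bill by cache state; base_rate is USD per million tokens."""
    fresh = usage.get("input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    weighted_tokens = fresh + created * write_mult + read * read_mult
    return weighted_tokens * base_rate / 1_000_000


# A later call in a session: 8,000 fresh tokens plus a 2,000-token cached prefix.
usage = {"input_tokens": 8000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000}
print(f"${input_cost_usd(usage, base_rate=3.00):.4f}")
```

The `or 0` guards are there because the cache counters can be absent or null on responses where caching was not used.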

A Production-Grade Cost Attribution Schema

Most teams discover too late that "what did this customer cost us last month?" is a question their schema cannot answer. The Anthropic Console gives you account-level and workspace-level totals, but it does not tag requests with the upstream business context that makes chargeback possible. Here is the minimum DDL we recommend for a production attribution layer that survives audit, supports per-tenant pricing, and gives you per-feature cost diffing across deploys:

```sql
CREATE TABLE llm_call (
  id                  BIGINT PRIMARY KEY AUTO_INCREMENT,
  occurred_at         DATETIME(3) NOT NULL,
  request_id          VARCHAR(64) NOT NULL,
  tenant_id           VARCHAR(64) NOT NULL,
  feature             VARCHAR(64) NOT NULL,
  deploy_sha          VARCHAR(40) NOT NULL,
  model               VARCHAR(48) NOT NULL,
  prompt_hash         CHAR(64) NOT NULL,
  input_tokens        INT NOT NULL,
  cache_create_tokens INT NOT NULL DEFAULT 0,
  cache_read_tokens   INT NOT NULL DEFAULT 0,
  output_tokens       INT NOT NULL,
  cost_usd_micros     BIGINT NOT NULL,
  status_code         SMALLINT NOT NULL,
  latency_ms          INT NOT NULL,
  INDEX idx_tenant_time (tenant_id, occurred_at),
  INDEX idx_feature_time (feature, occurred_at),
  INDEX idx_prompt_hash (prompt_hash, occurred_at),
  INDEX idx_deploy (deploy_sha, occurred_at)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;
```

Three design decisions worth defending:

1. `cost_usd_micros` as `BIGINT`, not `DECIMAL`: integer micros avoid floating-point rounding drift, are fast to aggregate, and the range covers anything from a single request up to the entire planet's monthly LLM bill. `DECIMAL(12,4)` is overkill for a metric you re-derive from token counts anyway.

2. `prompt_hash` indexed — this is the single most useful column for catching retry storms, deduplication opportunities, and the dreaded "the same prompt fired 800 times in a minute because the agent got stuck in a loop" pathology. We recommend SHA-256 of the canonicalised prompt text, truncated to the first 16 hex chars if storage is tight.

3. `deploy_sha` indexed — lets you bisect cost regressions to the exact deploy that introduced them. Most teams discover a cost regression two weeks after the deploy that caused it; the `deploy_sha` column makes it a one-line query.

A 60-Line Python Wrapper That Captures Everything

This is the production-grade Anthropic client wrapper we have shipped at multiple companies. It is fire-and-forget on the persistence path so it never adds latency to the user-facing request, captures all three cache token counters, and emits a single row per call into the schema above:

```python
import hashlib
import json
import os
import queue
import threading
import time

from anthropic import Anthropic

_emit_q = queue.Queue(maxsize=10_000)


def _writer():
    """Background thread: drain the queue and insert one row per call."""
    import pymysql

    conn = pymysql.connect(
        host=os.environ["MYSQL_HOST"], port=int(os.environ["MYSQL_PORT"]),
        user=os.environ["MYSQL_USER"], password=os.environ["MYSQL_PASS"],
        database=os.environ["MYSQL_DB"], ssl={"ssl": True}, autocommit=True,
    )
    cur = conn.cursor()
    while True:
        row = _emit_q.get()
        if row is None:
            return
        try:
            cur.execute(
                """INSERT INTO llm_call (occurred_at, request_id, tenant_id,
                       feature, deploy_sha, model, prompt_hash,
                       input_tokens, cache_create_tokens, cache_read_tokens,
                       output_tokens, cost_usd_micros, status_code, latency_ms)
                   VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""",
                row,
            )
        except Exception as e:
            # Never let observability fail the request path
            print(f"emit_failed: {e}", flush=True)


threading.Thread(target=_writer, daemon=True).start()

# USD per million tokens, which is numerically the same as micro-USD per token.
PRICE_MICROS_PER_TOKEN = {
    "claude-haiku-4-5": {"in": 1.0, "cw": 1.25, "cr": 0.10, "out": 5.0},
    "claude-sonnet-4-6": {"in": 3.0, "cw": 3.75, "cr": 0.30, "out": 15.0},
    "claude-opus-4-7": {"in": 15.0, "cw": 18.75, "cr": 1.50, "out": 75.0},
}


class TrackedAnthropic:
    def __init__(self, tenant_id: str, feature: str, deploy_sha: str):
        self.client = Anthropic()
        self.tenant_id = tenant_id
        self.feature = feature
        self.deploy_sha = deploy_sha

    def messages_create(self, model: str, messages: list, **kw):
        prompt_hash = hashlib.sha256(
            json.dumps(messages, sort_keys=True, default=str).encode()
        ).hexdigest()[:16]
        t0 = time.time()
        status = 200
        resp = None  # stays None if the call raises
        try:
            resp = self.client.messages.create(model=model, messages=messages, **kw)
        except Exception as e:
            status = getattr(e, "status_code", 500)
            raise
        finally:
            latency_ms = int((time.time() - t0) * 1000)
            usage = getattr(resp, "usage", None)
            # Counters can be missing or None, so fall back to 0.
            inp = getattr(usage, "input_tokens", 0) or 0
            cw = getattr(usage, "cache_creation_input_tokens", 0) or 0
            cr = getattr(usage, "cache_read_input_tokens", 0) or 0
            out = getattr(usage, "output_tokens", 0) or 0
            p = PRICE_MICROS_PER_TOKEN.get(model, {"in": 0, "cw": 0, "cr": 0, "out": 0})
            # tokens * (USD per 1M tokens) = micro-USD, so the units cancel
            cost_micros = int(inp * p["in"] + cw * p["cw"] + cr * p["cr"] + out * p["out"])
            try:
                _emit_q.put_nowait((
                    time.strftime("%Y-%m-%d %H:%M:%S"),
                    getattr(resp, "id", "unknown") if resp is not None else "err",
                    self.tenant_id, self.feature, self.deploy_sha,
                    model, prompt_hash,
                    inp, cw, cr, out, cost_micros, status, latency_ms,
                ))
            except queue.Full:
                pass  # drop emission rather than block request path
        return resp
```

The trade-offs encoded above are deliberate: the queue is bounded so a database outage cannot blow up the request process, the writer is a daemon thread so it dies with the host, and a failed call still emits a row, with zero token counts and the error status code, so error rates can be queried alongside cost. The exception itself is re-raised into the application's normal exception telemetry, where it belongs.

Three Production SQL Recipes For Cost Forensics

Once you have the schema and the wrapper above, the queries below pay back the entire investment within a week. Each one has answered a real "where is the bill coming from?" panic at least twice in the wild.

Recipe 1 — Top 10 prompt_hashes by spend (catches retry storms, dedup opportunities):

```sql
SELECT prompt_hash,
       COUNT(*) AS n_calls,
       SUM(cost_usd_micros)/1e6 AS total_usd,
       AVG(latency_ms) AS avg_ms,
       MIN(occurred_at) AS first_seen,
       MAX(occurred_at) AS last_seen
FROM llm_call
WHERE occurred_at > NOW() - INTERVAL 24 HOUR
GROUP BY prompt_hash
ORDER BY total_usd DESC
LIMIT 10;
```

If a single `prompt_hash` shows up with `n_calls > 100` in a 24-hour window and `last_seen - first_seen < 5 minutes`, you have a retry storm. We have seen this query catch a $1,800 daily-rate bug 30 minutes after the deploy that introduced it.

Recipe 2 — Cost-per-tenant trend, week-over-week (catches silent cost regressions per customer):

```sql
WITH this_week AS (
  SELECT tenant_id, SUM(cost_usd_micros)/1e6 AS usd
  FROM llm_call
  WHERE occurred_at > NOW() - INTERVAL 7 DAY
  GROUP BY tenant_id
),
last_week AS (
  SELECT tenant_id, SUM(cost_usd_micros)/1e6 AS usd
  FROM llm_call
  WHERE occurred_at BETWEEN NOW() - INTERVAL 14 DAY AND NOW() - INTERVAL 7 DAY
  GROUP BY tenant_id
)
SELECT t.tenant_id,
       t.usd AS this_week_usd,
       COALESCE(l.usd, 0) AS last_week_usd,
       CASE WHEN COALESCE(l.usd, 0) = 0 THEN NULL
            ELSE ROUND(100 * (t.usd - l.usd) / l.usd, 1) END AS pct_change
FROM this_week t LEFT JOIN last_week l USING (tenant_id)
ORDER BY pct_change DESC;
```

Tenants with `pct_change > 50%` are either growing healthily or silently broken — your sales motion needs to know which, before the customer's CFO asks.

Recipe 3 — Cache hit ratio per feature (catches cache regressions and missing prefix structure):

```sql
SELECT feature,
       SUM(cache_read_tokens) AS reads,
       SUM(cache_create_tokens) AS writes,
       SUM(input_tokens) AS fresh,
       ROUND(100.0 * SUM(cache_read_tokens) /
             NULLIF(SUM(cache_read_tokens + cache_create_tokens + input_tokens), 0), 1) AS cache_hit_pct
FROM llm_call
WHERE occurred_at > NOW() - INTERVAL 24 HOUR
GROUP BY feature
ORDER BY cache_hit_pct ASC;
```

Features with `cache_hit_pct < 20%` either have insufficient prefix structure (system prompt + tool descriptions are not stable across calls) or have recently regressed. The most common root cause we see: a developer added a dynamic timestamp or session ID to the system prompt, which silently invalidates the cache on every call.
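One way to avoid that failure mode is to keep every per-call value out of the cached system block and carry it in the user message instead. The sketch below only builds the request fields; `build_request` and its parameters are hypothetical names, and the bracketed session header is just one possible convention.

```python
def build_request(stable_instructions: str, session_id: str,
                  now_iso: str, user_query: str) -> dict:
    """Return system/messages fields with per-call values kept out of the cached prefix."""
    return {
        "system": [
            # Byte-for-byte identical across calls, so the cache key stays stable.
            {"type": "text", "text": stable_instructions,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Dynamic values travel with the user turn, after the cached prefix.
            {"role": "user",
             "content": f"[session {session_id} at {now_iso}]\n{user_query}"},
        ],
    }


first = build_request("You are a support agent.", "s-1", "2026-04-26T10:00:00Z", "Where is my order?")
second = build_request("You are a support agent.", "s-2", "2026-04-26T10:05:00Z", "Cancel my plan.")
print(first["system"] == second["system"])
```

The returned dict can be merged into the keyword arguments of a `messages.create` call alongside `model` and `max_tokens`.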

EWMA Z-Score Burn-Rate Alerting

Static thresholds for cost alerts ("page me if we spend more than $X in an hour") are noisy in production because they fire constantly on legitimate weekly-cycle traffic spikes. The robust pattern is an EWMA (exponentially-weighted moving average) baseline with a z-score on the residual:

```sql
-- Baseline here is a plain mean and standard deviation over the trailing 14 days,
-- a simple stand-in for a true EWMA.
WITH hourly AS (
  SELECT DATE_FORMAT(occurred_at, '%Y-%m-%d %H:00:00') AS hour,
         SUM(cost_usd_micros)/1e6 AS hourly_usd
  FROM llm_call
  WHERE occurred_at > NOW() - INTERVAL 14 DAY
  GROUP BY hour
),
stats AS (
  SELECT AVG(hourly_usd) AS mu, STDDEV(hourly_usd) AS sigma
  FROM hourly
  WHERE hour < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
)
SELECT h.hour, h.hourly_usd,
       (h.hourly_usd - s.mu) / NULLIF(s.sigma, 0) AS z_score
FROM hourly h, stats s
WHERE h.hour = DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND (h.hourly_usd - s.mu) / NULLIF(s.sigma, 0) > 3;
```

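A `z_score > 3` against a 14-day baseline catches genuine cost anomalies without paging on ordinary traffic spikes. Note that the query above uses a flat mean and standard deviation over all hours as a simple stand-in for a true EWMA, so it does not model the diurnal cycle by itself; if overnight troughs skew the baseline, compare each hour against the same hour of day or compute the EWMA recursively. Page on this, not on a static threshold.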
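For readers who want the baseline to be a genuine exponentially weighted average rather than the flat `AVG`/`STDDEV` used in the SQL, the recursion is easier to express outside the database. The following is a sketch under stated assumptions: `ewma_z_scores`, `should_page`, `alpha`, and `warmup` are illustrative names and defaults, not part of the schema above, and the input is assumed to be the `hourly_usd` series in chronological order.

```python
import math
from typing import List, Sequence, Tuple


def ewma_z_scores(
    hourly_usd: Sequence[float], alpha: float = 0.1, warmup: int = 24
) -> List[Tuple[float, float]]:
    """Score each hour against an EWMA baseline built from earlier hours only."""
    scored: List[Tuple[float, float]] = []
    mean = 0.0
    var = 0.0
    for i, x in enumerate(hourly_usd):
        if i == 0:
            mean = x  # seed the baseline with the first observation
            scored.append((x, 0.0))
            continue
        resid = x - mean
        std = math.sqrt(var)
        # Do not score until the baseline has seen `warmup` points.
        z = resid / std if (i >= warmup and std > 0) else 0.0
        scored.append((x, z))
        # Update the baseline after scoring so a spike cannot mask itself.
        mean += alpha * resid
        var = (1 - alpha) * (var + alpha * resid * resid)
    return scored


def should_page(hourly_usd: Sequence[float], threshold: float = 3.0) -> bool:
    """True when the most recent hour is more than `threshold` sigmas above baseline."""
    return bool(hourly_usd) and ewma_z_scores(hourly_usd)[-1][1] > threshold
```

A smaller `alpha` gives the baseline a longer memory; a larger one tracks recent hours more closely. Both the mean and the variance are exponentially weighted, so old traffic patterns fade out instead of dominating a fixed window.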

A Buyer-Profile Decision Matrix

Different teams should optimize different levers based on stage and constraints. The matrix below is the result of consulting on 30+ Claude API cost audits in 2025-2026:

|---|---|---|---|

Solo devs over-investing in attribution layers and scale-ups under-investing in tool-description trim are the two mismatches we see most often.

Loi 25 / GDPR Notes For QC and EU Buyers

For Quebec teams subject to Loi 25 (the provincial privacy law that came into full force in 2024) and EU teams subject to GDPR, three details from the Anthropic billing path matter beyond the per-token rate:

1. Data residency — The Anthropic API serves from US regions by default. If your contract requires Canadian or EU residency, route via Amazon Bedrock (Canada Central / Frankfurt) or Vertex AI (`europe-west`). Bedrock and Vertex add ~3-8% to base rates but solve the residency requirement without requiring a separate processor agreement.

2. Sub-processor chain — Anthropic publishes its sub-processor list. Loi 25 art.18 requires you to maintain a register of all sub-processors handling personal information; Bedrock/Vertex routing collapses this list to your existing AWS/GCP DPA. Direct Anthropic API requires you to add Anthropic to your register and notify customers of the sub-processor change.

3. Audit log retention — Both Loi 25 and GDPR require you to be able to produce, on request, the trace of automated decisions made about a data subject. Your `llm_call` table (above) is the system of record. Retain it for at least 24 months in production; archive to cheap object storage (S3 Glacier, GCS Coldline) after that. Storing prompt and response text raw in the table is not required by either law and is generally a liability — store prompt_hash and reference the upstream business records instead.
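To illustrate the last point, the sketch below stores a keyed hash of the prompt plus a pointer to the business record, never the raw text. It is a hedged example, not the schema's actual implementation: `prompt_hash` as a function, `audit_row`, `business_record_id`, and the key handling are all hypothetical. A keyed HMAC is used here as an assumption, because a plain hash of short, guessable prompts can be reversed by brute force.

```python
import hashlib
import hmac
import unicodedata


def prompt_hash(prompt: str, key: bytes) -> str:
    """Return a stable keyed fingerprint of the prompt text."""
    # Normalize so visually identical prompts hash to the same value.
    normalized = unicodedata.normalize("NFC", prompt).strip()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def audit_row(prompt: str, business_record_id: str, key: bytes) -> dict:
    """Build the audit fields for one call without persisting raw prompt text."""
    return {
        "prompt_hash": prompt_hash(prompt, key),
        # Pointer to the upstream record that holds the personal data (hypothetical column).
        "business_record_id": business_record_id,
    }
```

Because the hash is deterministic for a given key, identical prompts still group together in the forensic queries, while the personal information stays in the business system that already governs it.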

A 14-Day Instrument-Then-Optimize Runbook

Most teams try to optimize before they instrument and end up guessing. The sequence below is what we recommend instead:

Day 1-2 — Deploy the `llm_call` schema and the wrapper. Wrap every Claude call. Verify rows arrive in the table. Do not change any prompts yet.
Day 3-4 — Run the three forensic SQL queries above against 48 hours of real data. Identify the top 3 prompt_hashes, the worst tenant cost trend, and the lowest cache_hit_pct features.
Day 5-7 — Address the worst single offender from each query: cap tool-call recursion if a prompt_hash is hot, contact the regressing tenant if cost is rising, fix the prefix structure if cache_hit_pct is low.
Day 8-10 — Migrate one feature from Sonnet to Haiku as a routing experiment. Measure quality with `pass@1` on your eval set, not subjective review. If quality holds, ship the route.
Day 11-13 — Move any non-interactive batch workload (overnight summaries, classification pipelines, embedding precompute) to the Batch API for the 50% discount.
Day 14 — Set up the EWMA z-score alert and the per-tenant trend query as a weekly automated report. Hand the report to your finance partner.

Teams that follow this sequence typically cut their bill 35-65% in the first 14 days, with most of the reduction coming from Days 5-7 (catching the offenders the schema reveals), not from negotiating better pricing or switching models.
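The Day 8-10 routing experiment needs an objective ship/no-ship rule. As a minimal sketch, assuming one sampled answer per eval case and a boolean pass/fail grade, `pass@1` is simply the fraction of cases that pass. The function names and the `max_drop` tolerance below are illustrative assumptions, not values from the audits described here.

```python
from typing import Sequence


def pass_at_1(results: Sequence[bool]) -> float:
    """Fraction of eval cases whose single sampled answer passed."""
    if not results:
        raise ValueError("eval set is empty")
    return sum(results) / len(results)


def should_ship_route(
    baseline: Sequence[bool], candidate: Sequence[bool], max_drop: float = 0.02
) -> bool:
    """Ship the cheaper model only if pass@1 drops by at most `max_drop`."""
    return pass_at_1(baseline) - pass_at_1(candidate) <= max_drop
```

Run both models over the same eval cases, grade each answer once, and feed the two result lists to `should_ship_route`. Pick the tolerance before looking at the numbers so the decision is not rationalized after the fact.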

Extended FAQ — 2026 Edition

Should I use the 5-minute or the 1-hour cache TTL? Use the 5-minute default for chat-style workloads where reuse happens within a single user session. Use the 1-hour TTL for retrieval-augmented systems where the same large context (e.g. a 50-page document) is reused by many users across an hour. The 1-hour TTL costs more on the write but pays off after about 4 reads within the window; below that, the 5-minute TTL is cheaper.

How do I pull cost data from the Anthropic Admin API? The Admin API exposes organization-level usage and billing. Authenticate with an admin-scoped key (separate from your runtime key), then poll the usage report (`/v1/organizations/usage_report/messages`) and cost report (`/v1/organizations/cost_report`) endpoints daily. The Admin API does not expose per-request cost; for that you need the wrapper above. Treat the Admin API as the reconciliation source of truth and your local table as the operational system of record.

My Bedrock bill shows different numbers than Anthropic's table — why? Bedrock applies a marketplace markup (typically 3-8%), bills in your AWS account currency with FX conversion, and meters tokens slightly differently for prompt caching (cache is supported on Bedrock as of 2025 but with separate SKUs). Reconcile by feature/tenant, not by line-item.

Can I claim a refund if Anthropic returned a 5xx and I was billed? Anthropic does not bill for failed responses (429 or 5xx, including 503). If your client retries, the retry is billable. If you see billed requests with 5xx in your logs, open a ticket; the charge is corrected through reconciliation, not refunded automatically.

How do I price a per-tenant chargeback margin? Most teams settle on a 30-50% margin over wholesale token cost: 30% covers infrastructure + observability + retries, the additional 20% covers the engineering cost of supporting the LLM feature itself. Below 30% margin you lose money on heavy users; above 50% you become uncompetitive on cost-conscious procurement.

Does prompt-caching work with tool use? Yes — cache the system prompt and tool descriptions block. The tools block is usually the largest static prefix in an agent (often 5-15k tokens of JSON schemas), so caching it has the highest savings-to-effort ratio of any single change. Mark it with `cache_control: {"type": "ephemeral"}` and do not change tool descriptions across requests within the TTL.

What's the cheapest way to estimate cost in a CI/CD test? Run a representative trace through the agent in CI with a `dry_run=true` flag that swaps the real Anthropic client for one that uses `count_tokens` only (Anthropic's `count_tokens` endpoint is free). Multiply by the per-million-token rate for an estimate, then add 30% for tool-recursion and 20% for retries to get a realistic ceiling.
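The arithmetic behind that ceiling can be sketched as follows. This is an illustration under assumptions: `estimate_cost_ceiling` is a hypothetical helper, the default rates are the Sonnet 4.6 prices from the table at the top of this guide, the two pads are treated as additive (a 1.5x multiplier), and `output_tokens` has to be your own budget or a measured average because token counting only covers the input side.

```python
def estimate_cost_ceiling(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_mtok: float = 3.00,    # USD per million input tokens
    output_rate_per_mtok: float = 15.00,  # USD per million output tokens
    tool_recursion_pad: float = 0.30,
    retry_pad: float = 0.20,
) -> float:
    """Return a padded per-request cost estimate in USD."""
    base = (
        input_tokens / 1_000_000 * input_rate_per_mtok
        + output_tokens / 1_000_000 * output_rate_per_mtok
    )
    # Pads are applied additively: 30% + 20% = 1.5x the base cost.
    return base * (1 + tool_recursion_pad + retry_pad)
```

For the support-bot turn from Scenario 1 (3,780 input tokens and 350 output tokens), the base cost is $0.01659 and the padded ceiling comes to roughly $0.0249 per turn.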

How do I audit a vendor that claims their AI feature is "free for Pro users"? Ask for the per-tenant cost-attribution table. If they cannot show it to you, the feature is unmetered and they are losing money on heavy users — which means either the feature gets capped silently, becomes a paid add-on, or gets pulled. None of these are signs of a healthy product to depend on.

For ongoing monitoring of Claude API cost, prompt-cache hit ratios, retry storms, and per-tenant attribution in production, start a free ClawPulse trial — the schema, wrapper, and three forensic queries above are pre-built. For a guided walkthrough on a representative workload, book a demo or compare ClawPulse plans.

For deeper coverage of related cost and observability patterns, see our practical guide to monitoring AI agent costs, how we instrument and track Claude API costs in real time, the rate-limit playbook for avoiding 429s, LLM cost comparison across providers, and our breakdown of monitoring vs evaluations.

For external authority on the patterns above, see Anthropic's pricing page, the official prompt caching docs, the Admin API reference, the OpenTelemetry GenAI semantic conventions, and the Google SRE workbook on alerting.