Most developers are paying full price for the same tokens over and over again. If your application sends a system prompt, few-shot examples, or context documents with every API call, you're burning money on tokens the provider has already processed.
Prompt caching fixes this. It's the single highest-impact optimization available today — cutting input costs by up to 90% and latency by up to 85%. Yet most developers either don't know about it or haven't implemented it because every provider does it differently.
This guide breaks down exactly how prompt caching works across OpenAI, Anthropic, and Google, with real cost calculations so you can see exactly how much you'll save.
What Is Prompt Caching?
Every time you make an API call, the provider processes your entire prompt from scratch — system instructions, few-shot examples, conversation history, everything. Prompt caching tells the provider: "You've seen this part before. Don't reprocess it."
The provider stores a processed version of your prompt prefix. On subsequent requests with the same prefix, it skips the expensive computation and charges you a fraction of the normal input price.
The key insight: If 80% of your prompt is the same across requests (system prompt + examples + context), you're paying 80% too much on input tokens for every single call.
How Much Can You Actually Save?
Let's do the math. Say you're running a customer support bot using Claude Sonnet 4.6 that sends:
- System prompt: 2,000 tokens
- Few-shot examples: 3,000 tokens
- Retrieved context: 4,000 tokens
- User message: 500 tokens
- Total per request: 9,500 input tokens
At 1,000 requests/day, that's 9.5 million input tokens/day.
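The arithmetic above is easy to reproduce as a sanity check (the $3/1M rate is the Sonnet input price quoted in this article):

```python
# Token breakdown per request (from the example above)
STATIC_TOKENS = 2_000 + 3_000 + 4_000   # system prompt + few-shot examples + context
DYNAMIC_TOKENS = 500                     # user message
REQUESTS_PER_DAY = 1_000

total_per_request = STATIC_TOKENS + DYNAMIC_TOKENS            # 9,500 tokens
daily_input_tokens = total_per_request * REQUESTS_PER_DAY     # 9,500,000 tokens/day

INPUT_PRICE_PER_M = 3.00  # $/1M input tokens
daily_cost = daily_input_tokens / 1_000_000 * INPUT_PRICE_PER_M

print(f"{daily_input_tokens:,} tokens/day -> ${daily_cost:.2f}/day, ${daily_cost * 30:.2f}/month")
# 9,500,000 tokens/day -> $28.50/day, $855.00/month
```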
Without Caching
| Component | Tokens/day | Cost (Sonnet 4.6 @ $3/1M) |
|---|---|---|
| All input tokens | 9,500,000 | $28.50/day |
| Monthly input cost (30 days) | 285,000,000 | $855.00 |
With Caching (Anthropic)
| Component | Tokens/day | Rate | Cost |
|---|---|---|---|
| Cached tokens (system + examples + context) | 9,000,000 | $0.30/1M (90% off) | $2.70/day |
| Cache write (only when the cache has gone cold; hits refresh the 5-min TTL, so at 1,000 req/day it rarely expires) | ~9,000 | $3.75/1M (1.25x) | ~$0.03/day |
| Uncached tokens (user messages) | 500,000 | $3.00/1M | $1.50/day |
| Monthly input cost | — | — | ~$130 |
That's an 85% reduction — from $855/month to ~$130/month on input tokens alone. For a single endpoint.
Stack this with the Batch API (50% off) for non-real-time workloads, and you're looking at costs below $65/month for the same volume.
Provider Comparison: How Each One Works
Here's the thing — every provider implements caching differently. Some require zero code changes; others need explicit markup. The discounts vary. The minimums vary. Here's the full breakdown:
| Feature | OpenAI | Anthropic | Google |
|---|---|---|---|
| Cache discount | 50% off input | 90% off input | 75% off input |
| Cache write cost | Free (1x) | 1.25x standard | 1x standard |
| Min cacheable tokens | 1,024 | 1,024 (Sonnet/Opus), 2,048 (Haiku) | 32,768 |
| TTL | ~5-10 min (auto) | 5 min (refreshes on hit) | 1 hour (manual) |
| Code changes needed | None | Yes (cache_control) | Yes (CachedContent) |
| Cache granularity | 128-token increments | Block-level | Whole context |
| Best for | Drop-in savings | Fine-grained control | Very large contexts |
OpenAI: Zero-Effort Caching
OpenAI made the smart move of making caching automatic. If your prompt starts with 1,024+ identical tokens across requests, you get cached pricing — no code changes needed.
```python
# Nothing special required — OpenAI caches automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": your_long_system_prompt},  # Cached automatically
        {"role": "user", "content": user_input},                 # Not cached
    ],
)

# Cache hit info is reported in the usage object:
# usage.prompt_tokens_details.cached_tokens tells you how many tokens hit the cache
```
What to know:
- Caching kicks in at 1,024 tokens minimum
- Cache matches in 128-token increments after the first 1,024
- Cached tokens cost 50% less ($1.25/1M instead of $2.50/1M for GPT-4o)
- No cache write surcharge — the first request costs the same as always
- Cache evicts automatically after ~5-10 minutes of inactivity
Pro tip: Structure your messages so the static content always comes first. System prompt → few-shot examples → context → user message. Any change in the prefix breaks the cache for everything after it.
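Because matches extend in 128-token increments after the first 1,024, the number of prefix tokens that can actually hit OpenAI's cache rounds down to a boundary. A small helper makes the rounding rule concrete (`openai_cacheable_tokens` is an illustrative function based on the rule described above, not part of any SDK):

```python
def openai_cacheable_tokens(prefix_tokens: int) -> int:
    """Estimate how many prefix tokens are eligible for OpenAI's cache.

    Below 1,024 tokens nothing is cached; above that, matches extend
    in 128-token increments, so we round down to the nearest boundary.
    """
    if prefix_tokens < 1024:
        return 0
    return 1024 + (prefix_tokens - 1024) // 128 * 128

print(openai_cacheable_tokens(900))    # 0 (under the minimum)
print(openai_cacheable_tokens(1024))   # 1024
print(openai_cacheable_tokens(1500))   # 1408 (1024 + 3 * 128)
```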
Anthropic: Maximum Control, Maximum Savings
Anthropic's caching requires explicit markup but delivers the deepest discount — 90% off cached reads. You mark specific blocks for caching using cache_control.
```python
response = client.messages.create(
    model="claude-sonnet-4-6-20260514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": large_context_document,
                    "cache_control": {"type": "ephemeral"},  # Cache this too
                },
                {
                    "type": "text",
                    "text": "What are the key findings?",  # Not cached
                },
            ],
        }
    ],
)

# Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens
```
What to know:
- Cached reads cost 10% of standard price (e.g., $0.30/1M vs $3.00/1M for Sonnet 4.6)
- Cache writes cost 25% more than standard (first time only)
- Cache TTL is 5 minutes, refreshed on each hit
- Minimum 1,024 tokens for Sonnet/Opus, 2,048 for Haiku
- You can cache up to 4 blocks per request
- Extended TTL (1 hour) available at 2x write cost
Pro tip: The 5-minute TTL means caching is most effective for applications with steady traffic. If you get at least one request every 5 minutes, your cache stays warm and you keep the 90% discount indefinitely.
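If traffic is borderline, it can be cheaper to send a tiny request just to refresh the TTL than to let the cache expire and pay the 1.25x write again. A rough break-even sketch, using the Sonnet rates quoted above and ignoring the ping's output tokens (`cheaper_to_keep_warm` is an illustrative helper, not an API):

```python
def cheaper_to_keep_warm(cached_tokens: int,
                         base_price_per_m: float = 3.00,
                         ping_overhead_tokens: int = 50) -> bool:
    """Compare re-writing the cache (1.25x standard) against a tiny
    keep-warm request that reads the cache at 10% and refreshes the TTL.

    Ignores the ping's output-token cost, so treat this as a rough guide.
    """
    rewrite_cost = cached_tokens * base_price_per_m * 1.25 / 1_000_000
    ping_cost = (cached_tokens * base_price_per_m * 0.10
                 + ping_overhead_tokens * base_price_per_m) / 1_000_000
    return ping_cost < rewrite_cost

print(cheaper_to_keep_warm(9_000))  # True: a ping costs ~$0.003 vs ~$0.034 to re-write
```

For the 9,000-token prefix in this article's example, a keep-warm ping costs roughly a tenth of a fresh cache write.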
Breaking update (March 14, 2026): Anthropic removed the long-context surcharge for Claude Opus 4.6 and Sonnet 4.6. Previously, requests over 200K tokens cost double. Now the standard rate applies at any context length — making caching even more valuable for large-context applications.
Google: Built for Large Contexts
Google's caching is designed for massive context windows (Gemini supports up to 2M tokens) but has the highest minimum token requirement.
```python
from google import genai
from google.genai import types

client = genai.Client()

# Step 1: Create a cached content object (a separate API call)
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        contents=[large_document_content],
        ttl="3600s",  # 1 hour
    ),
)

# Step 2: Reference the cache by name in subsequent requests
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the key points",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
```
What to know:
- Cached tokens cost 75% less than standard input pricing
- Minimum 32,768 tokens (32K) — much higher than OpenAI/Anthropic
- TTL defaults to 1 hour, configurable
- Requires explicit cache creation as a separate API call
- Supports Gemini 2.5 Pro and Flash models
Pro tip: Google's high minimum makes it ideal for RAG applications where you're sending large document chunks. If your context is under 32K tokens, use OpenAI or Anthropic's caching instead.
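The minimums from the comparison table give a quick eligibility check for any given prefix size (thresholds taken from the table above; Anthropic's minimum shown for Sonnet/Opus — the real choice also depends on model and discount):

```python
def cache_eligible_providers(prefix_tokens: int) -> list[str]:
    """Which providers' caches this prefix clears, per the minimums
    in the comparison table (Anthropic minimum shown for Sonnet/Opus)."""
    minimums = {"openai": 1_024, "anthropic": 1_024, "google": 32_768}
    return [name for name, floor in minimums.items() if prefix_tokens >= floor]

print(cache_eligible_providers(9_000))    # ['openai', 'anthropic']
print(cache_eligible_providers(40_000))   # ['openai', 'anthropic', 'google']
```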
Real-World Savings by Use Case
Here's what caching saves across common application patterns, using real March 2026 pricing:
| Use Case | Model | Daily Calls | Without Cache | With Cache | Monthly Savings |
|---|---|---|---|---|---|
| Customer support bot | Claude Sonnet 4.6 | 1,000 | $855/mo | $130/mo | $725 |
| Code review assistant | GPT-4o | 500 | $412/mo | $225/mo | $187 |
| Document Q&A (RAG) | Gemini 2.5 Pro | 2,000 | $2,250/mo | $675/mo | $1,575 |
| Content classifier | GPT-4o-mini | 5,000 | $67/mo | $38/mo | $29 |
| Legal document analyzer | Claude Opus 4.6 | 200 | $900/mo | $145/mo | $755 |
Total across all use cases: $3,271/month in savings. That's the kind of number that pays for your entire AI infrastructure team.
5 Rules for Maximizing Cache Hit Rates
Getting caching set up is the easy part. Getting high cache hit rates is where the real savings come from.
1. Put Static Content First, Always
Cache matching works on prefixes. If your system prompt comes first, it gets cached. If you put dynamic content before static content, you break the cache.
✅ System prompt → Few-shot examples → Context docs → User message
❌ User message → System prompt → Context docs
2. Standardize Your System Prompts
Even a single character difference breaks the cache. If your system prompt includes a timestamp, date, or request ID — remove it or move it to the user message.
❌ "You are a helpful assistant. Today is March 24, 2026. ..."
✅ "You are a helpful assistant. ..." + user message includes date
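One way to enforce this is to keep the system prompt as a frozen constant and inject anything time-sensitive into the user turn (a sketch; `build_messages` is a hypothetical helper, not part of any SDK):

```python
from datetime import date

SYSTEM_PROMPT = "You are a helpful assistant."  # frozen: identical bytes on every call

def build_messages(user_input: str) -> list[dict]:
    """Keep the cacheable prefix byte-identical; put volatile
    details (like today's date) in the user turn instead."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"(Today is {date.today().isoformat()}.)\n{user_input}"},
    ]

msgs = build_messages("Summarize my open tickets.")
assert msgs[0]["content"] == SYSTEM_PROMPT  # the prefix never changes
```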
3. Batch Similar Requests Together
If you have 50 classification tasks, send them in rapid succession. Each request refreshes the cache TTL, keeping it warm for the next one. Don't interleave different prompt templates.
4. Monitor Your Cache Hit Rate
You can't optimize what you can't measure. Every provider returns cache statistics in the response:
- OpenAI: `usage.prompt_tokens_details.cached_tokens`
- Anthropic: `usage.cache_read_input_tokens` and `usage.cache_creation_input_tokens`
- Google: cache hit/miss metadata in the response
Track your cache hit rate over time. If it drops below 70%, investigate why — you might have prompt drift, inconsistent formatting, or traffic gaps that let the cache expire.
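The hit-rate calculation itself is one line once you have the usage fields (OpenAI-style fields shown in the comment; the Anthropic fields work the same way):

```python
def cache_hit_rate(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction of input tokens served from cache for one request;
    aggregate across requests before comparing against the 70% target."""
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens

# e.g. from an OpenAI response:
# rate = cache_hit_rate(usage.prompt_tokens,
#                       usage.prompt_tokens_details.cached_tokens)
print(cache_hit_rate(9_500, 9_000))  # ~0.947, well above the 70% target
```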
Want to track cache hit rates alongside your total AI spend? AISpendGuard gives you per-call cost visibility with tag-based attribution — so you can see exactly which features are benefiting from caching and which are still paying full price.
5. Right-Size Your Cache TTL
- High-traffic endpoints (10+ req/min): Default TTL is fine — the cache stays warm naturally
- Medium-traffic endpoints (1-10 req/min): Consider Anthropic's extended TTL (1 hour at 2x write cost)
- Low-traffic endpoints (<1 req/min): Caching may not help — the cache expires between requests. Focus on model selection instead
Combining Caching with Other Optimizations
Prompt caching is powerful on its own, but it stacks with other techniques for compound savings:
| Strategy | Savings | Stacks With Caching? |
|---|---|---|
| Prompt caching | 50-90% on input | — |
| Batch API | 50% on input + output | Yes (up to 95% total) |
| Model downtier | 40-95% per call | Yes |
| Prompt trimming | 20-40% on input | Partially (less to cache) |
| Semantic caching (app-level) | 100% on cache hits | Yes (avoids API calls entirely) |
The most aggressive stack: Anthropic caching (90% off) + Batch API (50% off) = up to 95% reduction on input token costs. A call that would have cost $15 per million tokens now costs $0.75.
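Note that stacked discounts multiply rather than add: each one applies to the price left by the other. The $15-to-$0.75 figure above falls out directly:

```python
def stacked_price(base_per_m: float, cache_discount: float, batch_discount: float) -> float:
    """Effective $/1M input after stacking cache + batch discounts.
    Discounts multiply: each applies to the price the other leaves behind."""
    return base_per_m * (1 - cache_discount) * (1 - batch_discount)

# $15/1M base input, 90% cache discount, 50% batch discount:
print(round(stacked_price(15.00, 0.90, 0.50), 2))  # 0.75 -> a 95% total reduction
```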
When Caching Doesn't Help
Be honest about when caching isn't the right optimization:
- Short prompts (<1,024 tokens): Below the minimum threshold for all providers
- Highly dynamic prompts: If every request has a unique prefix, nothing gets cached
- Infrequent requests: If you're making fewer than 1 request per 5 minutes, the cache expires between calls
- Output-heavy workloads: Caching only reduces input costs. If your costs are dominated by output tokens (e.g., long-form generation), focus on model selection or output limits instead
For these cases, check out our other guides on model selection and batch API savings.
Start Tracking Your Savings
Here's the hard truth: you can implement caching perfectly and still not know if it's working. Provider dashboards show you total spend, but they don't tell you:
- Which features benefit most from caching
- Whether your cache hit rate is improving or degrading
- How caching savings compare across different models and endpoints
- What your actual cost-per-feature is after all optimizations
This is exactly what AISpendGuard was built for. Tag your API calls by feature, customer tier, or model — and see exactly where every dollar goes. No prompts stored, no gateway required, no latency added.
See how much you could save → Try the cost calculator
TL;DR
- Prompt caching cuts input costs by 50-90% depending on provider
- OpenAI does it automatically — just structure prompts with static content first
- Anthropic gives 90% off but requires `cache_control` markup
- Google needs 32K+ tokens — best for large-context RAG applications
- Stack with Batch API for up to 95% total savings
- Monitor your cache hit rate — aim for 70%+ to justify the cache write costs
- Track your actual savings with per-call attribution, not just aggregate dashboards
Start with the provider you use most. Structure your prompts correctly. Measure the results. The savings are real, immediate, and compound over time.
Start monitoring for free → Sign up for AISpendGuard