Most developers are paying full price for the same tokens over and over again. If your application sends a system prompt, few-shot examples, or context documents with every API call, you're burning money on tokens the provider has already processed.
Prompt caching fixes this. It's the single highest-impact optimization available today — cutting input costs by up to 90% and latency by up to 85%. Yet most developers either don't know about it or haven't implemented it because every provider does it differently.
This guide breaks down exactly how prompt caching works across OpenAI, Anthropic, and Google, with real cost calculations so you can see exactly how much you'll save.
What Is Prompt Caching?
Every time you make an API call, the provider processes your entire prompt from scratch — system instructions, few-shot examples, conversation history, everything. Prompt caching tells the provider: "You've seen this part before. Don't reprocess it."
The provider stores a processed version of your prompt prefix. On subsequent requests with the same prefix, it skips the expensive computation and charges you a fraction of the normal input price.
The key insight: If 80% of your prompt is the same across requests (system prompt + examples + context), you're paying 80% too much on input tokens for every single call.
How Much Can You Actually Save?
Let's do the math. Say you're running a customer support bot using Claude Sonnet 4.6 that sends:
- System prompt: 2,000 tokens
- Few-shot examples: 3,000 tokens
- Retrieved context: 4,000 tokens
- User message: 500 tokens
- Total per request: 9,500 input tokens
At 1,000 requests/day, that's 9.5 million input tokens/day.
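The arithmetic above is easy to reproduce as a sanity check (the $3/1M rate is the Sonnet input price quoted in this article):

```python
# Token breakdown per request (from the example above)
STATIC_TOKENS = 2_000 + 3_000 + 4_000   # system prompt + few-shot examples + context
DYNAMIC_TOKENS = 500                     # user message
REQUESTS_PER_DAY = 1_000

total_per_request = STATIC_TOKENS + DYNAMIC_TOKENS            # 9,500 tokens
daily_input_tokens = total_per_request * REQUESTS_PER_DAY     # 9,500,000 tokens/day

INPUT_PRICE_PER_M = 3.00  # $/1M input tokens
daily_cost = daily_input_tokens / 1_000_000 * INPUT_PRICE_PER_M

print(f"{daily_input_tokens:,} tokens/day -> ${daily_cost:.2f}/day, ${daily_cost * 30:.2f}/month")
# 9,500,000 tokens/day -> $28.50/day, $855.00/month
```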
Without Caching
| Component | Tokens/day | Cost (Sonnet 4.6 @ $3/1M) |
|---|---|---|
| All input tokens | 9,500,000 | $28.50/day |
| Monthly input cost (30 days) | 285,000,000 | $855.00 |
With Caching (Anthropic)
| Component | Tokens/day | Rate | Cost |
|---|---|---|---|
| Cached tokens (system + examples + context) | 9,000,000 | $0.30/1M (90% off) | $2.70/day |
| Cache write (only when the cache has gone cold; hits refresh the 5-min TTL, so at 1,000 req/day it rarely expires) | ~9,000 | $3.75/1M (1.25x) | ~$0.03/day |
| Uncached tokens (user messages) | 500,000 | $3.00/1M | $1.50/day |
| Monthly input cost | — | — | ~$130 |
That's an 85% reduction — from $855/month to ~$130/month on input tokens alone. For a single endpoint.
Stack this with the Batch API (50% off) for non-real-time workloads, and you're looking at costs below $65/month for the same volume.
Provider Comparison: How Each One Works
Here's the thing — every provider implements caching differently. Some require zero code changes; others need explicit markup. The discounts vary. The minimums vary. Here's the full breakdown:
| Feature | OpenAI | Anthropic | Google |
|---|---|---|---|
| Cache discount | 50% off input | 90% off input | 75% off input |
| Cache write cost | Free (1x) | 1.25x standard | 1x standard |
| Min cacheable tokens | 1,024 | 1,024 (Sonnet/Opus), 2,048 (Haiku) | 32,768 |
| TTL | ~5-10 min (auto) | 5 min (refreshes on hit) | 1 hour (manual) |
| Code changes needed | None | Yes (cache_control) | Yes (CachedContent) |
| Cache granularity | 128-token increments | Block-level | Whole context |
| Best for | Drop-in savings | Fine-grained control | Very large contexts |
OpenAI: Zero-Effort Caching
OpenAI made the smart move of making caching automatic. If your prompt starts with 1,024+ identical tokens across requests, you get cached pricing — no code changes needed.
```python
# Nothing special required — OpenAI caches automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": your_long_system_prompt},  # Cached automatically
        {"role": "user", "content": user_input},                 # Not cached
    ],
)

# Cache hit info is reported in the usage object:
# usage.prompt_tokens_details.cached_tokens tells you how many tokens hit the cache
```
What to know:
- Caching kicks in at 1,024 tokens minimum
- Cache matches in 128-token increments after the first 1,024
- Cached tokens cost 50% less ($1.25/1M instead of $2.50/1M for GPT-4o)
- No cache write surcharge — the first request costs the same as always
- Cache evicts automatically after ~5-10 minutes of inactivity
Pro tip: Structure your messages so the static content always comes first. System prompt → few-shot examples → context → user message. Any change in the prefix breaks the cache for everything after it.
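Because matches extend in 128-token increments after the first 1,024, the number of prefix tokens that can actually hit OpenAI's cache rounds down to a boundary. A small helper makes the rounding rule concrete (`openai_cacheable_tokens` is an illustrative function based on the rule described above, not part of any SDK):

```python
def openai_cacheable_tokens(prefix_tokens: int) -> int:
    """Estimate how many prefix tokens are eligible for OpenAI's cache.

    Below 1,024 tokens nothing is cached; above that, matches extend
    in 128-token increments, so we round down to the nearest boundary.
    """
    if prefix_tokens < 1024:
        return 0
    return 1024 + (prefix_tokens - 1024) // 128 * 128

print(openai_cacheable_tokens(900))    # 0 (under the minimum)
print(openai_cacheable_tokens(1024))   # 1024
print(openai_cacheable_tokens(1500))   # 1408 (1024 + 3 * 128)
```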
Anthropic: Maximum Control, Maximum Savings
Anthropic's caching requires explicit markup but delivers the deepest discount — 90% off cached reads. You mark specific blocks for caching using cache_control.
```python
response = client.messages.create(
    model="claude-sonnet-4-6-20260514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": large_context_document,
                    "cache_control": {"type": "ephemeral"},  # Cache this too
                },
                {
                    "type": "text",
                    "text": "What are the key findings?",  # Not cached
                },
            ],
        }
    ],
)

# Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens
```
What to know:
- Cached reads cost 10% of standard price (e.g., $0.30/1M vs $3.00/1M for Sonnet 4.6)
- Cache writes cost 25% more than standard (first time only)
- Cache TTL is 5 minutes, refreshed on each hit
- Minimum 1,024 tokens for Sonnet/Opus, 2,048 for Haiku
- You can cache up to 4 blocks per request
- Extended TTL (1 hour) available at 2x write cost
Pro tip: The 5-minute TTL means caching is most effective for applications with steady traffic. If you get at least one request every 5 minutes, your cache stays warm and you keep the 90% discount indefinitely.
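If traffic is borderline, it can be cheaper to send a tiny request just to refresh the TTL than to let the cache expire and pay the 1.25x write again. A rough break-even sketch, using the Sonnet rates quoted above and ignoring the ping's output tokens (`cheaper_to_keep_warm` is an illustrative helper, not an API):

```python
def cheaper_to_keep_warm(cached_tokens: int,
                         base_price_per_m: float = 3.00,
                         ping_overhead_tokens: int = 50) -> bool:
    """Compare re-writing the cache (1.25x standard) against a tiny
    keep-warm request that reads the cache at 10% and refreshes the TTL.

    Ignores the ping's output-token cost, so treat this as a rough guide.
    """
    rewrite_cost = cached_tokens * base_price_per_m * 1.25 / 1_000_000
    ping_cost = (cached_tokens * base_price_per_m * 0.10
                 + ping_overhead_tokens * base_price_per_m) / 1_000_000
    return ping_cost < rewrite_cost

print(cheaper_to_keep_warm(9_000))  # True: a ping costs ~$0.003 vs ~$0.034 to re-write
```

For the 9,000-token prefix in this article's example, a keep-warm ping costs roughly a tenth of a fresh cache write.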
Breaking update (March 14, 2026): Anthropic removed the long-context surcharge for Claude Opus 4.6 and Sonnet 4.6. Previously, requests over 200K tokens cost double. Now the standard rate applies at any context length — making caching even more valuable for large-context applications.
Google: Built for Large Contexts
Google's caching is designed for massive context windows (Gemini supports up to 2M tokens) but has the highest minimum token requirement.
```python
from google import genai
from google.genai import types

client = genai.Client()

# Step 1: Create a cached content object (a separate API call)
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        contents=[large_document_content],
        ttl="3600s",  # 1 hour
    ),
)

# Step 2: Reference the cache by name in subsequent requests
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the key points",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
```
What to know:
- Cached tokens cost 75% less than standard input pricing
- Minimum 32,768 tokens (32K) — much higher than OpenAI/Anthropic
- TTL defaults to 1 hour, configurable
- Requires explicit cache creation as a separate API call
- Supports Gemini 2.5 Pro and Flash models
Pro tip: Google's high minimum makes it ideal for RAG applications where you're sending large document chunks. If your context is under 32K tokens, use OpenAI or Anthropic's caching instead.
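The minimums from the comparison table give a quick eligibility check for any given prefix size (thresholds taken from the table above; Anthropic's minimum shown for Sonnet/Opus — the real choice also depends on model and discount):

```python
def cache_eligible_providers(prefix_tokens: int) -> list[str]:
    """Which providers' caches this prefix clears, per the minimums
    in the comparison table (Anthropic minimum shown for Sonnet/Opus)."""
    minimums = {"openai": 1_024, "anthropic": 1_024, "google": 32_768}
    return [name for name, floor in minimums.items() if prefix_tokens >= floor]

print(cache_eligible_providers(9_000))    # ['openai', 'anthropic']
print(cache_eligible_providers(40_000))   # ['openai', 'anthropic', 'google']
```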
Real-World Savings by Use Case
Here's what caching saves across common application patterns, using real March 2026 pricing:
| Use Case | Model | Daily Calls | Without Cache | With Cache | Monthly Savings |
|---|---|---|---|---|---|
| Customer support bot | Claude Sonnet 4.6 | 1,000 | $855/mo | $130/mo | $725 |
| Code review assistant | GPT-4o | 500 | $412/mo | $225/mo | $187 |
| Document Q&A (RAG) | Gemini 2.5 Pro | 2,000 | $2,250/mo | $675/mo | $1,575 |
| Content classifier | GPT-4o-mini | 5,000 | $67/mo | $38/mo | $29 |
| Legal document analyzer | Claude Opus 4.6 | 200 | $900/mo | $145/mo | $755 |
Total across all use cases: $3,271/month in savings. That's the kind of number that pays for your entire AI infrastructure team.
5 Rules for Maximizing Cache Hit Rates
Getting caching set up is the easy part. Getting high cache hit rates is where the real savings come from.
1. Put Static Content First, Always
Cache matching works on prefixes. If your system prompt comes first, it gets cached. If you put dynamic content before static content, you break the cache.
✅ System prompt → Few-shot examples → Context docs → User message
❌ User message → System prompt → Context docs
2. Standardize Your System Prompts
Even a single character difference breaks the cache. If your system prompt includes a timestamp, date, or request ID — remove it or move it to the user message.
❌ "You are a helpful assistant. Today is March 24, 2026. ..."
✅ "You are a helpful assistant. ..." + user message includes date
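One way to enforce this is to keep the system prompt as a frozen constant and inject anything time-sensitive into the user turn (a sketch; `build_messages` is a hypothetical helper, not part of any SDK):

```python
from datetime import date

SYSTEM_PROMPT = "You are a helpful assistant."  # frozen: identical bytes on every call

def build_messages(user_input: str) -> list[dict]:
    """Keep the cacheable prefix byte-identical; put volatile
    details (like today's date) in the user turn instead."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"(Today is {date.today().isoformat()}.)\n{user_input}"},
    ]

msgs = build_messages("Summarize my open tickets.")
assert msgs[0]["content"] == SYSTEM_PROMPT  # the prefix never changes
```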
3. Batch Similar Requests Together
If you have 50 classification tasks, send them in rapid succession. Each request refreshes the cache TTL, keeping it warm for the next one. Don't interleave different prompt templates.
4. Monitor Your Cache Hit Rate
You can't optimize what you can't measure. Every provider returns cache statistics in the response:
- OpenAI: `usage.prompt_tokens_details.cached_tokens`
- Anthropic: `usage.cache_read_input_tokens` and `usage.cache_creation_input_tokens`
- Google: cache hit/miss metadata in the response
Track your cache hit rate over time. If it drops below 70%, investigate why — you might have prompt drift, inconsistent formatting, or traffic gaps that let the cache expire.
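The hit-rate calculation itself is one line once you have the usage fields (OpenAI-style fields shown in the comment; the Anthropic fields work the same way):

```python
def cache_hit_rate(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction of input tokens served from cache for one request;
    aggregate across requests before comparing against the 70% target."""
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens

# e.g. from an OpenAI response:
# rate = cache_hit_rate(usage.prompt_tokens,
#                       usage.prompt_tokens_details.cached_tokens)
print(cache_hit_rate(9_500, 9_000))  # ~0.947, well above the 70% target
```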
Want to track cache hit rates alongside your total AI spend? AISpendGuard gives you per-call cost visibility with tag-based attribution — so you can see exactly which features are benefiting from caching and which are still paying full price.
5. Right-Size Your Cache TTL
- High-traffic endpoints (10+ req/min): Default TTL is fine — the cache stays warm naturally
- Medium-traffic endpoints (1-10 req/min): Consider Anthropic's extended TTL (1 hour at 2x write cost)
- Low-traffic endpoints (<1 req/min): Caching may not help — the cache expires between requests. Focus on model selection instead
Combining Caching with Other Optimizations
Prompt caching is powerful on its own, but it stacks with other techniques for compound savings:
| Strategy | Savings | Stacks With Caching? |
|---|---|---|
| Prompt caching | 50-90% on input | — |
| Batch API | 50% on input + output | Yes (up to 95% total) |
| Model downtier | 40-95% per call | Yes |
| Prompt trimming | 20-40% on input | Partially (less to cache) |
| Semantic caching (app-level) | 100% on cache hits | Yes (avoids API calls entirely) |
The most aggressive stack: Anthropic caching (90% off) + Batch API (50% off) = up to 95% reduction on input token costs. A call that would have cost $15 per million tokens now costs $0.75.
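Note that stacked discounts multiply rather than add: each one applies to the price left by the other. The $15-to-$0.75 figure above falls out directly:

```python
def stacked_price(base_per_m: float, cache_discount: float, batch_discount: float) -> float:
    """Effective $/1M input after stacking cache + batch discounts.
    Discounts multiply: each applies to the price the other leaves behind."""
    return base_per_m * (1 - cache_discount) * (1 - batch_discount)

# $15/1M base input, 90% cache discount, 50% batch discount:
print(round(stacked_price(15.00, 0.90, 0.50), 2))  # 0.75 -> a 95% total reduction
```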
When Caching Doesn't Help
Be honest about when caching isn't the right optimization:
- Short prompts (<1,024 tokens): Below the minimum threshold for all providers
- Highly dynamic prompts: If every request has a unique prefix, nothing gets cached
- Infrequent requests: If you're making fewer than 1 request per 5 minutes, the cache expires between calls
- Output-heavy workloads: Caching only reduces input costs. If your costs are dominated by output tokens (e.g., long-form generation), focus on model selection or output limits instead
For these cases, check out our other guides on model selection and batch API savings.
Start Tracking Your Savings
Here's the hard truth: you can implement caching perfectly and still not know if it's working. Provider dashboards show you total spend, but they don't tell you:
- Which features benefit most from caching
- Whether your cache hit rate is improving or degrading
- How caching savings compare across different models and endpoints
- What your actual cost-per-feature is after all optimizations
This is exactly what AISpendGuard was built for. Tag your API calls by feature, customer tier, or model — and see exactly where every dollar goes. No prompts stored, no gateway required, no latency added.
See how much you could save → Try the cost calculator
TL;DR
- Prompt caching cuts input costs by 50-90% depending on provider
- OpenAI does it automatically — just structure prompts with static content first
- Anthropic gives 90% off but requires `cache_control` markup
- Google needs 32K+ tokens — best for large-context RAG applications
- Stack with Batch API for up to 95% total savings
- Monitor your cache hit rate — aim for 70%+ to justify the cache write costs
- Track your actual savings with per-call attribution, not just aggregate dashboards
Start with the provider you use most. Structure your prompts correctly. Measure the results. The savings are real, immediate, and compound over time.
Start monitoring for free → Sign up for AISpendGuard