Guide · Mar 24, 2026 · 11 min read

Prompt Caching: The Single Change That Can Cut Your AI API Bill by 90%

OpenAI, Anthropic, and Google all offer prompt caching — but each works differently. Here's how to use them all, with real cost breakdowns.


Most developers are paying full price for the same tokens over and over again. If your application sends a system prompt, few-shot examples, or context documents with every API call, you're burning money on tokens the provider has already processed.

Prompt caching fixes this. It's the single highest-impact optimization available today — cutting input costs by up to 90% and latency by up to 85%. Yet most developers either don't know about it or haven't implemented it because every provider does it differently.

This guide breaks down exactly how prompt caching works across OpenAI, Anthropic, and Google, with real cost calculations so you can see exactly how much you'll save.

What Is Prompt Caching?

Every time you make an API call, the provider processes your entire prompt from scratch — system instructions, few-shot examples, conversation history, everything. Prompt caching tells the provider: "You've seen this part before. Don't reprocess it."

The provider stores a processed version of your prompt prefix. On subsequent requests with the same prefix, it skips the expensive computation and charges you a fraction of the normal input price.

The key insight: If 80% of your prompt is the same across requests (system prompt + examples + context), you're paying 80% too much on input tokens for every single call.

How Much Can You Actually Save?

Let's do the math. Say you're running a customer support bot using Claude Sonnet 4.6 that sends:

  • System prompt: 2,000 tokens
  • Few-shot examples: 3,000 tokens
  • Retrieved context: 4,000 tokens
  • User message: 500 tokens
  • Total per request: 9,500 input tokens

At 1,000 requests/day, that's 9.5 million input tokens/day.

Without Caching

| Component | Tokens/day | Cost (Sonnet 4.6 @ $3/1M) |
|---|---|---|
| All input tokens | 9,500,000 | $28.50/day |
| Monthly input cost | | $855.00 |

With Caching (Anthropic)

| Component | Tokens/day | Rate | Cost |
|---|---|---|---|
| Cached reads (system + examples + context) | 9,000,000 | $0.30/1M (90% off) | $2.70/day |
| Cache writes (~5 cold starts/day; steady traffic keeps the cache warm) | ~45,000 | $3.75/1M (1.25x) | ~$0.17/day |
| Uncached tokens (user messages) | 500,000 | $3.00/1M | $1.50/day |
| Monthly input cost | | | ~$130 |

That's an 85% reduction — from $855/month to ~$130/month on input tokens alone. For a single endpoint.
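The arithmetic above can be reproduced in a few lines of Python. The rates are the Sonnet 4.6 figures from the tables; the cold-start count is an assumption (at 1,000 requests/day, roughly one every 86 seconds, the 5-minute cache rarely expires):

```python
# Sanity-check the caching math (assumed figures from the tables above).
REQUESTS_PER_DAY = 1_000
CACHED_PREFIX = 9_000      # system + examples + context tokens per request
USER_MESSAGE = 500         # uncached tokens per request
BASE_RATE = 3.00           # $ per 1M input tokens
CACHE_READ_RATE = 0.30     # 90% off
CACHE_WRITE_RATE = 3.75    # 1.25x write surcharge
COLD_STARTS_PER_DAY = 5    # assumption: steady traffic keeps the cache warm

def monthly_cost_without_cache() -> float:
    tokens = (CACHED_PREFIX + USER_MESSAGE) * REQUESTS_PER_DAY
    return tokens / 1e6 * BASE_RATE * 30

def monthly_cost_with_cache() -> float:
    reads = CACHED_PREFIX * REQUESTS_PER_DAY / 1e6 * CACHE_READ_RATE
    writes = CACHED_PREFIX * COLD_STARTS_PER_DAY / 1e6 * CACHE_WRITE_RATE
    uncached = USER_MESSAGE * REQUESTS_PER_DAY / 1e6 * BASE_RATE
    return (reads + writes + uncached) * 30

print(f"Without caching: ${monthly_cost_without_cache():,.2f}/mo")  # $855.00/mo
print(f"With caching:    ${monthly_cost_with_cache():,.2f}/mo")     # ~$131/mo
```

Plug in your own token counts and traffic to see where your break-even sits.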

Stack this with the Batch API (50% off) for non-real-time workloads, and you're looking at costs below $65/month for the same volume.

Provider Comparison: How Each One Works

Here's the thing — every provider implements caching differently. Some require zero code changes; others need explicit markup. The discounts vary. The minimums vary. Here's the full breakdown:

| Feature | OpenAI | Anthropic | Google |
|---|---|---|---|
| Cache discount | 50% off input | 90% off input | 75% off input |
| Cache write cost | Free (1x) | 1.25x standard | 1x standard |
| Min cacheable tokens | 1,024 | 1,024 (Sonnet/Opus), 2,048 (Haiku) | 32,768 |
| TTL | ~5-10 min (auto) | 5 min (refreshes on hit) | 1 hour (manual) |
| Code changes needed | None | Yes (cache_control) | Yes (CachedContent) |
| Cache granularity | 128-token increments | Block-level | Whole context |
| Best for | Drop-in savings | Fine-grained control | Very large contexts |

OpenAI: Zero-Effort Caching

OpenAI made the smart move of making caching automatic. If your prompt starts with 1,024+ identical tokens across requests, you get cached pricing — no code changes needed.

# Nothing special required — OpenAI caches automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": your_long_system_prompt},  # Cached automatically
        {"role": "user", "content": user_input}  # Not cached
    ]
)

# Cache info comes back in the usage object of the response, not the headers:
# response.usage.prompt_tokens_details.cached_tokens tells you how many hit cache

What to know:

  • Caching kicks in at 1,024 tokens minimum
  • Cache matches in 128-token increments after the first 1,024
  • Cached tokens cost 50% less ($1.25/1M instead of $2.50/1M for GPT-4o)
  • No cache write surcharge — the first request costs the same as always
  • Cache evicts automatically after ~5-10 minutes of inactivity

Pro tip: Structure your messages so the static content always comes first. System prompt → few-shot examples → context → user message. Any change in the prefix breaks the cache for everything after it.
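One way to enforce that ordering is to assemble messages through a single helper so dynamic content can never slip ahead of the static prefix. A minimal sketch (the function and its arguments are illustrative, not part of the OpenAI SDK):

```python
def build_messages(system_prompt, few_shot_examples, context_docs, user_input):
    """Assemble messages with the static, cacheable prefix first."""
    messages = [{"role": "system", "content": system_prompt}]
    # Few-shot examples are identical across requests, so they extend the prefix.
    for example in few_shot_examples:
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    # Context changes less often than the user message, so it goes ahead of it.
    messages.append({
        "role": "user",
        "content": f"Context:\n{context_docs}\n\nQuestion: {user_input}",
    })
    return messages
```

Route every call site through a helper like this and the cacheable prefix stays byte-identical by construction.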

Anthropic: Maximum Control, Maximum Savings

Anthropic's caching requires explicit markup but delivers the deepest discount — 90% off cached reads. You mark specific blocks for caching using cache_control.

response = client.messages.create(
    model="claude-sonnet-4-6-20260514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,
            "cache_control": {"type": "ephemeral"}  # Cache this block
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": large_context_document,
                    "cache_control": {"type": "ephemeral"}  # Cache this too
                },
                {
                    "type": "text",
                    "text": "What are the key findings?"  # Not cached
                }
            ]
        }
    ]
)

# Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens

What to know:

  • Cached reads cost 10% of standard price (e.g., $0.30/1M vs $3.00/1M for Sonnet 4.6)
  • Cache writes cost 25% more than standard (first time only)
  • Cache TTL is 5 minutes, refreshed on each hit
  • Minimum 1,024 tokens for Sonnet/Opus, 2,048 for Haiku
  • You can cache up to 4 blocks per request
  • Extended TTL (1 hour) available at 2x write cost

Pro tip: The 5-minute TTL means caching is most effective for applications with steady traffic. If you get at least one request every 5 minutes, your cache stays warm and you keep the 90% discount indefinitely.

Breaking update (March 14, 2026): Anthropic removed the long-context surcharge for Claude Opus 4.6 and Sonnet 4.6. Previously, requests over 200K tokens cost double. Now the standard rate applies at any context length — making caching even more valuable for large-context applications.

Google: Built for Large Contexts

Google's caching is designed for massive context windows (Gemini supports up to 2M tokens) but has the highest minimum token requirement.

from google import genai
from google.genai import types

client = genai.Client()

# Step 1: Create a cached content object (a separate API call)
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        contents=[large_document_content],
        ttl="3600s",  # 1 hour
    ),
)

# Step 2: Reference the cache by name in subsequent requests
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the key points",
    config=types.GenerateContentConfig(cached_content=cache.name),
)

What to know:

  • Cached tokens cost 75% less than standard input pricing
  • Minimum 32,768 tokens (32K) — much higher than OpenAI/Anthropic
  • TTL defaults to 1 hour, configurable
  • Requires explicit cache creation as a separate API call
  • Supports Gemini 2.5 Pro and Flash models

Pro tip: Google's high minimum makes it ideal for RAG applications where you're sending large document chunks. If your context is under 32K tokens, use OpenAI or Anthropic's caching instead.
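The minimums from the comparison table can be turned into a quick eligibility check. A sketch (the helper itself is illustrative; the thresholds are the documented minimums discussed above):

```python
def caching_options(prefix_tokens: int) -> list[str]:
    """Return which providers' caches a static prefix is large enough for."""
    minimums = {
        "openai": 1_024,
        "anthropic": 1_024,   # 2,048 for Haiku
        "google": 32_768,
    }
    return [provider for provider, minimum in minimums.items()
            if prefix_tokens >= minimum]
```

For example, a 5,000-token prefix qualifies for OpenAI and Anthropic caching but falls well short of Google's 32K floor.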

Real-World Savings by Use Case

Here's what caching saves across common application patterns, using real March 2026 pricing:

| Use Case | Model | Daily Calls | Without Cache | With Cache | Monthly Savings |
|---|---|---|---|---|---|
| Customer support bot | Claude Sonnet 4.6 | 1,000 | $855/mo | $130/mo | $725 |
| Code review assistant | GPT-4o | 500 | $412/mo | $225/mo | $187 |
| Document Q&A (RAG) | Gemini 2.5 Pro | 2,000 | $2,250/mo | $675/mo | $1,575 |
| Content classifier | GPT-4o-mini | 5,000 | $67/mo | $38/mo | $29 |
| Legal document analyzer | Claude Opus 4.6 | 200 | $900/mo | $145/mo | $755 |

Total across all use cases: $3,271/month in savings. That's the kind of number that pays for your entire AI infrastructure team.

5 Rules for Maximizing Cache Hit Rates

Getting caching set up is the easy part. Getting high cache hit rates is where the real savings come from.

1. Put Static Content First, Always

Cache matching works on prefixes. If your system prompt comes first, it gets cached. If you put dynamic content before static content, you break the cache.

✅ System prompt → Few-shot examples → Context docs → User message
❌ User message → System prompt → Context docs

2. Standardize Your System Prompts

Even a single character difference breaks the cache. If your system prompt includes a timestamp, date, or request ID — remove it or move it to the user message.

❌ "You are a helpful assistant. Today is March 24, 2026. ..."
✅ "You are a helpful assistant. ..." + user message includes date

3. Batch Similar Requests Together

If you have 50 classification tasks, send them in rapid succession. Each request refreshes the cache TTL, keeping it warm for the next one. Don't interleave different prompt templates.
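A minimal sketch of that batching idea: group tasks by prompt template before dispatching, so every request in a run shares (and refreshes) the same cached prefix. The `template_id` field is an assumption about how you tag your tasks:

```python
from itertools import groupby

def order_for_cache_warmth(tasks: list[dict]) -> list[list[dict]]:
    """Sort tasks so requests sharing a prompt template run back to back."""
    tasks = sorted(tasks, key=lambda t: t["template_id"])
    # Each consecutive run now reuses one cache entry instead of evicting
    # another template's entry between requests.
    return [list(group) for _, group in groupby(tasks, key=lambda t: t["template_id"])]
```

Dispatch each group as a burst and the TTL refreshes on every hit within the run.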

4. Monitor Your Cache Hit Rate

You can't optimize what you can't measure. Every provider returns cache statistics in the response:

  • OpenAI: usage.prompt_tokens_details.cached_tokens
  • Anthropic: usage.cache_read_input_tokens and usage.cache_creation_input_tokens
  • Google: Cache hit/miss metadata in the response

Track your cache hit rate over time. If it drops below 70%, investigate why — you might have prompt drift, inconsistent formatting, or traffic gaps that let the cache expire.
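The ratio itself is simple to compute. A sketch using Anthropic's field names, operating on a plain dict of the usage values (the same read-over-eligible ratio works for OpenAI's `cached_tokens` against `prompt_tokens`):

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of cache-eligible input tokens served from cache (Anthropic fields)."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    eligible = read + written
    return read / eligible if eligible else 0.0
```

Log this per request and alert when the rolling average dips below your 70% target.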

Want to track cache hit rates alongside your total AI spend? AISpendGuard gives you per-call cost visibility with tag-based attribution — so you can see exactly which features are benefiting from caching and which are still paying full price.

5. Right-Size Your Cache TTL

  • High-traffic endpoints (10+ req/min): Default TTL is fine — the cache stays warm naturally
  • Medium-traffic endpoints (1-10 req/min): Consider Anthropic's extended TTL (1 hour at 2x write cost)
  • Low-traffic endpoints (<1 req/min): Caching may not help — the cache expires between requests. Focus on model selection instead
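Those cutoffs translate directly into a tiny decision helper (the thresholds mirror the list above; `requests_per_min` is whatever your metrics report):

```python
def cache_strategy(requests_per_min: float) -> str:
    """Pick a caching approach from observed traffic, per the thresholds above."""
    if requests_per_min >= 10:
        return "default TTL"       # cache stays warm naturally
    if requests_per_min >= 1:
        return "extended 1h TTL"   # e.g. Anthropic's 2x-write option
    return "skip caching"          # cache expires between requests
```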

Combining Caching with Other Optimizations

Prompt caching is powerful on its own, but it stacks with other techniques for compound savings:

| Strategy | Savings | Stacks With Caching? |
|---|---|---|
| Prompt caching | 50-90% on input | (baseline) |
| Batch API | 50% on input + output | Yes (up to 95% total) |
| Model downtier | 40-95% per call | Yes |
| Prompt trimming | 20-40% on input | Partially (less to cache) |
| Semantic caching (app-level) | 100% on cache hits | Yes (avoids API calls entirely) |

The most aggressive stack: Anthropic caching (90% off) + Batch API (50% off) = up to 95% reduction on input token costs. A call that would have cost $15 per million tokens now costs $0.75.
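The discounts compound multiplicatively: caching leaves 10% of the price, and the Batch API halves what remains.

```python
base = 15.00                       # $/1M input tokens (Opus-class pricing)
after_cache = base * (1 - 0.90)    # 90% caching discount leaves $1.50
after_batch = after_cache * 0.50   # 50% Batch API discount leaves $0.75
```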

When Caching Doesn't Help

Be honest about when caching isn't the right optimization:

  • Short prompts (<1,024 tokens): Below the minimum threshold for all providers
  • Highly dynamic prompts: If every request has a unique prefix, nothing gets cached
  • Infrequent requests: If you're making fewer than 1 request per 5 minutes, the cache expires between calls
  • Output-heavy workloads: Caching only reduces input costs. If your costs are dominated by output tokens (e.g., long-form generation), focus on model selection or output limits instead

For these cases, check out our other guides on model selection and batch API savings.

Start Tracking Your Savings

Here's the hard truth: you can implement caching perfectly and still not know if it's working. Provider dashboards show you total spend, but they don't tell you:

  • Which features benefit most from caching
  • Whether your cache hit rate is improving or degrading
  • How caching savings compare across different models and endpoints
  • What your actual cost-per-feature is after all optimizations

This is exactly what AISpendGuard was built for. Tag your API calls by feature, customer tier, or model — and see exactly where every dollar goes. No prompts stored, no gateway required, no latency added.

See how much you could save → Try the cost calculator

TL;DR

  1. Prompt caching cuts input costs by 50-90% depending on provider
  2. OpenAI does it automatically — just structure prompts with static content first
  3. Anthropic gives 90% off but requires cache_control markup
  4. Google needs 32K+ tokens — best for large-context RAG applications
  5. Stack with Batch API for up to 95% total savings
  6. Monitor your cache hit rate — aim for 70%+ to justify the cache write costs
  7. Track your actual savings with per-call attribution, not just aggregate dashboards

Start with the provider you use most. Structure your prompts correctly. Measure the results. The savings are real, immediate, and compound over time.

Start monitoring for free → Sign up for AISpendGuard


Want to track your AI spend automatically?

AISpendGuard detects waste patterns, breaks down costs by feature, and recommends specific changes with $/mo savings estimates.