Prompt Caching ROI: When Anthropic's 90% Discount Actually Pays Off

Last updated: 2026-05-21 · 11 min read

Part of the toksum.dev Guides series.

What prompt caching actually is

Every time you send a request to an LLM API, the provider computes key-value (KV) attention matrices for every token in your input. This computation is the most expensive part of processing your prompt, and it is repeated from scratch on every request even if 80% of your input is identical to the previous one. Prompt caching is a mechanism where the provider stores the computed KV matrices for a portion of your input, keyed to the exact token sequence, and reuses them on subsequent requests that share the same prefix. You pay a reduced rate for cache-hit tokens because the expensive computation has already been done.

The economics are significant. Anthropic charges $3.00/1M tokens for standard input on Claude 3.5 Sonnet, $3.75/1M for cache writes (a 25% premium for writing to cache), and $0.30/1M for cache reads — 10% of the standard input rate. OpenAI charges $2.50/1M for standard GPT-4o input and $1.25/1M for cache-hit tokens — a 50% reduction, applied automatically without any changes to your request format. Google's context caching on Gemini charges a storage fee per token per hour plus a reduced read rate, making the economics different and more complex to model.

The key insight is that prompt caching is a fixed cost (the write) amortized across variable-count reads. If you write a cache entry once and read it 1,000 times, the write cost becomes negligible per-read. If you write it and only read it twice before the cache expires, the economics may not be favorable. Understanding when caching pays off requires modeling your specific request pattern, not just comparing the per-token rates in isolation.

Anthropic specifics: explicit caching with cache_control

Anthropic's prompt caching is explicit — you must opt into it by adding a cache_control parameter to the portion of your prompt you want cached. In the Messages API, you add "cache_control": {"type": "ephemeral"} to the content block marking the end of the prefix you want the provider to cache up to that point. The cache is keyed to the exact token sequence from the beginning of the request up to the breakpoint.

The ephemeral cache has a 5-minute lifetime that resets with each hit. If your system is sending more than one request per 5 minutes that shares the same prefix, the cache stays warm continuously. For most production systems with even modest traffic, this is easily satisfied. The extended cache (available on Claude 3.5 Sonnet and newer models) has a 1-hour lifetime, which is useful for lower-traffic applications that still benefit from caching, or for batch workloads where requests are spread across a longer window.

Cache writes are charged at $3.75/1M — a 25% premium over the standard $3.00/1M input rate. This write premium is the cost of the infrastructure to store and serve the cached KV matrices. Once written, cache reads cost $0.30/1M. The break-even between caching and not caching is therefore simple to compute: you need enough cache reads to offset the additional write cost. At the Claude 3.5 Sonnet rates, the write premium is $0.75/1M tokens. Each cache read saves $2.70/1M tokens compared to a standard input read ($3.00 - $0.30). The write cost is recovered after just: $0.75 ÷ $2.70 ≈ 0.28 additional reads. In other words, if you send the same prefix more than once after the first cache write, caching is cheaper than not caching. The break-even is essentially the second request.

The minimum cacheable prefix is 1,024 tokens. Shorter prefixes fall back to standard input pricing silently — the API does not return an error; the cache_control parameter is simply ignored. Always verify that your system prompt or context is long enough to qualify, and check the response's usage metadata for cache_creation_input_tokens and cache_read_input_tokens fields to confirm the cache is actually being hit.

OpenAI's implicit caching on GPT-4o

OpenAI's approach to caching is the opposite of Anthropic's in terms of developer ergonomics: caching is automatic and requires no changes to your request format. OpenAI's infrastructure detects repeated prefixes at least 1,024 tokens long and applies the cached rate automatically. The cache-hit tokens appear in your API response's usage object as cached_tokens, and you are billed $1.25/1M for those tokens rather than $2.50/1M.

The implicit approach is easier to use but less transparent. You cannot force a cache write or verify cache status before sending requests. Cache lifetime is not formally documented for GPT-4o (OpenAI describes it as "a few minutes to a few hours" depending on server load and traffic patterns). This unpredictability makes OpenAI caching less reliable as a cost planning assumption — you may get cache hits during business hours when your prompts are frequently repeated, and miss during off-peak hours. For precise cost modeling, treat OpenAI cache hits as a probabilistic discount rather than a guaranteed one.

OpenAI does not charge a separate cache write fee, which simplifies the economics: any cache hit is a pure saving with no associated cost. The 50% discount on cache-hit tokens is straightforward. However, the savings are smaller in absolute terms than Anthropic's 90% discount for cache reads, which matters significantly at high token volumes. For a detailed comparison of how these rates affect real monthly bills, see the GPT-4o vs Claude 3.5 Sonnet cost comparison.

Worked example: 50K-token system prompt reused 100 times per day

This is the scenario where Anthropic prompt caching delivers its most dramatic ROI. Imagine a legal document analysis assistant with a 50,000-token system prompt containing jurisdiction-specific rules, output templates, and few-shot examples. The same system prompt goes to Claude 3.5 Sonnet on every request. The product receives 100 requests per day, each with a unique 2,000-token user query and producing a 1,500-token analysis.

Without caching (standard pricing):

With Anthropic prompt caching (ephemeral, 5-min lifetime, 100 req/day = ~4/hr = easily warm):

Monthly savings: $399.90/month — a 74.7% reduction in total cost from a single configuration change. The system prompt tokens go from being the dominant cost to nearly irrelevant, and the output tokens become the largest line item. The savings grow linearly with request volume: at 1,000 requests per day, the monthly savings would be approximately $3,500.

The extended cache (1-hour lifetime) is relevant for lower-traffic scenarios. If your product receives 10 requests per day spread unevenly throughout business hours, the ephemeral 5-minute cache will expire between requests and you will incur multiple cache writes per day. With the 1-hour cache, a single morning write can cover all requests in the same hour, reducing write costs significantly even at low traffic volumes.

Patterns that work well with caching

Long RAG context prefixes. Retrieval-augmented generation systems often prepend a large set of retrieved document chunks to every request. If you retrieve from a fixed knowledge base and the retrieved chunks are consistent across many requests (e.g., the same product documentation is retrieved for a high-traffic category of questions), the retrieved context can be cached. The ideal pattern is to structure your prompt so the static context comes first (and is marked for caching) and the user query comes at the end after the cache breakpoint.

Persona and rules headers. A long system prompt that defines the AI persona, output formatting rules, safety guidelines, and behavioral constraints is a natural caching target. These change rarely (perhaps weekly with product updates) and are identical across every user session. A 3,000-token persona-and-rules block cached across 10,000 daily requests saves $81/month on Claude 3.5 Sonnet at the cache read rate versus standard input — modest individually but meaningful when multiplied across a product portfolio.

Few-shot example libraries. Including 10–20 high-quality examples of the input-output format you want the model to follow improves output quality significantly. These examples can be 5,000–15,000 tokens. Caching a large few-shot library makes including extensive examples economically feasible, because you pay cache-read rates instead of standard input rates for them on every request.

Codebase context for coding assistants. AI coding tools that inject file trees, key function definitions, or API documentation into every request can cache that context. The codebase changes less frequently than the user's questions about it, making it an excellent caching target. See our migration guide for specifics on wiring cache_control into Anthropic SDK requests.

Patterns that do NOT work well with caching

Short prompts under 1,024 tokens. The minimum prefix size for Anthropic caching is 1,024 tokens. If your entire system prompt plus any fixed context is shorter than this, caching is unavailable. The API accepts the cache_control parameter but ignores it silently. Short prompts should rely on other cost levers: model selection, output length control, or batch API.

Single-turn requests with fully unique inputs. If every request is completely unique — no shared prefix, no reusable context — there is nothing to cache. Content generation from different prompts each time, one-shot document translation, or requests where the "system prompt" is derived from the user's request each time are all poor caching candidates. Caching requires a stable, repeated prefix.

High-churn system prompts. If your system prompt changes frequently — different instructions per user, per session, or per query — the cache will miss constantly and you will pay the cache write premium without collecting cache read savings. Personalized system prompts that incorporate user preferences or session history per-request are incompatible with effective caching unless you can separate the stable portion (global rules) from the variable portion (per-user customization) and cache only the former.

Very low traffic products. If you receive fewer than a few requests per hour, the ephemeral cache will expire between requests and you will pay a write fee on most requests rather than amortizing it. Evaluate the extended cache option if available, or calculate whether the write overhead exceeds the read savings at your specific request rate before enabling caching.

Frequently asked questions

How long does Anthropic keep a prompt cache alive?
Anthropic offers two cache lifetimes. The default ephemeral cache lasts 5 minutes from the last request that touched that cache entry. The extended cache (available on supported models) lasts 1 hour. Cache lifetime resets every time a request hits the same cache entry. If your request rate is high enough to keep the cache warm, the cache can stay alive indefinitely in practice.
Does OpenAI charge a cache write fee like Anthropic?
No. OpenAI applies caching automatically on eligible inputs and charges only the cache read rate ($1.25/1M tokens) when a cache hit occurs. There is no explicit cache write fee — OpenAI absorbs the write cost. Anthropic charges $3.75/1M for cache writes explicitly, which means you need to think about break-even point: the write cost is amortized across subsequent reads.
What is the minimum prefix size for Anthropic prompt caching?
Anthropic requires a minimum of 1,024 tokens before a cache_control breakpoint to qualify for caching. Shorter prefixes are not cached and fall back to standard input pricing. This threshold means prompt caching only pays off for long system prompts, large document contexts, or other substantial shared prefixes — it is not applicable to short chat messages.
Can prompt caching combine with the batch API discount?
Yes, on Anthropic. You can use cache_control breakpoints in batch requests submitted to the Message Batches API. The batch discount (50% off) applies to non-cached tokens, and the cache read rate ($0.30/1M) applies to cached tokens. These are the most favorable combined economics available on any major LLM API: a long cached system prompt plus batch pricing for the uncached portion.

Related

Guide

Batch API Explained

Combine batch discounts with prompt caching for the lowest per-token cost on any provider.

Compare

GPT-4o vs Claude 3.5 Sonnet

See how Anthropic's 90% cache read rate compares to OpenAI's 50% automatic cache discount.

Guide

Migrating from OpenAI to Anthropic

How to wire cache_control into Anthropic SDK requests during a production migration.