What is prompt caching?

Prompt caching lets you reuse the pre-computed attention cache for a stable prefix of your prompt instead of recomputing it on every call. When a model processes a prompt it builds a key-value (KV) attention cache for all input tokens; normally that work is thrown away after each request. Caching stores the KV cache for a marked prefix, so later calls with the identical prefix skip the recomputation and only pay a reduced cache-read rate for those tokens.

How is a cache write priced differently from a cache read?

A cache write costs more than a normal input token because you pay a premium to store the prefix; a cache read costs much less because you skip the recomputation. As a dated example, Anthropic's published rates put a cache write at 125% of the standard input rate and a cache read at 10% of it, while OpenAI applies roughly a 50% discount on cached prefixes. The exact multipliers are provider-specific and change over time, so always confirm against current pricing pages before modeling savings.

What is a cache hit rate and why does it matter?

Cache hit rate is the share of your calls that land on an already-stored prefix (a cache read) rather than triggering a new cache write. It is the single biggest driver of whether caching saves money: at a high hit rate most calls pay the cheap read rate, but at a low hit rate you keep paying the write premium with few reads to offset it. Hit rate depends on how often you call relative to the cache's time-to-live and on how stable your prefix is across requests.

Which use cases benefit most from prompt caching?

Applications with a large, stable prefix that is reused across many calls benefit most: customer-service bots carrying a big product or policy knowledge base, document Q&A where one long document is queried many times, code assistants holding a fixed codebase context, agents with stable system prompts and tool definitions, and RAG setups where the system prompt and shared context stay constant while only the user query changes. In each case the expensive prefix is paid for once at the write rate and then read cheaply.

When does prompt caching not help?

Caching provides little or no benefit when the cacheable prefix changes on every call or the prefix is reused too rarely to recover the write premium. Highly personalized prompts that put user-specific data up front, one-off batch processing where each document is seen once, system prompts below the minimum cacheable size, and low-volume apps that call less often than the cache time-to-live all fall into this bucket. In the low-volume case you can actually pay more than without caching, because nearly every call becomes a cache write.

Does prompt caching reduce the cost of user-specific content?

No — caching only discounts the stable prefix portion of the prompt, not the parts that vary per request. The user's question, per-user data, and any uniquely retrieved content are still billed at standard input rates because they differ on every call and cannot be served from a stored cache. That is why caching is best understood as a discount on a shared system prompt or static context, not a blanket reduction on your whole token bill.

How can I tell whether my caching is actually working?

Measure it from the API response rather than assuming. Anthropic returns cache_read_input_tokens and cache_creation_input_tokens in the usage object so you can see how many tokens were served from cache versus written, and you should track cached versus uncached consumption separately in your cost dashboard. The most common silent failure is prefix drift — injecting a timestamp or other changing text ahead of the stable content makes the prefix differ byte-for-byte and turns every call into a cache miss.

Can prompt caching cost more than it saves?

Yes. Because a cache write is billed at a premium, an application that writes the cache often but rarely reads it will spend more than one with no caching at all. This is the failure mode for low-volume workloads where calls arrive less frequently than the cache's time-to-live, so the stored prefix expires before the next call reuses it. The way to avoid it is to model your calls-per-minute against the cache TTL to estimate hit rate before turning caching on.

Prompt Caching

Prompt caching ROI by use case: when it pays and when it doesn't

Updated May 23, 2026 · Byron Malone

Prompt caching (Anthropic: 10% rate on cache reads; OpenAI: 50% discount on cached prefixes) saves significant money for applications with large, stable system prompts. For a 10,000-token system prompt at 90% cache hit rate, Anthropic caching reduces system prompt costs by ~87%. The savings don’t apply to user-specific content — only the stable prefix portion.

How it’s calculated

Effective prefix rate (blended) =
    (hit_rate × cache_read_multiplier × base_rate)
  + ((1 − hit_rate) × cache_write_multiplier × base_rate)

Prefix savings vs base = 1 − (effective_prefix_rate ÷ base_rate)

where:
  base_rate              = standard input token rate ($/M tokens)
  cache_read_multiplier  = fraction of base_rate paid on a cache HIT
  cache_write_multiplier = fraction of base_rate paid on a cache MISS (write)
  hit_rate               = share of calls that reuse a stored prefix

Dated example (Anthropic rates, verify current pricing):
  base_rate              = $3.00 / M
  cache_read_multiplier  = 0.10   (10% of base)
  cache_write_multiplier = 1.25   (125% of base)
  hit_rate               = 0.90

  effective_prefix_rate  = 0.90 × 0.10 × $3.00 + 0.10 × 1.25 × $3.00
                         = $0.27 / M + $0.375 / M
                         = $0.645 / M
  prefix savings         = 1 − ($0.645 ÷ $3.00) ≈ 78.5%

Assumptions:the cache-read and cache-write multipliers and the cache time-to-live are provider-specific and change over time — the 0.10× read / 1.25× write figures above are an Anthropic-rates example, and OpenAI’s automatic discount (~50% off cached prefixes) uses a different model entirely. Verify the current numbers against each provider’s pricing page. Savings apply only to the stable prefix portion of the prompt, not to user-specific tokens, and the realized hit rate depends on how often you call relative to the cache TTL. Below a provider’s minimum cacheable block size, caching does not apply.

Math, assumptions, and the savings model are operationalized in the prompt-caching methodology and the open-source calculator source on GitHub (packages/calc).

How prompt caching works technically

When an LLM processes a prompt, it computes a key-value (KV) attention cache for all input tokens. Normally this is discarded after each call — you pay to recompute it every time. Prompt caching stores the pre-computed KV cache for a specified prefix, so subsequent calls with the same prefix skip the recomputation and only pay for the cache read.

Anthropic's implementation (as of Claude 3 models): you mark content blocks with cache_control: {type: 'ephemeral'}. The cached block must be at least 1,024 tokens. The cache persists for 5 minutes and is refreshed with each access. Cache write: 125% of standard input rate (you pay a premium to store). Cache read: 10% of standard input rate (you pay only 1/10 of standard for subsequent calls).

OpenAI's implementation: automatic — no explicit marking required. Any prompt prefix of 1,024+ tokens that appears in multiple requests to the same model (within a session) is automatically cached at 50% off. Less granular control than Anthropic's explicit marking but simpler to implement.

The math: for Anthropic with 90% cache hit rate on a 10,000-token system prompt: Effective rate = 10% × $3/M (cache reads) + 10% × $3.75/M (cache writes) = $0.30/M × 0.90 + $3.75/M × 0.10 = $0.27/M + $0.375/M = $0.645/M. Compared to $3/M without caching — a 78.5% reduction in system prompt token cost.

High-ROI use cases for prompt caching

Applications where prompt caching delivers the highest ROI:

1. Customer service bots with large product catalogs or policy documents in the system prompt: a 15,000-token product knowledge base sent with every query is an ideal caching target. If you have 1,000 queries/hour with the same system prompt, 999 of those calls pay cache read rates instead of standard rates.

2. Document Q&A applications where the same document is queried multiple times: load a 50-page contract (30,000 tokens) once as a cached context, then run 20 different queries against it. The document context is paid for at cache read rates on queries 2-20.

3. Code assistants with large codebase context: inject 20,000 tokens of relevant code files into the context. Multiple developer queries during a coding session hit the cache.

4. Agents with large, stable system prompts defining their persona, tools, and instructions: the instructions portion is stable across all agent runs; the task-specific portion varies. Cache the instructions, pay standard rate only for the task-specific tokens.

5. RAG applications where the top-k retrieved chunks are identical across queries (shared knowledge base): cache the system prompt and any static context; only the user query and unique retrieved content is non-cached.

Low-ROI use cases: when caching doesn't help

Not all LLM applications benefit from caching:

1. Highly personalized prompts: if every call includes user-specific data (name, history, preferences, account information) in the cacheable prefix position, the prefix is different for every user and caching provides no benefit.

2. One-off processing tasks: batch document processing where each document is processed once and never queried again — no repeated access to the same cached prefix.

3. Small system prompts (<1,024 tokens): below the minimum cacheable block size (Anthropic) or unlikely to hit the cache threshold (OpenAI). The system prompt may not be worth the cache write overhead.

4. Low-volume applications (<1 call/5 minutes for Anthropic's 5-minute TTL): the cache expires before the next call, meaning every call is a cache write (125% rate) with no cache reads. You pay MORE than without caching.

5. Models that don't support caching: some smaller or newer model versions may not yet support caching. Verify provider documentation for each specific model version.

The Prompt Caching Savings Calculator models your cache hit rate based on calls per minute vs the cache TTL to avoid implementing caching for applications where it will cost more than it saves.

Implementation considerations and pitfalls

Getting prompt caching right in production:

1. Prefix stability: the cached prefix must be byte-for-byte identical across calls. Even a single character difference creates a cache miss and triggers a cache write. Common mistake: dynamically injecting the current date/time at the start of the system prompt before the stable content. Move the date injection to after the cacheable prefix.

2. Cache warming on cold start: when you first deploy or after a cache miss, the next call pays the cache write rate. Plan for this in high-concurrency scenarios — a burst of traffic after a cache expiry can create multiple parallel cache writes.

3. Anthropic multiblock caching: you can cache multiple blocks in a single message. Use this to cache a large system prompt AND a large document context independently — giving you more granular cache control.

4. Testing cache effectiveness: use the API response headers (Anthropic includes cache_read_input_tokens and cache_creation_input_tokens in the usage object) to verify your caching is working and measure actual cache hit rates. Don't assume caching is working — verify with production traffic data.

5. Cost accounting: track cached vs uncached token consumption separately in your cost dashboard so you can measure the actual ROI of caching overhead vs savings.

When caching actually pays off — an operator’s view

In my experience the deciding factor is almost never the headline read-rate discount — it’s call frequency against the cache time-to-live. I’ve found that teams fixate on the “10% on cache reads” number and forget you only get there after eating the write premium, so a low-traffic feature can quietly cost morewith caching on than off. Worked example: an internal tool with a 12,000-token system prompt that fires roughly once every fifteen minutes sits well outside a five-minute TTL, so nearly every call is a fresh cache write at the ~1.25× multiplier — about $45/M effective on the prefix versus $36/M with no caching at all. Put that same prompt behind a customer-facing endpoint doing a few calls per second and the hit rate jumps past 90%, the blended prefix rate falls to roughly $0.65/M, and caching becomes the single biggest line-item win available. Same prompt, opposite verdict — the traffic pattern, not the prompt, decides it (multipliers are dated examples; verify current provider pricing).

Frequently asked questions

By Byron MaloneLast verified May 2026 against Anthropic & OpenAI prompt-caching documentation

Founder & Editor, Bedrocka Tools

Try the calculator

This article pairs with the Prompt Caching Savings Calculator — which operationalizes the concepts above with your specific numbers.