Skip to main content
LLM Math Pro

Prompt Caching

Prompt caching ROI by use case: when it pays and when it doesn't

Updated May 23, 2026 · Byron Malone

Prompt caching (Anthropic: 10% rate on cache reads; OpenAI: 50% discount on cached prefixes) saves significant money for applications with large, stable system prompts. For a 10,000-token system prompt at 90% cache hit rate, Anthropic caching reduces system prompt costs by ~87%. The savings don't apply to user-specific content — only the stable prefix portion.

Advertisement

How prompt caching works technically

When an LLM processes a prompt, it computes a key-value (KV) attention cache for all input tokens. Normally this is discarded after each call — you pay to recompute it every time. Prompt caching stores the pre-computed KV cache for a specified prefix, so subsequent calls with the same prefix skip the recomputation and only pay for the cache read.

Anthropic's implementation (as of Claude 3 models): you mark content blocks with cache_control: {type: 'ephemeral'}. The cached block must be at least 1,024 tokens. The cache persists for 5 minutes and is refreshed with each access. Cache write: 125% of standard input rate (you pay a premium to store). Cache read: 10% of standard input rate (you pay only 1/10 of standard for subsequent calls).

OpenAI's implementation: automatic — no explicit marking required. Any prompt prefix of 1,024+ tokens that appears in multiple requests to the same model (within a session) is automatically cached at 50% off. Less granular control than Anthropic's explicit marking but simpler to implement.

The math: for Anthropic with 90% cache hit rate on a 10,000-token system prompt: Effective rate = 10% × $3/M (cache reads) + 10% × $3.75/M (cache writes) = $0.30/M × 0.90 + $3.75/M × 0.10 = $0.27/M + $0.375/M = $0.645/M. Compared to $3/M without caching — a 78.5% reduction in system prompt token cost.

High-ROI use cases for prompt caching

Applications where prompt caching delivers the highest ROI:

1. Customer service bots with large product catalogs or policy documents in the system prompt: a 15,000-token product knowledge base sent with every query is an ideal caching target. If you have 1,000 queries/hour with the same system prompt, 999 of those calls pay cache read rates instead of standard rates.

2. Document Q&A applications where the same document is queried multiple times: load a 50-page contract (30,000 tokens) once as a cached context, then run 20 different queries against it. The document context is paid for at cache read rates on queries 2-20.

3. Code assistants with large codebase context: inject 20,000 tokens of relevant code files into the context. Multiple developer queries during a coding session hit the cache.

4. Agents with large, stable system prompts defining their persona, tools, and instructions: the instructions portion is stable across all agent runs; the task-specific portion varies. Cache the instructions, pay standard rate only for the task-specific tokens.

5. RAG applications where the top-k retrieved chunks are identical across queries (shared knowledge base): cache the system prompt and any static context; only the user query and unique retrieved content is non-cached.

Low-ROI use cases: when caching doesn't help

Not all LLM applications benefit from caching:

1. Highly personalized prompts: if every call includes user-specific data (name, history, preferences, account information) in the cacheable prefix position, the prefix is different for every user and caching provides no benefit.

2. One-off processing tasks: batch document processing where each document is processed once and never queried again — no repeated access to the same cached prefix.

3. Small system prompts (<1,024 tokens): below the minimum cacheable block size (Anthropic) or unlikely to hit the cache threshold (OpenAI). The system prompt may not be worth the cache write overhead.

4. Low-volume applications (<1 call/5 minutes for Anthropic's 5-minute TTL): the cache expires before the next call, meaning every call is a cache write (125% rate) with no cache reads. You pay MORE than without caching.

5. Models that don't support caching: some smaller or newer model versions may not yet support caching. Verify provider documentation for each specific model version.

The Prompt Caching Savings Calculator models your cache hit rate based on calls per minute vs the cache TTL to avoid implementing caching for applications where it will cost more than it saves.

Implementation considerations and pitfalls

Getting prompt caching right in production:

1. Prefix stability: the cached prefix must be byte-for-byte identical across calls. Even a single character difference creates a cache miss and triggers a cache write. Common mistake: dynamically injecting the current date/time at the start of the system prompt before the stable content. Move the date injection to after the cacheable prefix.

2. Cache warming on cold start: when you first deploy or after a cache miss, the next call pays the cache write rate. Plan for this in high-concurrency scenarios — a burst of traffic after a cache expiry can create multiple parallel cache writes.

3. Anthropic multiblock caching: you can cache multiple blocks in a single message. Use this to cache a large system prompt AND a large document context independently — giving you more granular cache control.

4. Testing cache effectiveness: use the API response headers (Anthropic includes cache_read_input_tokens and cache_creation_input_tokens in the usage object) to verify your caching is working and measure actual cache hit rates. Don't assume caching is working — verify with production traffic data.

5. Cost accounting: track cached vs uncached token consumption separately in your cost dashboard so you can measure the actual ROI of caching overhead vs savings.

Advertisement

By Last updated

Founder & Editor, Bedrocka Tools

Try the calculator

This article pairs with thePrompt Caching Savings Calculator — which operationalizes the concepts above with your specific numbers.

Primary sources cited