Q: How do I calculate my cache hit rate?

Cache hit rate = (calls that can use the cache) / (total calls). A call uses the cache if: (1) The cached prefix hasn't expired (5-minute TTL for Anthropic ephemeral cache). (2) The prefix is byte-for-byte identical to what was cached. For high-volume applications (>10 calls/minute), near-100% cache hit rate is achievable for the stable portion of your prompt. For low-volume applications (<1 call/5 minutes), the cache frequently expires and hit rates are much lower. This calculator takes the hit rate as a direct input so you can model the best case, the worst case, and the steady-state you actually observe.

Q: How much can prompt caching save on a large system prompt?

Example: Claude 3.5 Sonnet, 8,000-token system prompt, $3/M standard input rate. Without caching: 8,000 × $3/1,000,000 = $0.024 per call for the system prompt alone. With caching (90% hit rate): 10% of calls pay cache write ($3.75/M × 8,000 = $0.030); 90% pay cache read ($0.30/M × 8,000 = $0.0024). Blended cost per call = 0.10 × $0.030 + 0.90 × $0.0024 = $0.003 + $0.00216 = $0.00516. Savings vs uncached: $0.024 - $0.00516 = $0.01884 per call. At 10,000 calls/day: $188.40/day savings = $5,652/month.

Q: Does prompt caching affect response quality?

No. The cached KV values are mathematically equivalent to recomputing them from scratch — the model outputs identical distributions whether it reads from cache or recomputes. There is no quality degradation from caching. The only behavioral difference is latency: cache hits are faster (skip the input processing for the cached prefix) and cache writes are slightly slower (compute + store). For streaming applications, the time-to-first-token is faster on cache hits.

Q: How does Anthropic's caching compare to OpenAI's?

Anthropic (as of mid-2024): cache_control blocks in message content; 1,024 token minimum; 5-minute TTL; cache write = 125% of standard; cache read = 10% of standard. OpenAI (as of November 2024): automatic prompt caching for prompts 1,024+ tokens (no explicit control needed); 50% discount on cached input tokens; caches the longest prefix that is cached and matches the new request. Key difference: Anthropic requires explicit cache_control markers (giving developers precise control); OpenAI caches automatically (simpler but less control). Both offer substantial savings for appropriate workloads — model each with the calculator by setting the write/read multipliers to match the provider you're pricing.

Q: When is prompt caching NOT worth it?

Caching is a net loss when the prefix rarely repeats before the cache expires, because every miss pays a cache-write premium (125% of base for Anthropic) and you never collect the read discount. Concretely: at a 0% hit rate, caching a prefix costs MORE than not caching it. The break-even is governed by the blended prefix factor = hitRate × readMultiplier + (1 − hitRate) × writeMultiplier; caching only saves money while that factor is below 1.0. Short, unique, low-volume prompts — one-off classification calls, a personalized prompt that changes every request, anything under the 1,024-token minimum — should not be cached. Set the hit rate low in the calculator and watch the savings go negative.

Q: Why are prices and cache multipliers editable instead of preset?

Because provider pricing moves and a calculator that hard-codes a rate quietly returns wrong numbers the day a price changes. The example presets are dated and labeled 'verify current pricing' on purpose — they're a starting point, not an authority. The engine treats basePricePerM, outputPricePerM, the cache-write multiplier, and the cache-read multiplier as your inputs. The typical defaults (cache-write ≈ 1.25× base, cache-read ≈ 0.1× base, 90% hit rate) reflect common published economics, but you should paste your own model's current per-million rates from the provider's pricing page before trusting a forecast.

Question 1

What is prompt caching and how does it work technically?

Accepted Answer

When you make an LLM API call, the model must process (attend over) all input tokens to compute the key-value (KV) attention cache. Prompt caching stores this pre-computed cache on the provider's infrastructure so subsequent calls with the same prefix can skip the recomputation. Anthropic's implementation: add a cache_control: {type: 'ephemeral'} block to the content you want cached. The cache lasts 5 minutes (ephemeral) and resets the timer on each access. Cache writes cost 125% of standard input rate; cache reads cost 10% of standard rate. The minimum cacheable block is 1,024 tokens.

Question 2

What types of applications benefit most from prompt caching?

Accepted Answer

Applications with large, stable context that repeats across calls: (1) Customer service bots with large system prompts (product catalogs, policies, procedures). (2) Document Q&A where the document is large and the same document is queried multiple times. (3) Code assistants with large codebases injected into context. (4) Multi-turn conversations where conversation history grows but the system prompt is fixed. (5) RAG applications where the same retrieved chunks appear across multiple queries. Applications that don't benefit: one-off single calls, highly personalized prompts that change every call, short system prompts (<1,024 tokens).

Question 3

How do I calculate my cache hit rate?

Accepted Answer

Cache hit rate = (calls that can use the cache) / (total calls). A call uses the cache if: (1) The cached prefix hasn't expired (5-minute TTL for Anthropic ephemeral cache). (2) The prefix is byte-for-byte identical to what was cached. For high-volume applications (>10 calls/minute), near-100% cache hit rate is achievable for the stable portion of your prompt. For low-volume applications (<1 call/5 minutes), the cache frequently expires and hit rates are much lower. This calculator takes the hit rate as a direct input so you can model the best case, the worst case, and the steady-state you actually observe.

Question 4

How much can prompt caching save on a large system prompt?

Accepted Answer

Example: Claude 3.5 Sonnet, 8,000-token system prompt, $3/M standard input rate. Without caching: 8,000 × $3/1,000,000 = $0.024 per call for the system prompt alone. With caching (90% hit rate): 10% of calls pay cache write ($3.75/M × 8,000 = $0.030); 90% pay cache read ($0.30/M × 8,000 = $0.0024). Blended cost per call = 0.10 × $0.030 + 0.90 × $0.0024 = $0.003 + $0.00216 = $0.00516. Savings vs uncached: $0.024 - $0.00516 = $0.01884 per call. At 10,000 calls/day: $188.40/day savings = $5,652/month.

Question 5

Does prompt caching affect response quality?

Accepted Answer

No. The cached KV values are mathematically equivalent to recomputing them from scratch — the model outputs identical distributions whether it reads from cache or recomputes. There is no quality degradation from caching. The only behavioral difference is latency: cache hits are faster (skip the input processing for the cached prefix) and cache writes are slightly slower (compute + store). For streaming applications, the time-to-first-token is faster on cache hits.

Question 6

How does Anthropic's caching compare to OpenAI's?

Accepted Answer

Anthropic (as of mid-2024): cache_control blocks in message content; 1,024 token minimum; 5-minute TTL; cache write = 125% of standard; cache read = 10% of standard. OpenAI (as of November 2024): automatic prompt caching for prompts 1,024+ tokens (no explicit control needed); 50% discount on cached input tokens; caches the longest prefix that is cached and matches the new request. Key difference: Anthropic requires explicit cache_control markers (giving developers precise control); OpenAI caches automatically (simpler but less control). Both offer substantial savings for appropriate workloads — model each with the calculator by setting the write/read multipliers to match the provider you're pricing.

Question 7

When is prompt caching NOT worth it?

Accepted Answer

Caching is a net loss when the prefix rarely repeats before the cache expires, because every miss pays a cache-write premium (125% of base for Anthropic) and you never collect the read discount. Concretely: at a 0% hit rate, caching a prefix costs MORE than not caching it. The break-even is governed by the blended prefix factor = hitRate × readMultiplier + (1 − hitRate) × writeMultiplier; caching only saves money while that factor is below 1.0. Short, unique, low-volume prompts — one-off classification calls, a personalized prompt that changes every request, anything under the 1,024-token minimum — should not be cached. Set the hit rate low in the calculator and watch the savings go negative.

Question 8

Why are prices and cache multipliers editable instead of preset?

Accepted Answer

Because provider pricing moves and a calculator that hard-codes a rate quietly returns wrong numbers the day a price changes. The example presets are dated and labeled 'verify current pricing' on purpose — they're a starting point, not an authority. The engine treats basePricePerM, outputPricePerM, the cache-write multiplier, and the cache-read multiplier as your inputs. The typical defaults (cache-write ≈ 1.25× base, cache-read ≈ 0.1× base, 90% hit rate) reflect common published economics, but you should paste your own model's current per-million rates from the provider's pricing page before trusting a forecast.

Prompt Caching Savings Calculator

What this means

Worked example

Frequently asked questions

Show the math