Question 1

What is a realistic cost per monthly active user for an AI product?

Accepted Answer

Industry benchmarks (2025, AI-native products): Consumer AI apps: $0.50-3.00/MAU/month for moderate use of AI features. Business productivity tools: $2-8/MAU/month with heavy AI feature usage. Vertical AI copilots (legal, medical, code): $5-25/MAU/month. AI-first products where the LLM is the core value: $8-30/MAU/month. Sustainability check: if your AI costs exceed 30% of the gross margin contribution per user, unit economics are under strain. As a simple illustration, treating the full $20 of a $20/month subscription as that contribution gives a ceiling of 30% × $20 = $6 per user for AI costs, which is generous for most use cases but tight for heavy frontier-model usage. The cost-per-MAU figure this modeler computes is the number to put next to your per-user gross margin before you scale.
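The sustainability check above can be sketched in a few lines. This is only an illustration: the function names and the 0.30 default are assumptions taken from the rule of thumb in the answer, not part of the modeler itself.

```python
def ai_cost_ceiling(contribution_per_user: float, max_share: float = 0.30) -> float:
    """Monthly AI spend ceiling per user, as a share of per-user contribution.

    max_share=0.30 mirrors the 30% rule of thumb; it is an assumption, not a law.
    """
    return contribution_per_user * max_share


def within_budget(cost_per_mau: float, contribution_per_user: float,
                  max_share: float = 0.30) -> bool:
    """True when the modeled cost per MAU stays at or under the ceiling."""
    return cost_per_mau <= ai_cost_ceiling(contribution_per_user, max_share)


# Illustration: a $20/month subscription treated as the per-user contribution.
ceiling = ai_cost_ceiling(20.0)
print(f"ceiling: ${ceiling:.2f} per user per month")
print("moderate usage at $2.50/MAU fits:", within_budget(2.50, 20.0))
print("heavy usage at $8.00/MAU fits:", within_budget(8.00, 20.0))
```

Replace the 20.0 with your own per-user gross margin contribution once you know it; the subscription price is only a stand-in here.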

Question 2

How do I model LLM costs for a multi-step agent?

Accepted Answer

Agents make multiple LLM calls per user action, and each step adds cost. A 5-step research agent might: (1) Parse user intent (200 input + 50 output tokens). (2) Generate search queries (300 input + 100 output). (3) Synthesize 3 retrieved documents (5,000 input + 200 output). (4) Generate draft (5,000 input + 800 output). (5) Review and refine (6,000 input + 500 output). Total per agent run: 16,500 input + 1,650 output tokens. At a $3/M input + $15/M output price: (16,500/1M × $3) + (1,650/1M × $15) = $0.0495 + $0.0248 ≈ $0.074/run. At 1,000 runs/day that is about $74/day, or roughly $2,220 over a 30-day month. In this modeler, set steps-per-request to your average number of model calls and the token totals scale up automatically. It is the field that turns a naive single-call estimate into the real agent bill.
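The five-step arithmetic above can be reproduced with a short script. The step list and the $3/$15 prices come from the example; the function name and data layout are assumptions made for illustration.

```python
# (step name, input tokens, output tokens) for the 5-step research agent example.
STEPS = [
    ("parse user intent", 200, 50),
    ("generate search queries", 300, 100),
    ("synthesize retrieved documents", 5_000, 200),
    ("generate draft", 5_000, 800),
    ("review and refine", 6_000, 500),
]


def agent_run_cost(steps, input_price_per_1m: float, output_price_per_1m: float):
    """Return (total input tokens, total output tokens, dollar cost) for one run."""
    total_in = sum(tokens_in for _, tokens_in, _ in steps)
    total_out = sum(tokens_out for _, _, tokens_out in steps)
    cost = (total_in / 1_000_000 * input_price_per_1m
            + total_out / 1_000_000 * output_price_per_1m)
    return total_in, total_out, cost


total_in, total_out, cost = agent_run_cost(STEPS, 3.0, 15.0)
daily = cost * 1_000          # 1,000 runs per day
monthly = daily * 30          # assumes a 30-day month
print(f"{total_in} input + {total_out} output tokens per run")
print(f"${cost:.4f}/run, ${daily:.2f}/day, ${monthly:.2f}/month")
```

Unrounded, the run costs $0.07425, so the monthly figure comes out a few dollars above the rounded $2,220.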

Question 3

Why are token prices an input instead of built into the calculator?

Accepted Answer

Because LLM prices change faster than any static table can stay correct. Providers cut prices, launch cheaper tiers, and release new models monthly; a hard-coded rate would be wrong within weeks and would quietly hand you a wrong projection. So the engine takes input-price-per-1M and output-price-per-1M as arguments and computes from exactly the numbers you supply. The example presets on this page are dated and labeled. Treat them as a convenience, and always confirm the current figure on the provider's official pricing page before you rely on the result. This is also why the math is open source: you can read the file and confirm there is no hidden price assumption anywhere in it.
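A minimal sketch of what a price-as-argument engine could look like follows. It is not the modeler's actual source: the function name, parameter names, and the assumption that token counts are per model call (multiplied by steps-per-request) are all illustrative.

```python
def monthly_llm_cost(
    mau: int,
    requests_per_user: float,
    steps_per_request: float,
    input_tokens_per_step: float,
    output_tokens_per_step: float,
    input_price_per_1m: float,   # supplied by the caller, never hard-coded
    output_price_per_1m: float,  # supplied by the caller, never hard-coded
) -> dict:
    """Compute monthly, annual, per-user, and per-request cost from caller inputs."""
    if mau <= 0 or requests_per_user <= 0:
        raise ValueError("mau and requests_per_user must be positive")
    calls = mau * requests_per_user * steps_per_request
    input_cost = calls * input_tokens_per_step / 1_000_000 * input_price_per_1m
    output_cost = calls * output_tokens_per_step / 1_000_000 * output_price_per_1m
    monthly = input_cost + output_cost
    return {
        "monthly": monthly,
        "annual": monthly * 12,
        "per_user": monthly / mau,
        "per_request": monthly / (mau * requests_per_user),
    }


# Hypothetical profile: 1,000 MAU, 30 requests each, 5 calls per request.
result = monthly_llm_cost(1_000, 30, 5, 3_300, 330, 3.0, 15.0)
for name, value in result.items():
    print(f"{name}: ${value:,.4f}")
```

Because both prices are plain parameters, a price change means changing two arguments rather than editing the function.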

Question 4

How should I set usage limits to control costs?

Accepted Answer

Hard limits (enforce these in production): (1) Per-request max_tokens cap, which prevents runaway generation costs. (2) Per-user daily/monthly token budget, tracked in your database; return a graceful error when it is exceeded. (3) Provider-level spending caps: OpenAI, Anthropic, and Google all support hard spend limits and notifications. Soft controls: (1) Streaming with early termination once the key answer appears. (2) Model tiering by plan (a cheaper small model for the free tier, a frontier model for paid users). (3) Rate limiting per user. (4) Caching responses for identical or near-identical queries. A practical workflow: model your P50 usage here, then re-run with a 30-50% higher requests-per-user figure to see the headroom you need before a spending cap should trip.
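One way to picture the per-user token budget from hard limit (2) is the sketch below. The class and method names are hypothetical, and an in-memory dictionary stands in for the database a production system would use.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBudget:
    """Tracks tokens spent per user against a fixed monthly limit."""
    monthly_limit: int
    used: Dict[str, int] = field(default_factory=dict)  # stand-in for a DB table

    def try_spend(self, user_id: str, tokens: int) -> bool:
        """Record the spend and return True, or return False if it would exceed the limit."""
        spent = self.used.get(user_id, 0)
        if spent + tokens > self.monthly_limit:
            return False  # caller should return a graceful error, not call the model
        self.used[user_id] = spent + tokens
        return True

    def remaining(self, user_id: str) -> int:
        """Tokens the user can still spend this month."""
        return self.monthly_limit - self.used.get(user_id, 0)


budget = TokenBudget(monthly_limit=1_000)
first = budget.try_spend("user-1", 600)    # fits within the limit
second = budget.try_spend("user-1", 500)   # would exceed it, so it is refused
print(first, second, budget.remaining("user-1"))
```

A real implementation would also need a monthly reset and an atomic update so that concurrent requests cannot overspend; both are left out here for brevity.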

Question 5

What is a reasonable LLM cost-to-revenue ratio for a sustainable business?

Accepted Answer

AI cost as a percentage of revenue benchmarks: Consumer AI tools: target under 10% of revenue. B2B SaaS with AI features: target under 5-8% of revenue (AI is a feature, not the core). AI-first services: 15-30% is acceptable if the AI is the core value proposition and margins are otherwise high. Danger zone: more than 40% of revenue going to LLM API costs suggests fundamental unit-economics problems. Either the product uses too much AI per user action, pricing is too low, or you need to shift to self-hosted models. Take the annual cost this modeler computes, divide by your projected annual revenue, and you have the ratio to track. If it is climbing toward 40%, the lever is almost always tokens per request or steps per request, not the per-token price.
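The ratio described above is a single division; the sketch below adds the 40% danger-zone check. The function name and the sample cost and revenue figures are made up for illustration.

```python
DANGER_THRESHOLD = 0.40  # the "danger zone" share of revenue from the answer above


def llm_cost_ratio(annual_llm_cost: float, annual_revenue: float) -> float:
    """LLM spend as a fraction of revenue."""
    if annual_revenue <= 0:
        raise ValueError("annual_revenue must be positive")
    return annual_llm_cost / annual_revenue


# Hypothetical inputs: $26,730/year modeled LLM cost, 1,000 users paying $20/month.
ratio = llm_cost_ratio(26_730.0, 1_000 * 20.0 * 12)
in_danger = ratio > DANGER_THRESHOLD
print(f"LLM cost is {ratio:.1%} of revenue; danger zone: {in_danger}")
```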

Question 6

How do I budget for LLM evaluation and fine-tuning costs?

Accepted Answer

Often-missed LLM budget line items: (1) Evaluation runs: running automated evaluations on 500-1,000 sample prompts after model updates or prompt changes. At $3/M input with a 2,000-token average eval prompt, 500 samples come to 1M input tokens, or about $3 of input cost per eval cycle. That figure excludes output tokens and multiplies with every re-run, so budget by the number of cycles per month rather than by a single cycle. (2) Fine-tuning (if applicable): training-token charges plus an inference premium on the tuned model. (3) Embedding calls for vector search: typically a few cents per million tokens, which is very cheap but accumulates at scale. (4) Development and testing: 10-20% overhead on production volume for QA, prompt engineering, and A/B testing. This modeler covers the production-inference line; add these as separate line items, modeling each as its own usage profile with its own token counts and price.
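Two of these line items reduce to simple arithmetic, sketched below. With 500 prompts of 2,000 tokens each, one eval cycle consumes 1M input tokens. The function names, the 15% overhead midpoint, and the production figure are assumptions for illustration.

```python
def eval_cycle_input_cost(samples: int, tokens_per_prompt: int,
                          input_price_per_1m: float) -> float:
    """Input-token cost of one evaluation cycle (output tokens not included)."""
    return samples * tokens_per_prompt / 1_000_000 * input_price_per_1m


def dev_overhead(production_monthly_cost: float, share: float = 0.15) -> float:
    """Development and testing spend as a share of production cost (10-20% range)."""
    return production_monthly_cost * share


per_cycle = eval_cycle_input_cost(500, 2_000, 3.0)
monthly_evals = per_cycle * 20                 # hypothetical: 20 eval cycles a month
overhead = dev_overhead(2_227.50)              # hypothetical production bill
print(f"${per_cycle:.2f}/cycle, ${monthly_evals:.2f}/month in evals, "
      f"${overhead:.2f}/month dev overhead")
```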

Question 7

Can I use this to compare two models or providers?

Accepted Answer

Yes — that is one of the most useful workflows. Hold your usage profile constant (same MAU, requests, tokens in/out, steps) and run the modeler twice with the two providers' published per-1M prices. The monthly-cost, cost-per-user, and cost-per-request numbers give you a clean apples-to-apples comparison at your actual usage, not at a generic benchmark. Because output is usually priced 3-5× the input rate, output-heavy workloads (long generations) and input-heavy workloads (large retrieved contexts) can flip which provider is cheaper — running both numbers is the only reliable way to know.
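The flip between input-heavy and output-heavy workloads can be seen with two made-up price sheets. The provider names, prices, and token counts below are hypothetical and chosen only to show the effect; substitute the published per-1M prices you are comparing.

```python
def request_cost(tokens_in: int, tokens_out: int,
                 input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Dollar cost of one request at the given per-1M-token prices."""
    return (tokens_in / 1_000_000 * input_price_per_1m
            + tokens_out / 1_000_000 * output_price_per_1m)


# Hypothetical (input price, output price) per 1M tokens.
PROVIDERS = {"provider_a": (3.0, 15.0), "provider_b": (5.0, 10.0)}

# Hypothetical (input tokens, output tokens) per request.
WORKLOADS = {"input_heavy": (10_000, 200), "output_heavy": (500, 3_000)}


def cheapest(workload: str) -> str:
    """Name of the provider with the lowest per-request cost for this workload."""
    tokens_in, tokens_out = WORKLOADS[workload]
    return min(PROVIDERS,
               key=lambda name: request_cost(tokens_in, tokens_out, *PROVIDERS[name]))


for workload in WORKLOADS:
    print(workload, "->", cheapest(workload))
```

With these numbers the provider with the lower input price wins the input-heavy workload and loses the output-heavy one, which is why the usage profile has to be held constant across both runs.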

Question 8

Does the cost picture change at 10x scale?

Accepted Answer

Yes — several dynamics shift, though most are not in the raw arithmetic this modeler computes. (1) Caching effectiveness improves: more users hit the same cached prefixes, driving down the effective per-call cost. (2) Volume discounts become negotiable: enterprise agreements typically offer meaningful discounts off list at significant monthly spend. (3) Self-hosting becomes economic for high-volume, non-frontier tasks. (4) Architectural improvements compound: at scale, teams invest in retrieval optimization, chunking, and prompt compression that cut token counts 20-40%. To see your 10x line, multiply your MAU by ten and re-run; then re-run again with a lower tokens-per-request figure to model the prompt-compression savings you would invest in at that volume.
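The two re-runs suggested above can be approximated as follows. This sketch assumes cost scales linearly with MAU and that prompt compression trims input tokens only; caching, volume discounts, and self-hosting are not modeled. The function name and baseline figures are hypothetical.

```python
def scaled_monthly_cost(input_cost: float, output_cost: float,
                        scale: float = 10.0, input_token_cut: float = 0.0) -> float:
    """Monthly cost at `scale` times the users, with input tokens cut by a fraction."""
    if not 0.0 <= input_token_cut < 1.0:
        raise ValueError("input_token_cut must be in [0, 1)")
    return scale * (input_cost * (1.0 - input_token_cut) + output_cost)


# Hypothetical baseline: $1,485 input + $742.50 output per month.
naive_10x = scaled_monthly_cost(1_485.0, 742.5)
compressed_10x = scaled_monthly_cost(1_485.0, 742.5, input_token_cut=0.30)
print(f"10x with no changes:        ${naive_10x:,.2f}/month")
print(f"10x with 30% fewer input tokens: ${compressed_10x:,.2f}/month")
```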

LLM Usage & Budget Modeler
