What is a token in LLM pricing?

A token is the unit of text an LLM processes — roughly 4 characters or 0.75 words in English. Tokenizers split text into subword units using algorithms like Byte Pair Encoding, so 'Hello, world!' is about 4 tokens, a 100-word paragraph is around 130 tokens, and a 2,000-word document is around 2,600 tokens. Providers bill on token count because it is the primary cost driver for inference, and non-English text usually costs more tokens per word.

How is LLM cost per million tokens calculated?

Cost = tokens ÷ 1,000,000 × price-per-million. Input and output tokens are priced separately, so you compute each leg and add them. For example, a call with 3,650 input tokens and 400 output tokens, at a hypothetical $3 per million input and $15 per million output, costs (3,650 ÷ 1,000,000 × $3) + (400 ÷ 1,000,000 × $15) = $0.01095 + $0.006 = about $0.017. Prices vary by provider and change over time, so treat any figure as a dated example.

Why are output tokens more expensive than input tokens?

Output tokens are typically 3–5× the price of input tokens because of how the compute works. Input tokens are processed in parallel — attention runs over the whole prompt at once — while output tokens are generated one at a time, each requiring its own forward pass through the model. The pricing reflects that compute asymmetry, which is why a short prompt that produces a long response can cost far more than a long prompt that produces a short one.

How do I estimate my monthly LLM bill?

Estimate the average input and output tokens per call, compute the per-call cost with the cost = tokens ÷ 1,000,000 × price-per-million formula for each leg, then multiply by your daily call volume and by 30. For instance, a $0.017 call at 500 calls per day is about $8.50 per day, or roughly $255 per month. Then add a buffer for evaluation runs, embeddings, and retries — a useful rule of thumb is 25–30% on top of the raw API estimate.

Can I reduce LLM cost without changing the model?

Yes — the largest lever is usually shortening output, since output tokens dominate spend. Trimming a bloated system prompt, capping max output length, and reusing repeated context via prompt caching all lower the token count you pay for. Routing high-output, lower-precision tasks to a cheaper model while reserving a frontier model for short, quality-critical outputs is the other big lever, and it can cut total cost by 40–70% without a noticeable quality drop.

Should I use a frontier model or a cheaper model?

Match the model to the shape of the task, not to its prestige. Reserve expensive frontier models for work where quality is critical and the output is short — classification, extraction, short-answer Q&A, and moderation are cheap on any model because they emit few tokens. Send high-output, lower-precision work such as long-form drafts to a cheaper mid-tier model. A mixed portfolio of models at different price points almost always beats running everything on the most capable model.

Does the calculator account for input and output separately?

Yes. Because input and output tokens price at different rates, any accurate estimate has to model the two legs independently, and the LLM Cost Calculator on this site does exactly that. It also lets you model a mixed portfolio of calls across different models and price points, which is the realistic case for a production app that routes work between a frontier model and a cheaper mid-tier model.

Is per-million-token pricing the same across providers?

No — Anthropic, OpenAI, Google, and others each publish their own per-million-token rates, and those rates differ by model tier and change over time. The mechanics are the same everywhere (input and output priced separately, output costing more), but the numbers are not, so always quote a provider and a date when you cite a price. The primary sources listed at the bottom of this article are the providers' official pricing pages — check them for current rates before budgeting.

LLM cost per million tokens explained: input, output, and caching

Updated May 23, 2026 · Byron Malone

LLM APIs charge per million tokens (input and output separately). A $3/M input rate means one million input tokens costs $3. A typical API call (1,500 input + 400 output tokens) with Claude 3.5 Sonnet ($3/M in, $15/M out) costs $0.0105. At 10,000 calls/day, that's $105/day or $3,150/month — before user growth, model upgrades, or architectural inefficiencies.

How it’s calculated

cost (one leg) = tokens ÷ 1,000,000 × price-per-million

  Input cost   = input tokens  ÷ 1,000,000 × input price/M
  Output cost  = output tokens ÷ 1,000,000 × output price/M
  Call cost    = input cost + output cost

Monthly cost  = call cost × calls/day × 30

Worked example (dated, illustrative rates):
  3,650 input × $3/M   = $0.01095
    400 output × $15/M = $0.00600
  ----------------------------------
  Call cost            = $0.01695
  × 500 calls/day      = $8.48/day ≈ $254/month

Assumptions: per-million-token rates are provider-specific (Anthropic, OpenAI, Google, and others each publish their own) and they change over time, so every dollar figure on this page is a dated example, not a current quote. Input and output tokens are always priced separately, and output is typically 3–5× the input rate. Verify current pricing against the providers’ official pricing pages linked at the bottom of this article before budgeting, and add roughly 25–30% on top of the raw API estimate to cover evaluation runs, embeddings, and retries.
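The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function names (`leg_cost`, `call_cost`, `monthly_cost`) are hypothetical, and the rates are the article's dated example figures.

```python
def leg_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of one leg (input or output) of a call."""
    return tokens / 1_000_000 * price_per_million


def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Input and output are priced separately, then summed."""
    return (leg_cost(input_tokens, input_price_per_m)
            + leg_cost(output_tokens, output_price_per_m))


def monthly_cost(cost_per_call: float, calls_per_day: int, days: int = 30) -> float:
    """Monthly cost = call cost x calls/day x 30 (a 30-day month is assumed)."""
    return cost_per_call * calls_per_day * days


if __name__ == "__main__":
    # Worked example from the article: 3,650 input + 400 output at $3/M and $15/M.
    per_call = call_cost(3_650, 400, 3.0, 15.0)
    print(f"per call:  ${per_call:.5f}")
    print(f"per month: ${monthly_cost(per_call, 500):.2f}")
```

Keeping the two legs as separate calls to `leg_cost` makes it easy to swap in a different provider's input and output rates without touching the rest of the arithmetic.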

What is a token and why it's priced the way it is

A token is the unit of text that language models process. Tokenization splits text into subword units using algorithms like Byte Pair Encoding (BPE). In English, a token is approximately 4 characters or 0.75 words. Common benchmarks:

- 'Hello, world!' = 4 tokens
- A 100-word paragraph = ~130 tokens
- A 500-word system prompt = ~650 tokens
- A 2,000-word document = ~2,600 tokens
- A 10,000-token context window ≈ 7,500 words ≈ 30 pages

Tokenization varies slightly between models. GPT and Claude models tokenize similarly for English text; both tokenize non-English text at higher rates (more tokens per word for languages like Chinese, Japanese, Arabic).
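For quick budgeting, the rules of thumb above can be turned into a rough estimator. This is only a heuristic sketch for English text, not a tokenizer: the function names are hypothetical, the 1.3 tokens-per-word default is inferred from the benchmarks listed above, and real counts should come from the provider's own tokenizer.

```python
import math


def estimate_tokens_from_words(word_count: int, tokens_per_word: float = 1.3) -> int:
    """Rough English estimate: about 1.3 tokens per word."""
    return round(word_count * tokens_per_word)


def estimate_tokens_from_text(text: str, chars_per_token: float = 4.0) -> int:
    """Rough English estimate: about 4 characters per token, rounded up."""
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    for words in (100, 500, 2_000):
        print(words, "words ->", estimate_tokens_from_words(words), "tokens (approx.)")
    print(repr("Hello, world!"), "->", estimate_tokens_from_text("Hello, world!"), "tokens (approx.)")
```

Because non-English text usually produces more tokens per word, both defaults would need to be raised (or replaced with a measured ratio) for other languages.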

Why this pricing model? Token count is the primary cost driver for LLM inference at the infrastructure level. Input tokens are processed together: attention runs over the whole prompt in parallel. Output tokens are generated sequentially, one at a time, each requiring its own forward pass. That compute asymmetry is why output tokens are typically priced at 3-5x the input rate.

The real cost of a typical LLM call in production

Let's work through a concrete example for a customer service chatbot:

- System prompt (product knowledge, tone guidelines, safety rules): 3,000 tokens
- Conversation history (3 turns): 500 tokens
- User message: 150 tokens
- Total input: 3,650 tokens
- Generated response: 400 tokens

Cost using Claude 3.5 Sonnet ($3/M input, $15/M output):

- Input: 3,650/1,000,000 × $3 = $0.01095
- Output: 400/1,000,000 × $15 = $0.006
- Total: $0.01695/call

At 500 calls/day from 200 daily active users (an average of 2.5 calls per user per day): 500 calls/day × $0.01695 = $8.48/day ≈ $254/month. Cost per DAU: $254/200 = $1.27/month.

Scaling to 10,000 DAU at the same per-user cost: $12,700/month in LLM costs alone. At a $29/month subscription, gross margin has to absorb $1.27 per user in LLM costs, or 4.4% of subscription revenue. That is sustainable. The picture changes at $5/DAU (about 17% of that subscription), which forces a rethink of the architecture or pricing.
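The unit-economics check above is simple enough to script and rerun whenever traffic or pricing changes. The sketch below uses hypothetical function names and the article's example figures ($254/month, 200 DAU, a $29 subscription).

```python
def cost_per_dau(monthly_llm_cost: float, daily_active_users: int) -> float:
    """Monthly LLM spend divided across daily active users."""
    return monthly_llm_cost / daily_active_users


def share_of_revenue(llm_cost_per_user: float, subscription_price: float) -> float:
    """Fraction of per-user subscription revenue consumed by LLM cost."""
    return llm_cost_per_user / subscription_price


if __name__ == "__main__":
    per_user = cost_per_dau(254.0, 200)
    print(f"cost per DAU:      ${per_user:.2f}/month")
    print(f"at 10,000 DAU:     ${per_user * 10_000:,.0f}/month")
    print(f"share of $29 plan: {share_of_revenue(per_user, 29.0):.1%}")
```

This assumes per-user cost stays constant as usage grows, which is a simplification: heavier users, longer conversation histories, or model upgrades would all shift the ratio.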

Input vs output token cost asymmetry: architectural implications

The 3-5x price premium for output tokens has real architectural implications:

High output-cost use cases (avoid with expensive models if possible):

- Long-form content generation (blog posts, reports, code)
- Multi-step chain-of-thought reasoning that generates extensive intermediate text
- Conversational AI where users expect lengthy responses

Low output-cost use cases (can use expensive models economically):

- Classification (output: one label, ~5 tokens)
- Entity extraction (output: structured JSON of extracted entities, ~50-200 tokens)
- Short-answer Q&A (output: 1-3 sentences, ~50-100 tokens)
- Moderation/safety checks (output: yes/no decision)

Architectural strategy: use expensive frontier models (GPT-4o, Claude 3.5 Sonnet) for tasks where quality is critical AND output is short. Use cheaper mid-tier models (GPT-4o mini, Haiku) for high-output, lower-precision tasks. This 'routing' approach can cut total LLM costs by 40-70% without meaningfully degrading user experience.
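To see how routing changes the bill, the comparison can be sketched as below. Everything here is illustrative: the `Rates` and `Task` types, the `route` rule, the workload, and the mid-tier rates ($0.50/M in, $2.50/M out) are hypothetical placeholders, not provider quotes. The frontier rates reuse the article's dated $3/$15 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    input_per_m: float   # dollars per million input tokens
    output_per_m: float  # dollars per million output tokens


@dataclass(frozen=True)
class Task:
    name: str
    calls_per_day: int
    input_tokens: int
    output_tokens: int


# Frontier rates are the article's dated example; mid-tier rates are a
# hypothetical placeholder, not a quote from any provider.
FRONTIER = Rates(input_per_m=3.00, output_per_m=15.00)
MID_TIER = Rates(input_per_m=0.50, output_per_m=2.50)

# Hypothetical daily workload, invented for illustration.
WORKLOAD = [
    Task("classification", calls_per_day=1_000, input_tokens=800, output_tokens=5),
    Task("long_form_draft", calls_per_day=200, input_tokens=1_000, output_tokens=2_000),
]


def daily_cost(task: Task, rates: Rates) -> float:
    """Daily dollar cost of one task type at the given rates."""
    per_call = (task.input_tokens * rates.input_per_m
                + task.output_tokens * rates.output_per_m) / 1_000_000
    return per_call * task.calls_per_day


def route(task: Task, output_threshold: int = 500) -> Rates:
    """Send long-output tasks to the cheaper tier (a deliberate simplification)."""
    return MID_TIER if task.output_tokens > output_threshold else FRONTIER


def portfolio_costs(workload: list[Task]) -> tuple[float, float]:
    """Return (all-frontier daily cost, routed daily cost)."""
    all_frontier = sum(daily_cost(t, FRONTIER) for t in workload)
    routed = sum(daily_cost(t, route(t)) for t in workload)
    return all_frontier, routed


if __name__ == "__main__":
    all_frontier, routed = portfolio_costs(WORKLOAD)
    print(f"all frontier: ${all_frontier:.3f}/day")
    print(f"routed:       ${routed:.3f}/day")
    print(f"savings:      {1 - routed / all_frontier:.0%}")
```

With these invented numbers the routed portfolio comes out roughly 60% cheaper, but the result depends entirely on the workload mix and the rates plugged in. Routing on output length alone is also a shortcut. A real router would weigh how quality-critical each task is as well.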

The LLM Cost Calculator lets you model a mixed portfolio of calls with different models at different price points.

Total cost of ownership: what teams miss in their LLM budget

Direct API costs are only part of the LLM cost picture. Often-missed line items:

1. Evaluation runs: testing prompt changes on 500+ samples before deployment. A $0.015/call eval on 1,000 samples = $15/eval cycle. If you run 10 eval cycles per product sprint, that's $150/sprint — not trivial at scale.

2. Embedding API calls for vector search in RAG: OpenAI text-embedding-3-small at $0.02/M tokens is cheap but accumulates. Embedding 100,000 documents of 500 tokens each = 50M tokens = $1.00 for the initial corpus. Re-embedding on updates adds ongoing cost.

3. Retry and error handling overhead: production LLM applications experience 1-3% error rates (rate limits, timeouts, malformed outputs). Retry logic adds 1-3% to effective token volume.

4. Prompt engineering iteration: development phase where prompt engineers experiment with dozens of variants before finding the optimal prompt. This testing cost often isn't budgeted.

5. Model upgrade migrations: when a new model version is released, you need to re-evaluate your prompts (older prompt patterns may not work identically) — another eval run cycle.

Budget recommendation: add 25-30% to your production API cost estimate to cover these ancillary costs.
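The ancillary line items above can be estimated with the same arithmetic. The sketch below uses hypothetical function names and plugs in the article's own example figures. Treat the 25-30% buffer as a rule of thumb rather than a measured value.

```python
def eval_cost(cost_per_call: float, samples: int, cycles: int = 1) -> float:
    """Cost of running an eval set a given number of times."""
    return cost_per_call * samples * cycles


def embedding_cost(documents: int, tokens_per_doc: int, price_per_million: float) -> float:
    """One-time cost of embedding a corpus."""
    return documents * tokens_per_doc / 1_000_000 * price_per_million


def with_retries(api_cost: float, retry_rate: float) -> float:
    """Inflate API cost by the share of calls that get retried."""
    return api_cost * (1 + retry_rate)


def budget_with_buffer(api_cost: float, buffer: float = 0.25) -> float:
    """Apply the rule-of-thumb buffer (0.25 to 0.30) on top of raw API cost."""
    return api_cost * (1 + buffer)


if __name__ == "__main__":
    print(f"10 eval cycles/sprint: ${eval_cost(0.015, 1_000, cycles=10):.2f}")
    print(f"initial embeddings:    ${embedding_cost(100_000, 500, 0.02):.2f}")
    print(f"$254.25 + 3% retries:  ${with_retries(254.25, 0.03):.2f}")
    low, high = budget_with_buffer(254.25, 0.25), budget_with_buffer(254.25, 0.30)
    print(f"budget range:          ${low:.2f} to ${high:.2f}")
```

Itemizing the extras first and then comparing the total against the flat buffer is a reasonable sanity check: if the itemized figure is far above 30%, the buffer is probably too small for that team's workflow.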

In my experience, the line item that surprises teams is almost never the input prompt; it’s the output. I’ve seen production bills where output tokens were 70–80% of total spend even though the prompts were far longer than the responses, simply because output is generated one token at a time and priced at a multiple of input. The cheapest reliable win I’ve found is to cap and trim output before touching the model choice: a verbose customer-support assistant emitting 600-token replies pays three times the output cost of one emitting 200-token replies, so when output dominates the bill, tightening the response spec often cuts it by about half with no perceptible quality loss.
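How much a shorter response saves depends on what share of the bill is output. The sketch below makes that explicit. The function names and the 1,000-token prompt are hypothetical, the rates are the article's dated $3/$15 example, and input length is assumed to stay unchanged when output is trimmed.

```python
def output_share(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Fraction of a call's cost that comes from output tokens."""
    in_cost = input_tokens * input_price_per_m
    out_cost = output_tokens * output_price_per_m
    return out_cost / (in_cost + out_cost)


def bill_reduction(share: float, old_output_tokens: int, new_output_tokens: int) -> float:
    """Fractional drop in total cost when only output length changes."""
    return share * (1 - new_output_tokens / old_output_tokens)


if __name__ == "__main__":
    # Hypothetical call: a 1,000-token prompt with a 600-token reply.
    share = output_share(1_000, 600, 3.0, 15.0)
    print(f"output share of cost: {share:.0%}")
    print(f"saving from 600 -> 200 output tokens: {bill_reduction(share, 600, 200):.0%}")
```

When the prompt is much larger relative to the reply, as in the 3,650-input, 400-output chatbot example, output is a smaller share of the cost and the same trim saves proportionally less.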

Frequently asked questions

By Byron Malone · Last verified May 2026

Founder & Editor, Bedrocka Tools

Try the calculator

This article pairs with the LLM Cost Calculator, which operationalizes the concepts above with your specific numbers.

Primary sources cited

Sources: every per-token rate referenced on this page is taken from the providers’ official, dated API pricing pages — Anthropic Claude API Pricing, OpenAI API Pricing, and Google Gemini API Pricing. Rates change over time, so each price quoted above is a dated example — verify the current number against the provider’s page before budgeting.

The cost math behind the paired calculator is open source and independently verifiable: View source on GitHub. For sourcing standards, our review process, and the correction policy, read the full methodology. Written and reviewed by Byron Malone.