LLM pricing explained: input, output, cached and reasoning tokens
Every provider quotes a price per million tokens. That single number hides four different rates and one big asymmetry — and the headline figure is the one least likely to predict your bill.
Updated · 8 min read
LLM APIs bill by the token — chunks of text roughly three-quarters of a word each. You pay for both directions: the tokens you send (input) and the tokens the model generates (output). Prices are quoted per million tokens, and the number you see first is usually the input rate, which is the cheaper and less important of the two.
The asymmetry that runs your bill
Output typically costs three to five times more than input at major providers. The reason is mechanical: input is processed in one parallel pass, while output is generated one token at a time, each token requiring a full pass over the model. So the ratio of output to input in your workload often matters more than either headline price. A summarizer reads a lot and writes a little: it's input-bound, and a low input rate wins. A code generator or a chat assistant writes a lot relative to what it reads: it's output-bound, and which model is cheapest can flip entirely.
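To make the flip concrete, here is a minimal sketch that prices two workloads on two models. The model names, rates, and token counts are all hypothetical, chosen only to illustrate the effect; they are not taken from any provider's price sheet.

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in dollars of one request; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Two hypothetical models (illustrative rates only).
MODELS = {
    "cheap_input": {"input_rate": 1.00, "output_rate": 10.00},
    "cheap_output": {"input_rate": 3.00, "output_rate": 6.00},
}

# Two hypothetical workloads: (input tokens, output tokens) per request.
WORKLOADS = {
    "summarizer": (20_000, 500),       # reads a lot, writes a little
    "code_generator": (1_000, 4_000),  # writes a lot relative to what it reads
}


def cheapest_model(workload):
    """Name of the model with the lowest per-request cost for a workload."""
    input_tokens, output_tokens = WORKLOADS[workload]
    costs = {
        name: request_cost(input_tokens, output_tokens, **rates)
        for name, rates in MODELS.items()
    }
    return min(costs, key=costs.get)


for workload in WORKLOADS:
    print(workload, "->", cheapest_model(workload))
```

With these made-up numbers the summarizer is cheapest on the low-input-rate model and the code generator is cheapest on the low-output-rate model, even though neither model changed its prices.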
Four rates hiding in one number
1. Input tokens
Everything you send: the prompt, the system message, conversation history, retrieved documents, tool definitions. It all counts, every request. Long context is convenient but you pay input rates for every token you put in the window — a 100K-token document attached to each call is 100K billed input tokens each time.
2. Output tokens
Everything the model writes. The pricey direction. If you can get the same result in fewer output tokens — tighter prompts, response-length limits, structured output instead of prose — you save real money.
3. Cached input
Reuse the same prefix across calls and most providers will give a large discount, commonly around 90% off, on the repeated portion, because they can cache its processed form. The discount applies only to the shared prefix, and the cache typically expires after a few minutes idle, but for agents and chats with a fixed system prompt it's one of the biggest savings available, and it's invisible in the headline rate.
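A rough sketch of how the cached rate changes the input side of one request. The function name, the 8,000-token system prompt, and the $2.50 rate are illustrative assumptions; the 90% figure is the common discount mentioned above, and the sketch ignores any extra fee a provider may charge for writing to the cache.

```python
def input_cost(prefix_tokens, fresh_tokens, input_rate,
               cached_discount=0.90, cache_hit=True):
    """Input cost in dollars for one request.

    prefix_tokens: tokens in the shared, repeated prefix (e.g. a system prompt)
    fresh_tokens:  tokens unique to this request, always billed at the full rate
    input_rate:    dollars per million input tokens
    """
    # On a cache hit only the shared prefix gets the discounted rate.
    prefix_rate = input_rate * (1 - cached_discount) if cache_hit else input_rate
    return (prefix_tokens * prefix_rate + fresh_tokens * input_rate) / 1_000_000


# Hypothetical chat turn: 8,000-token fixed system prompt, 500 new tokens.
miss = input_cost(8_000, 500, 2.50, cache_hit=False)
hit = input_cost(8_000, 500, 2.50, cache_hit=True)
print(f"cache miss: ${miss:.5f}  cache hit: ${hit:.5f}")
```

The fresh tokens still cost full price, which is why the saving scales with how much of each request is shared prefix.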
4. Reasoning / "thinking" tokens
Reasoning models generate a hidden chain of thought before answering. Those tokens are billed as output, at the output rate, even though you never see them. A model that thinks for 2,000 tokens to write a 200-token answer bills you for 2,200 output tokens. This is the single biggest reason list prices mislead: a cheap-looking reasoning model can cost more per finished task than an expensive model that answers directly.
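The arithmetic can be sketched in a few lines. The two output rates below ($4 and $10 per million) are hypothetical, picked so the reasoning model looks cheaper on paper; the token counts are the ones from the example above.

```python
def billed_output_tokens(visible_tokens, reasoning_tokens=0):
    """Output tokens you pay for: the visible answer plus any hidden reasoning."""
    return visible_tokens + reasoning_tokens


def output_cost(visible_tokens, output_rate, reasoning_tokens=0):
    """Output cost in dollars; output_rate is dollars per million tokens."""
    return billed_output_tokens(visible_tokens, reasoning_tokens) * output_rate / 1_000_000


# Hypothetical reasoning model: lower list rate, but 2,000 thinking tokens per answer.
reasoning = output_cost(200, output_rate=4.00, reasoning_tokens=2_000)
# Hypothetical direct model: higher list rate, answers without hidden reasoning.
direct = output_cost(200, output_rate=10.00)

print(f"reasoning model: ${reasoning:.4f}  direct model: ${direct:.4f}")
```

Under these assumptions the model with the lower list rate costs several times more for the same 200-token answer.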
Why list price ≠ real cost
A price per million tokens tells you the rate, not the bill. The bill is rate × tokens, and verbose or reasoning-heavy models simply emit more tokens to do the same job. Two models at the same output rate can differ two- or three-fold in what it costs to actually finish a task. The only way to know is to run the task and count the tokens it really consumes — which is exactly what our measured cost-per-task index does, instead of trusting the headline number.
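As an illustration of measuring rather than trusting the rate, the sketch below averages cost over recorded runs of the same task. The run data and rates are invented for the example, and the function is a simplified stand-in, not the method behind any particular index.

```python
from statistics import mean


def cost_per_task(runs, input_rate, output_rate):
    """Mean dollars per finished task.

    runs: measured (input_tokens, output_tokens) pairs, one per completed task
    rates: dollars per million tokens
    """
    return mean(
        (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
        for input_tokens, output_tokens in runs
    )


# Hypothetical measurements for two models with identical list prices.
concise_runs = [(1_200, 400), (1_200, 500), (1_200, 450)]
verbose_runs = [(1_200, 1_100), (1_200, 1_300), (1_200, 1_200)]

concise = cost_per_task(concise_runs, input_rate=2.50, output_rate=10.00)
verbose = cost_per_task(verbose_runs, input_rate=2.50, output_rate=10.00)
print(f"concise: ${concise:.4f}  verbose: ${verbose:.4f}  ratio: {verbose / concise:.1f}x")
```

Both models share the same rates here, so the whole difference in cost per task comes from how many tokens each one emits.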
What the calculator leaves out
Standard per-token list pricing — what most comparisons (including our calculator's default) use — omits several real-world modifiers:
- Batch discounts — roughly 50% off for asynchronous, non-urgent jobs.
- Cached input — the ~90%-off repeated-prefix rate described above.
- Tiered long-context pricing — some models charge a higher rate once a prompt crosses a size threshold (commonly 128K or 200K tokens).
- Surcharges — image and audio inputs, fine-tuned model hosting, and tool/function overhead beyond the tokens themselves.
None of these change the core lesson: estimate with your real token counts, weight output heavily, and treat a single price-per-million as a starting point, not an answer.
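Two of these modifiers, the batch discount and tiered long-context pricing, are easy to layer onto a list-price estimate. The sketch below is one possible way to do it; the 2x long-context multiplier, the 200K threshold default, and the choice to raise only the input rate are assumptions for illustration, so check your provider's price sheet for the actual rules.

```python
def adjusted_cost(input_tokens, output_tokens, input_rate, output_rate, *,
                  batch=False, batch_discount=0.50,
                  long_context_threshold=200_000, long_context_multiplier=2.0):
    """Dollars for one request after two list-price modifiers.

    Assumptions (hypothetical):
    - crossing the threshold raises the input rate for the whole prompt;
    - the batch discount applies to the full request cost.
    Rates are dollars per million tokens.
    """
    if input_tokens > long_context_threshold:
        input_rate *= long_context_multiplier
    cost = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return cost * (1 - batch_discount) if batch else cost


short_prompt = adjusted_cost(100_000, 1_000, 2.50, 10.00)
long_prompt = adjusted_cost(250_000, 1_000, 2.50, 10.00)
long_prompt_batched = adjusted_cost(250_000, 1_000, 2.50, 10.00, batch=True)
print(short_prompt, long_prompt, long_prompt_batched)
```

Keeping the modifiers as keyword arguments makes it easy to see how far an estimate moves when each one is switched on.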
Prices move — and the direction is down
A model's price is a snapshot, not a constant. New models routinely ship cheaper than the ones they replace, and existing models occasionally get price cuts, so the cost of a given level of capability falls over time — sometimes sharply, in step-changes rather than smooth declines. That matters for any decision with a long shelf life: a workload that's marginal today may be comfortably affordable in a few months on a newer model, and a model you standardized on a year ago may now be undercut by a cheaper one that's just as good for your task. It pays to recheck rather than assume. Our price-history archive tracks these moves on an append-only basis, including a headline "cost of intelligence" line that follows the cheapest capable model over time — the clearest picture of how fast the floor is dropping.
A quick worked estimate
Say a chat feature averages 1,500 input and 500 output tokens per request, at 200,000 requests a month.
On a model priced at $2.50/1M input and $10/1M output, that's
(1,500 ÷ 1M × $2.50) + (500 ÷ 1M × $10) = $0.00375 + $0.005 = $0.00875 per request, or $1,750/month.
Halve the output length and you don't halve the bill, but you do cut into the larger part of it: output is about 57% of each request's cost here, so the bill falls by roughly 29%, to $1,250/month. Switching to a
model with a lower input rate touches only the smaller part of the bill, because output tokens account for most of the spend. Change the
numbers for your case in the calculator and watch the ranking reorder.
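The same estimate as a small script, so you can swap in your own figures. The function name is illustrative; the inputs are the ones from the example (1,500 input and 500 output tokens, $2.50 and $10 per million, 200,000 requests a month), plus a second run with the output length halved.

```python
def monthly_cost(input_tokens, output_tokens, input_rate, output_rate, requests_per_month):
    """Return (dollars per request, dollars per month); rates are dollars per million tokens."""
    per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return per_request, per_request * requests_per_month


# Figures from the example above.
baseline = monthly_cost(1_500, 500, 2.50, 10.00, 200_000)
# Same workload with the output length halved.
shorter = monthly_cost(1_500, 250, 2.50, 10.00, 200_000)

for label, (per_request, per_month) in [("baseline", baseline), ("half output", shorter)]:
    print(f"{label}: ${per_request:.5f} per request, ${per_month:,.2f} per month")
```

Running it prints the per-request and monthly figures for both cases, which shows directly how much of the bill the output side carries.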
Frequently asked questions
Why is output priced higher than input?
Generating tokens is more expensive than reading them. Input tokens are processed in parallel in a single forward pass (the 'prefill'), while output tokens are produced one at a time, each requiring a full pass over the model and growing the KV cache. That sequential 'decode' step dominates cost and latency, so providers typically price output three to five times higher than input.
Do reasoning or 'thinking' tokens cost extra?
They're billed as output tokens, at the output rate, even when the model hides them from you. A model that 'thinks' for 2,000 tokens before writing a 200-token answer bills you for 2,200 output tokens. This is why a model with a low headline output rate can still cost more per finished task than a pricier model that answers concisely — the only way to know is to measure real token usage.
What is cached input pricing?
When you reuse the same prefix across requests — a long system prompt, a document, a few-shot example block — providers can cache the processed form of those tokens and charge a steep discount (often around 90% off) for the cached portion on subsequent calls. It only applies to the repeated prefix, and the cache expires after minutes of inactivity, but for agents and chat with a fixed preamble it can cut input costs dramatically.
Related
- API cost calculator — your monthly cost across every tracked model
- Token counter — how many tokens your text actually is
- Real cost-per-task — measured cost to finish a job, not list price
- Token costs in practice — worked monthly-cost examples