Token costs in practice: real workloads, real monthly bills
Rates per million tokens are abstract. Monthly bills are not. Here are four common workloads costed end to end — and the levers that actually move each one.
The way to understand LLM cost is to cost a real workload, not to stare at a rate card. Below are four common shapes — a chat assistant, a document summarizer, a coding agent, and a bulk classifier — each estimated from its token profile. The point isn't the exact dollar figures (plug your own into the calculator); it's seeing which lever moves each one, because they're all different.
For illustration we cost each workload on a representative mid-tier model at $2.50 per 1M input and $10 per 1M output tokens. Cost per request = (input ÷ 1M × $2.50) + (output ÷ 1M × $10).
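The formula is simple enough to script, which makes it easy to re-cost a workload when rates or volumes change. Below is a minimal Python sketch that uses the guide's illustrative $2.50/$10 rates as defaults. The function names and the `WORKLOADS` table are our own illustration, not any provider's API.

```python
# Illustrative mid-tier rates from the text, in dollars per 1M tokens.
INPUT_RATE_PER_M = 2.50
OUTPUT_RATE_PER_M = 10.00


def cost_per_request(input_tokens: int, output_tokens: int,
                     input_rate: float = INPUT_RATE_PER_M,
                     output_rate: float = OUTPUT_RATE_PER_M) -> float:
    """Dollar cost of one request: (input / 1M * input_rate) + (output / 1M * output_rate)."""
    return input_tokens / 1_000_000 * input_rate + output_tokens / 1_000_000 * output_rate


def monthly_cost(input_tokens: int, output_tokens: int,
                 requests_per_month: int, **rates: float) -> float:
    """Per-request cost multiplied by monthly volume; pass input_rate/output_rate to override."""
    return cost_per_request(input_tokens, output_tokens, **rates) * requests_per_month


# The four workload profiles costed in this guide: (input, output, requests/month).
WORKLOADS = {
    "chat assistant": (2_000, 600, 300_000),
    "summarizer / RAG": (12_000, 400, 100_000),
    "coding agent": (8_000, 3_000, 40_000),
    "classifier": (400, 10, 5_000_000),
}

if __name__ == "__main__":
    for name, (inp, out, volume) in WORKLOADS.items():
        print(f"{name:18} ${cost_per_request(inp, out):.4f}/request  "
              f"${monthly_cost(inp, out, volume):,.0f}/month")
```

Passing different `input_rate` and `output_rate` values is how you would compare candidate models on the same token profile.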
1. A customer-facing chat assistant
Profile: ~2,000 input tokens (system prompt plus short history) and ~600 output tokens per turn, at 300,000 turns/month. Cost: input is 2,000 ÷ 1M × $2.50 = $0.005; output is 600 ÷ 1M × $10 = $0.006; total $0.011/turn → ~$3,300/month. Notice that output already outweighs input despite being under a third as many tokens; that's the 4× rate asymmetry. Levers: cap and tighten responses (the expensive direction), and cache the fixed system prompt for the ~90% cached-input discount, which could remove most of the ~$1,500/month input share, depending on how much of each turn the system prompt makes up.
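To see how the two chatbot levers compare, here is a small Python sketch. It assumes, hypothetically, that 1,500 of the 2,000 input tokens are a fixed system prompt served from cache at a flat 90% discount, and that responses can be capped at 400 tokens. Real prompt caches vary by provider; some charge extra to write the cache or require a minimum prefix length, which this sketch ignores.

```python
INPUT_RATE = 2.50 / 1_000_000     # $ per input token (illustrative mid-tier rate)
OUTPUT_RATE = 10.00 / 1_000_000   # $ per output token
CACHE_DISCOUNT = 0.90             # assumed ~90% off cached input tokens


def turn_cost(cached_input: int, fresh_input: int, output: int) -> float:
    """Cost of one chat turn when part of the input is served from the prompt cache."""
    cached = cached_input * INPUT_RATE * (1 - CACHE_DISCOUNT)
    fresh = fresh_input * INPUT_RATE
    return cached + fresh + output * OUTPUT_RATE


TURNS_PER_MONTH = 300_000
# Hypothetical split: 1,500 of the 2,000 input tokens are a fixed system prompt.
scenarios = {
    "baseline": turn_cost(0, 2_000, 600),
    "cache system prompt": turn_cost(1_500, 500, 600),
    "cache + cap output at 400": turn_cost(1_500, 500, 400),
}
for name, per_turn in scenarios.items():
    print(f"{name:26} ${per_turn * TURNS_PER_MONTH:,.2f}/month")
```

Under these assumptions caching takes the bill from $3,300 to about $2,288, and trimming 200 output tokens per turn takes it to about $1,688. The split you actually see depends on how large your system prompt is relative to the history.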
2. A document summarizer / RAG answerer
Profile: ~12,000 input tokens (a retrieved document set) and ~400 output tokens per request, at 100,000 requests/month. Cost: input is 12,000 ÷ 1M × $2.50 = $0.030; output is 400 ÷ 1M × $10 = $0.004; total $0.034/request → ~$3,400/month. This one is input-bound — the opposite of the chatbot — so the input rate dominates and a model with cheap input wins outright. Levers: retrieve less aggressively (fewer, better chunks), and reuse a cached document prefix across follow-up questions about the same source.
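Because this workload is input-bound, its bill scales almost linearly with how much context you retrieve. The Python sketch below makes that concrete under a hypothetical assumption: the 12,000-token context is 12 retrieved chunks of ~1,000 tokens each, and only the chunk count changes.

```python
INPUT_RATE_PER_M = 2.50
OUTPUT_RATE_PER_M = 10.00
REQUESTS_PER_MONTH = 100_000
OUTPUT_TOKENS = 400
CHUNK_TOKENS = 1_000  # hypothetical: the 12,000-token context is 12 chunks of ~1,000 tokens


def monthly_rag_cost(chunks: int) -> float:
    """Monthly bill when each request retrieves `chunks` chunks of CHUNK_TOKENS each."""
    input_tokens = chunks * CHUNK_TOKENS
    per_request = (input_tokens * INPUT_RATE_PER_M
                   + OUTPUT_TOKENS * OUTPUT_RATE_PER_M) / 1_000_000
    return per_request * REQUESTS_PER_MONTH


for k in (12, 8, 6, 4):
    print(f"top-{k:<2} chunks: ${monthly_rag_cost(k):,.0f}/month")
```

Halving retrieval from 12 to 6 chunks cuts the bill from $3,400 to $1,900; the 400-token answers are a fixed ~$400/month floor that retrieval tuning can't touch.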
3. A coding agent
Profile: coding agents are output-heavy and often reasoning-heavy. Say ~8,000 input tokens (files + instructions) and ~3,000 output tokens (a diff or new code), at 40,000 runs/month; if the model "thinks" before writing, add those hidden tokens to output. Cost (no hidden reasoning): input 8,000 ÷ 1M × $2.50 = $0.020; output 3,000 ÷ 1M × $10 = $0.030; total $0.050/run → ~$2,000/month. Now add 2,000 reasoning tokens per run, billed as output: output becomes 5,000 tokens = $0.050, total $0.070/run → ~$2,800/month, a 40% jump from tokens you never see. Lever: this is where list price misleads most. Measure real token usage per task, reasoning included, before assuming a reasoning model with a low rate is cheap.
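The hidden-reasoning effect is easy to reproduce. In this Python sketch the reasoning tokens never appear in the response text but are added to the billed output count, as the text describes; the `run_cost` helper is illustrative, and in practice you would take the reasoning count from whatever usage metadata your provider reports.

```python
INPUT_RATE_PER_M = 2.50
OUTPUT_RATE_PER_M = 10.00
RUNS_PER_MONTH = 40_000


def run_cost(input_tokens: int, visible_output: int, reasoning_tokens: int = 0) -> float:
    """Reasoning tokens are hidden from the response text but billed at the output rate."""
    billed_output = visible_output + reasoning_tokens
    return (input_tokens * INPUT_RATE_PER_M + billed_output * OUTPUT_RATE_PER_M) / 1_000_000


plain = run_cost(8_000, 3_000) * RUNS_PER_MONTH
thinking = run_cost(8_000, 3_000, reasoning_tokens=2_000) * RUNS_PER_MONTH
print(f"no reasoning:   ${plain:,.0f}/month")
print(f"with reasoning: ${thinking:,.0f}/month (+{thinking / plain - 1:.0%})")
```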
4. A high-volume classifier
Profile: ~400 input tokens and ~10 output tokens (a label) per item, at 5,000,000 items/month. Cost: input 400 ÷ 1M × $2.50 = $0.001; output 10 ÷ 1M × $10 = $0.0001; total ~$0.0011/item → ~$5,500/month on a mid-tier model. Lever: volume is the whole story here, and the task is easy. Drop to a budget model at a fraction of the rate and the bill can fall 5–10×, often with no measurable quality loss on a task this narrow (confirm on a labeled sample before switching); add batch pricing (~50% off) for anything non-urgent. High-volume, narrow tasks are where cheap models earn their keep.
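Here is a sketch of how the two classifier levers stack. The budget-tier rates ($0.50/$2.00 per 1M, five times below mid-tier) are a hypothetical choice for illustration, and the flat 50% batch discount is an assumption; check your provider's actual batch terms.

```python
ITEMS_PER_MONTH = 5_000_000
INPUT_TOKENS, OUTPUT_TOKENS = 400, 10
BATCH_DISCOUNT = 0.50  # assumed ~50% off for asynchronous batch jobs

# (input $/1M, output $/1M). The budget-tier rates are hypothetical, chosen 5x below mid-tier.
RATES = {"mid-tier": (2.50, 10.00), "budget": (0.50, 2.00)}


def monthly_cost(model: str, batched: bool = False) -> float:
    """Monthly bill for the classifier on the given rate tier, optionally via batch pricing."""
    in_rate, out_rate = RATES[model]
    per_item = (INPUT_TOKENS * in_rate + OUTPUT_TOKENS * out_rate) / 1_000_000
    if batched:
        per_item *= 1 - BATCH_DISCOUNT
    return per_item * ITEMS_PER_MONTH


for model in RATES:
    for batched in (False, True):
        label = f"{model}{' + batch' if batched else ''}"
        print(f"{label:16} ${monthly_cost(model, batched):,.0f}/month")
```

With these assumed rates the bill goes from $5,500 to $1,100 on the budget model and to $550 with batching, a 10× reduction in total.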
Track spend before it surprises you
An estimate is a starting point, not a guarantee. Real usage drifts as prompts grow, retrieval returns more context, and users ask harder questions that trigger more reasoning. Three habits keep the bill honest. Log token usage per request (the major provider APIs report input and output token counts in each response) so you see the real distribution rather than your assumed average; the tail of expensive requests is usually where the money goes. Set a budget alert, at the provider if it offers one, so a runaway loop or a prompt injection that balloons output can't quietly run up a large bill overnight. Re-cost periodically as your traffic shape changes and as prices move. The workloads above are illustrations; your logs are the truth, and they often reveal that a small fraction of requests drives most of the spend. That tail is exactly where caching, output caps, or a cheaper model pay off most.
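Once usage is logged, finding the expensive tail takes only a few lines. This Python sketch computes what share of spend comes from the most expensive 10% of requests. The `UsageRecord` fields are hypothetical names you would map from your provider's usage data, and the sample log is invented for illustration.

```python
from dataclasses import dataclass

INPUT_RATE_PER_M = 2.50
OUTPUT_RATE_PER_M = 10.00


@dataclass
class UsageRecord:
    """One logged request. Field names are hypothetical; map them from your provider's usage fields."""
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        return (self.input_tokens * INPUT_RATE_PER_M
                + self.output_tokens * OUTPUT_RATE_PER_M) / 1_000_000


def top_share(records: list[UsageRecord], fraction: float = 0.10) -> float:
    """Share of total spend that comes from the most expensive `fraction` of requests."""
    costs = sorted((r.cost for r in records), reverse=True)
    k = max(1, int(len(costs) * fraction))
    return sum(costs[:k]) / sum(costs)


# Hypothetical log: nine typical chat turns and one runaway response.
logs = [UsageRecord(2_000, 600) for _ in range(9)] + [UsageRecord(2_000, 20_000)]
print(f"top 10% of requests = {top_share(logs):.0%} of spend")
```

In this made-up log a single runaway response accounts for roughly two thirds of the spend, which is the pattern a budget alert and an output cap are meant to catch.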
The pattern across all four
Four workloads, four different optimal moves: cap output for the chatbot, cut input for the summarizer, measure reasoning tokens for the coding agent, downscale the model for the classifier. The unifying lessons:
- Weight output heavily. It's priced several times higher than input, so the read/write ratio of your workload decides which model is actually cheapest.
- Reasoning tokens are billed output you can't see. They can swing a bill 40% or more — never trust a reasoning model's headline rate without measuring.
- Caching and batching are large, underused discounts: typically ~90% off repeated input and ~50% off async batches.
- Match the model to the task's difficulty. Most volume is easy; reserve expensive models for the requests where you can see the difference.
Estimate yours by writing one representative request and answer, counting tokens (the token counter calibrates this on real text), and multiplying by volume in the calculator. Then sanity-check the finalists against measured cost-per-task — because the cheapest rate and the cheapest finished job are not always the same model.
Frequently asked questions
How do I estimate tokens before I've built anything?
Use the rule that one token is about 0.75 words, or four characters of English. Write a representative prompt and a representative answer, count the words, divide by 0.75, and you have a per-request estimate. Paste real text into a token counter to calibrate — tokenizers differ between models, and code, JSON and non-English text run more tokens per word than prose.
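The rule of thumb translates directly into code. This sketch implements both the word-based and character-based approximations; it is not a tokenizer, and the sample prompt and answer are placeholders for your own representative text.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose
CHARS_PER_TOKEN = 4     # alternative rule of thumb for English


def tokens_from_words(text: str) -> int:
    """Estimate tokens as word count / 0.75. Code, JSON and non-English text usually need more."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def tokens_from_chars(text: str) -> int:
    """Estimate tokens as character count / 4."""
    return round(len(text) / CHARS_PER_TOKEN)


# Placeholder "representative exchange": a 150-word prompt and a 450-word answer.
prompt = "word " * 150
answer = "word " * 450
print(tokens_from_words(prompt), tokens_from_words(answer))  # 200 600
```

Multiply the estimates by the per-token rates and by monthly volume for a first figure, then replace them with counts from a real tokenizer once you have sample text.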
Which costs more: a chatbot or a summarizer?
Per request, it depends on how much each one reads and writes, not just on which rate is higher. A summarizer reads a large document and writes a short summary, so it's input-bound and a low input rate matters most. A chatbot writes more relative to what it reads, so the higher output rate dominates its bill. Don't assume the output-bound workload is the pricier one: in the examples above the summarizer costs about three times as much per request ($0.034 vs $0.011), because reading six times as much input outweighs the 4× output premium. Total tokens and their split together decide the bill.
What's the easiest way to cut LLM costs?
In order of impact: shorten outputs (it's the expensive direction), cache repeated prefixes like long system prompts for the ~90% cached-input discount, use batch pricing for non-urgent jobs (~50% off), and route easy requests to a cheaper model. A model with a lower input rate pays off mainly when your workload is input-bound, so measure before you assume.
Related
- API cost calculator — plug in these numbers and compare every model
- Real cost-per-task — measured cost to finish a task, not list price
- Token counter — calibrate token counts on your real text
- LLM pricing explained — the rates behind these numbers