methodology
How we measure
Our whole value is being right about what things cost. So here is exactly where every number comes from, how we reconcile sources, and what we don't claim. Last snapshot: 298 models, 96 priced from first-party lists, 202 from the routed market.
Two sources, on purpose
No single source is both broad and authoritative. So we use two and tell you which you're looking at on every row.
- routed: OpenRouter. The live market price aggregated across providers. Source of truth for open-weight models, fine-tunes, and any model resold by third parties. New catalog models are admitted automatically.
- first-party list: For the four closed providers (OpenAI, Anthropic, Google, xAI) every model is sole-source: no one resells o1-pro, so OpenRouter passes through the provider's own price, and we treat their whole proprietary lineup as first-party list. (Open-weight models under those names, such as gpt-oss and gemma, are community-served, so they stay routed.)
- first-party: reconciled override. DeepSeek and Mistral also ship open weights that third parties resell cheaper, so for them we curate the official list price and reconcile it against OpenRouter every run. When they differ beyond tolerance, the first-party price wins and the row is tagged first-party.
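The three sourcing rules above can be sketched as a small decision function. This is an illustration under assumptions, not our production code: the `Policy` names, the provider slugs, and the `open_weight` flag are hypothetical stand-ins for whatever the catalog actually stores.

```python
from enum import Enum


class Policy(Enum):
    """Illustrative names for the three sourcing rules (not the row tags themselves)."""
    ROUTED = "routed"
    PASS_THROUGH = "first-party list"
    RECONCILED = "reconciled override"


# Hypothetical provider slugs; the real catalog ids may be spelled differently.
CLOSED_PROVIDERS = {"openai", "anthropic", "google", "xai"}
RECONCILED_PROVIDERS = {"deepseek", "mistral"}


def sourcing_policy(provider: str, open_weight: bool) -> Policy:
    """Pick a sourcing rule from the provider and whether the weights are open."""
    provider = provider.lower()
    # Proprietary models from the closed providers are sole-source,
    # so the routed price is the provider's own list price.
    if provider in CLOSED_PROVIDERS and not open_weight:
        return Policy.PASS_THROUGH
    # These labs publish a list price but are also resold cheaper,
    # so the curated list price is reconciled against the market.
    if provider in RECONCILED_PROVIDERS:
        return Policy.RECONCILED
    # Everything else, including open-weight models under closed-provider names.
    return Policy.ROUTED
```

Under these assumptions a proprietary OpenAI model resolves to `PASS_THROUGH`, while an open-weight model under the same provider name falls through to `ROUTED`.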
How we reconcile them
For every premier model (one whose first-party list price we curate) we compare the OpenRouter price against the curated first-party list price on each run, using a deterministic model-id map (no fuzzy name matching):
- Within 0.5% → they agree; we keep the market price live and tag the row . (Sub-percent gaps are just rounding.)
- Beyond 0.5% → the first-party price wins and becomes the live value; the row is tagged first-party. We store both numbers and the delta so the choice is auditable.
- Beyond 15% → it also goes to our review queue. A large gap usually means a reseller is routing the model cheaper than the lab's own API — worth knowing, not an error.
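The threshold rules above can be sketched in a few lines. Treat this as a minimal illustration under stated assumptions: the gap is measured relative to the list price, a gap of exactly 0.5% counts as agreement, and every name here (`reconcile`, `Reconciled`, the `source` strings) is hypothetical.

```python
from dataclasses import dataclass

AGREE_TOLERANCE = 0.005   # 0.5%: at or below this, the two sources agree
REVIEW_THRESHOLD = 0.15   # 15%: beyond this, the row also goes to human review


@dataclass
class Reconciled:
    live_price: float     # the value shown on the row
    source: str           # "market" or "first-party" (illustrative labels)
    market_price: float   # both inputs are stored so the choice is auditable
    list_price: float
    delta: float          # relative gap, measured against the list price (assumption)
    needs_review: bool


def reconcile(market_price: float, list_price: float) -> Reconciled:
    """Apply the agreement, override, and review thresholds to one model."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    delta = abs(market_price - list_price) / list_price
    if delta <= AGREE_TOLERANCE:
        # Sources agree: keep the market price live.
        live, source = market_price, "market"
    else:
        # Sources disagree: the first-party price wins.
        live, source = list_price, "first-party"
    return Reconciled(
        live_price=live,
        source=source,
        market_price=market_price,
        list_price=list_price,
        delta=delta,
        needs_review=delta > REVIEW_THRESHOLD,
    )
```

With made-up prices, `reconcile(9.0, 10.0)` has a 10% gap, so the list price of 10.0 becomes the live value without a review flag, while `reconcile(5.0, 10.0)` has a 50% gap and is both overridden and flagged.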
Routine catalog growth never floods that queue: only premier price changes, large premier divergences, and outright fetch failures are flagged for a human.
How often, and the history
Prices are refreshed on an alternate-day cadence. Each refresh appends a dated snapshot to an append-only history table — rows can be added but never edited or deleted, enforced at the database level. That archive is the point: it gets more valuable every day it runs, and it lets us show how the cost of intelligence falls over time. Because the cadence is every other day, the history has gaps by design — it is a time series of capture dates, not a guaranteed row per calendar day.
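One way to enforce append-only storage at the database level is with triggers that reject every UPDATE and DELETE. The sketch below uses SQLite through Python's standard `sqlite3` module purely for illustration; the database engine, the table name, and the column names are assumptions, not a description of our actual schema.

```python
import sqlite3

# Hypothetical schema: one row per (capture date, model).
SCHEMA = """
CREATE TABLE price_history (
    captured_on  TEXT NOT NULL,
    model_id     TEXT NOT NULL,
    input_price  REAL NOT NULL,
    output_price REAL NOT NULL,
    PRIMARY KEY (captured_on, model_id)
);

-- Reject any attempt to edit an existing row.
CREATE TRIGGER price_history_no_update
BEFORE UPDATE ON price_history
BEGIN
    SELECT RAISE(ABORT, 'price_history is append-only');
END;

-- Reject any attempt to delete a row.
CREATE TRIGGER price_history_no_delete
BEFORE DELETE ON price_history
BEGIN
    SELECT RAISE(ABORT, 'price_history is append-only');
END;
"""


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Create a fresh history database with the append-only triggers installed."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append_snapshot(conn: sqlite3.Connection, captured_on: str, rows) -> None:
    """Insert one dated snapshot; rows are (model_id, input_price, output_price)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO price_history VALUES (?, ?, ?, ?)",
            [(captured_on, *row) for row in rows],
        )
```

Inserts succeed as usual, but an `UPDATE` or `DELETE` against `price_history` raises a `sqlite3.DatabaseError`, so the rule holds even for code that bypasses the application layer.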
What we don't claim
- Routed prices are market prices and can differ from a lab's own API. We curate first-party prices only for the premier set; we'd rather mark a price "routed" than pretend a reseller's rate is the lab's official one.
- These are list prices for standard usage. Batch discounts, cached-input rates, tiered long-context pricing, and volume deals aren't modelled in the headline number.
- Context windows come from provider/model documentation and can change.
- Everything carries a capture date. Prices move without notice — verify against the provider before you rely on a figure.
Measured cost, not just list price
List prices are only half the story — a model that costs less per token can cost more to finish a task if it generates more tokens. That's why we also run a fixed task battery across the core models and publish the real cost-to-complete on the cost-per-task index. Sourced list prices tell you the rate; measured cost-per-task tells you the bill.
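A quick worked example shows how the cheaper rate can produce the bigger bill. The numbers below are invented for illustration, and the sketch assumes prices quoted in dollars per million tokens with separate input and output rates; `task_cost` is a hypothetical helper, not the formula behind the cost-per-task index.

```python
def task_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one task, given token counts and per-million-token prices."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Two made-up models answering the same 1,000-token prompt.
# The "cheap" model charges half the rate but writes three times as much.
cheap_model = task_cost(1_000, 6_000, input_price_per_mtok=1.0, output_price_per_mtok=4.0)
pricey_model = task_cost(1_000, 2_000, input_price_per_mtok=2.0, output_price_per_mtok=8.0)

# cheap_model  = (1,000 * 1 + 6,000 * 4) / 1,000,000 = $0.025
# pricey_model = (1,000 * 2 + 2,000 * 8) / 1,000,000 = $0.018
cheaper_rate_costs_more = cheap_model > pricey_model
```

In this invented case the model with half the per-token rate ends up about 39% more expensive per task, because output volume dominates the total.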
Related
- Model comparison — the full catalog with source tags.
- API cost calculator — your monthly cost.
- Price history — the append-only archive over time.