cost per task

What it actually costs to finish the job

List prices lie by omission. We run a fixed battery of real tasks across the core models, count every token actually consumed — input, output, and hidden reasoning — and publish the true cost to complete each one. The cheapest headline rate is rarely the cheapest finished job.

last measured run: 6 models · 4 tasks
total measured spend, this run: $0.0595
cheaper to finish · pricier to finish · cost = real tokens × current price

Summarize

6 models

Condense a passage to three bullet points.

model             tokens            latency*  cost to finish
Gemini 2.5 Flash  66 in · 43 out    1.4s      $0.00013
GPT-5.4 mini      77 in · 62 out    1.7s      $0.00034
Claude Haiku 4.5  79 in · 60 out    2.1s      $0.00038
GPT-5.5           77 in · 76 out    2.5s      $0.0027
Claude Opus 4.8   112 in · 85 out   2.4s      $0.0027
Gemini 3.1 Pro    66 in · 296 out   5.1s      $0.0037

Extract

6 models

Pull structured JSON (name, email, company) from text.

model             tokens            latency*  cost to finish
Gemini 2.5 Flash  38 in · 40 out    696ms     $0.00011
GPT-5.4 mini      45 in · 25 out    1.1s      $0.00015
Claude Haiku 4.5  47 in · 46 out    2.2s      $0.00028
Claude Opus 4.8   67 in · 48 out    2.4s      $0.0015
GPT-5.5           45 in · 71 out    4.2s      $0.0024
Gemini 3.1 Pro    38 in · 196 out   7.5s      $0.0024

Code

6 models

Write an iterative Fibonacci function.

model             tokens            latency*  cost to finish
GPT-5.4 mini      30 in · 75 out    1.4s      $0.00036
Gemini 2.5 Flash  19 in · 183 out   1.5s      $0.00046
Claude Haiku 4.5  28 in · 248 out   2.2s      $0.0013
GPT-5.5           30 in · 84 out    4.9s      $0.0027
Claude Opus 4.8   40 in · 119 out   2.4s      $0.0032
Gemini 3.1 Pro    19 in · 496 out   8.6s      $0.0060

Reason

6 models

Solve a multi-step word problem step by step.

model             tokens            latency*  cost to finish
GPT-5.4 mini      63 in · 157 out   1.6s      $0.00075
Gemini 2.5 Flash  57 in · 333 out   2.9s      $0.00085
Claude Haiku 4.5  64 in · 328 out   3.1s      $0.0017
GPT-5.5           63 in · 172 out   4.5s      $0.0055
Gemini 3.1 Pro    57 in · 796 out   9.2s      $0.0097
Claude Opus 4.8   75 in · 401 out   5.2s      $0.0104

Cost is the headline: real token usage × the model's current price.
*Latency is a snapshot from the last run, not a live or averaged benchmark.
JSON: /cost-per-task.
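Because cost is tokens multiplied by price, neither factor alone ranks the models. As a rough illustration, the sketch below takes the rows published in the Code table on this page and sorts them two ways: by cost to finish and by output tokens. The variable names are ours, and the token counts are used exactly as displayed.

```python
# Rows copied from the "Code" table on this page:
# (model, input tokens, output tokens, cost to finish in USD)
code_task = [
    ("GPT-5.4 mini", 30, 75, 0.00036),
    ("Gemini 2.5 Flash", 19, 183, 0.00046),
    ("Claude Haiku 4.5", 28, 248, 0.0013),
    ("GPT-5.5", 30, 84, 0.0027),
    ("Claude Opus 4.8", 40, 119, 0.0032),
    ("Gemini 3.1 Pro", 19, 496, 0.0060),
]

# Rank the same rows by what the job cost and by how much text was generated.
by_cost = [row[0] for row in sorted(code_task, key=lambda r: r[3])]
by_output = [row[0] for row in sorted(code_task, key=lambda r: r[2])]

print("cheapest to finish first:", by_cost)
print("fewest output tokens first:", by_output)
print("same order?", by_cost == by_output)
```

The two orderings disagree: GPT-5.5 produced the second-fewest output tokens on this task yet finished fourth on cost, while Claude Haiku 4.5 wrote roughly three times as many tokens for about half the price.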

Why measured, not list price

Two models at the same headline rate can cost wildly different amounts to finish the same task, because they don't generate the same number of tokens. A terse model answers in 120 tokens; a reasoning model thinks for 800 before it starts. List pricing hides that entirely. The only honest way to compare is to run the task and count what was actually consumed — which is what this page does.
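The arithmetic behind that claim can be sketched in a few lines. Everything here is an assumption made for illustration: the shared rate of $10 per million output tokens is invented, the reasoning model is assumed to write the same 120-token answer after its 800 reasoning tokens, and reasoning tokens are assumed to be billed at the output rate.

```python
# Hypothetical shared rate: both models charge the same output price.
PRICE_PER_MILLION_OUTPUT_USD = 10.00  # illustrative only, not a real list price


def output_cost(tokens: int) -> float:
    """Cost in USD of generating `tokens` output tokens at the shared rate."""
    return tokens * PRICE_PER_MILLION_OUTPUT_USD / 1_000_000


terse_tokens = 120            # answers directly
reasoning_tokens = 800 + 120  # assumed: 800 reasoning tokens, then the same 120-token answer

terse_cost = output_cost(terse_tokens)
reasoning_cost = output_cost(reasoning_tokens)
ratio = reasoning_cost / terse_cost

print(f"terse model:     ${terse_cost:.4f}")
print(f"reasoning model: ${reasoning_cost:.4f}")
print(f"ratio:           {ratio:.1f}x")
```

Under these assumptions the identical headline rate yields $0.0012 for one model and $0.0092 for the other, a gap of about 7.7x that list pricing alone would never show.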

How it works

A fixed battery — summarize, extract, code, reason — runs across a small core set of models on a schedule. For each run we record the real input, output, and reasoning tokens, then compute cost from the model's current price in our catalog. No estimates, no token guesses. The battery and model set start small and grow; every figure is a real measurement with a date. See the methodology for the full approach.
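One way to picture a single measurement is as a dated record of token counts that is priced against the catalog. This is only a sketch: the `Measurement` and `Price` types, the `cost_to_finish` helper, the `example-model` entry, its prices, and the date are all hypothetical, and billing reasoning tokens at the output rate is an assumption rather than a description of the actual pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Price:
    """Catalog price in USD per million tokens (hypothetical shape)."""
    input_per_million: float
    output_per_million: float


@dataclass(frozen=True)
class Measurement:
    """Token counts recorded for one model on one task in one run."""
    model: str
    task: str
    measured_on: date
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int


def cost_to_finish(m: Measurement, catalog: dict[str, Price]) -> float:
    """Cost = real tokens x current price, using the catalog entry for the model."""
    price = catalog[m.model]
    # Assumption: hidden reasoning tokens are billed at the output rate.
    billed_output = m.output_tokens + m.reasoning_tokens
    return (
        m.input_tokens * price.input_per_million
        + billed_output * price.output_per_million
    ) / 1_000_000


# Placeholder catalog entry and run; none of these values are measured figures.
catalog = {"example-model": Price(input_per_million=1.00, output_per_million=4.00)}
run = Measurement("example-model", "summarize", date(2025, 1, 1), 80, 60, 200)

cost = cost_to_finish(run, catalog)
print(f"{run.model} / {run.task} on {run.measured_on}: ${cost:.5f}")
```

Keeping the date and the raw token counts on the record means a cost can be recomputed whenever the catalog price changes, without rerunning the task.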

Frequently asked questions

How is this different from the cost calculator?

The calculator multiplies list prices by token counts you guess. This page uses the tokens models actually consume — we run the task and count every input, output, and hidden reasoning token. A model with a cheap headline rate that thinks for 800 tokens before answering shows its true cost here.

Which models and tasks are measured?

A small core set across a fixed battery — summarize, extract structured data, write code, and reason through a problem — run on a schedule. The set is deliberately small to start and expands as traffic justifies; the measurement is real every time, never a list-price estimate.

Is the latency figure live?

No. Latency is a snapshot recorded during the last run — a rough indicator of responsiveness on that run, not a live or averaged benchmark. Network conditions, routing and load all move it. Cost is the figure to trust here; latency is context.
