Choosing an LLM — A Decision Framework by Task & Constraint

The right model is rarely the best model — it's the cheapest one that passes your test. Here's a framework that starts from your real constraint and works toward a defensible choice.

There is no single best LLM, only the best one for a task under a constraint. Picking by leaderboard rank is how teams end up paying frontier prices for work a mid-tier model does identically. This guide inverts that: name your hard constraint first, narrow by task type, then default to the cheapest model that passes a real test — escalating only when it measurably fails.


Step 1 — Start from the binding constraint

One requirement usually dominates and eliminates most options before quality even enters the picture. Identify yours:

Privacy / data residency. If data can't leave your environment, you're choosing among open models you host yourself — see running models locally. This decides the question on its own.
Budget at volume. High request volume makes the price-per-token gap between tiers enormous. If cost is binding, you're shopping the cheap and mid tiers and proving the expensive ones aren't needed.
Latency. Interactive UX needs a short time to first token and high throughput; that favors smaller models and rules out heavy reasoning models that think for thousands of tokens before answering.
Context size. If you must fit an entire codebase or a long document set in one prompt, the field narrows to large-context models — and remember you pay input rates for every token you load.
Quality ceiling. Only when the task genuinely needs frontier capability (hard reasoning, novel code, nuanced writing) does "the best model" become the right starting point.
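
The elimination step above can be sketched as a filter over a candidate list. This is only an illustration under stated assumptions: the model names, prices, latencies and context sizes are made-up placeholders, and `Candidate` and `shortlist` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                # placeholder name, not a real model
    self_hostable: bool      # can run inside your own environment
    usd_per_m_input: float   # illustrative price per million input tokens
    first_token_ms: int      # illustrative time to first token
    context_tokens: int      # maximum context window


# Made-up numbers, purely for illustration.
CANDIDATES = [
    Candidate("small-open", True, 0.10, 150, 32_000),
    Candidate("mid-hosted", False, 1.00, 400, 200_000),
    Candidate("big-reasoner", False, 10.00, 5_000, 200_000),
]


def shortlist(
    candidates: list[Candidate],
    must_self_host: bool = False,
    max_usd_per_m_input: Optional[float] = None,
    max_first_token_ms: Optional[int] = None,
    min_context_tokens: Optional[int] = None,
) -> list[Candidate]:
    """Drop every candidate that violates a hard constraint."""
    kept = []
    for c in candidates:
        if must_self_host and not c.self_hostable:
            continue
        if max_usd_per_m_input is not None and c.usd_per_m_input > max_usd_per_m_input:
            continue
        if max_first_token_ms is not None and c.first_token_ms > max_first_token_ms:
            continue
        if min_context_tokens is not None and c.context_tokens < min_context_tokens:
            continue
        kept.append(c)
    return kept
```

Calling `shortlist(CANDIDATES, must_self_host=True)` leaves only the self-hostable placeholder, which mirrors how a privacy requirement can decide the question before quality is considered.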

Step 2 — Match the task type

Different jobs reward different things. Use the task to set your shortlist:

Task	What matters	Tier to start
Classification / routing / extraction	Reliability on a narrow format; cost at volume	Budget
Summarization / RAG answers	Input price (these tasks are input-bound); faithfulness to the source	Budget → Mid
General chat / drafting	Output price, tone, latency	Mid
Code generation / editing	Correctness; sensitivity to quantization and model quality	Mid → Flagship
Hard reasoning / math / agents	Raw capability; watch reasoning-token cost	Flagship
Vision / audio	Modality support first, then the criteria above	Multimodal only
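
If the shortlist is built in code, the table above could be kept as a small lookup. The sketch below transcribes the starting tiers from the table; the task keys and the `tiers_to_try` helper are hypothetical names chosen for this example, and vision or audio work is assumed to be handled by filtering for modality support before this lookup.

```python
# Starting tiers transcribed from the table above, cheapest first.
# The dictionary keys are hypothetical labels for this sketch.
STARTING_TIERS: dict[str, tuple[str, ...]] = {
    "classification": ("budget",),
    "routing": ("budget",),
    "extraction": ("budget",),
    "summarization": ("budget", "mid"),
    "chat": ("mid",),
    "code": ("mid", "flagship"),
    "reasoning": ("flagship",),
}


def tiers_to_try(task: str) -> tuple[str, ...]:
    """Return the tiers to evaluate for a task, cheapest first."""
    try:
        return STARTING_TIERS[task]
    except KeyError:
        raise ValueError(f"unknown task type: {task!r}") from None
```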

Step 3 — Default cheap, escalate on evidence

This is the rule that saves the most money: start with the cheapest model that could plausibly work, and only move up when it fails a test you can point to. The price gap between a flagship and a capable mid-tier model can be 10–50×, while the quality gap on many production tasks is small or zero. The expensive model is worth it exactly when you can see the difference on your own task, and worth nothing when you can't.

Build a tiny eval before you choose. Twenty to fifty real examples with known-good answers beat any benchmark or vibe. Run your shortlist against it, score correctness, then read the cost next to the score. The decision usually makes itself.
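
A tiny eval of this kind might look like the sketch below. It assumes a "model" is any function from prompt to answer (in practice a wrapper around an API or local inference call), scores by exact match, and uses stub functions and made-up prices in place of real models. The names `accuracy` and `cheapest_passing` are hypothetical.

```python
from typing import Callable, Optional

# Assumption: a model is any callable from prompt text to answer text.
Model = Callable[[str], str]

# A real eval would use 20-50 examples drawn from production traffic.
EXAMPLES = [
    ("Refund my order", "billing"),
    ("App crashes on login", "bug"),
    ("How do I export data?", "how-to"),
    ("Charged twice this month", "billing"),
]


def accuracy(model: Model, examples: list[tuple[str, str]]) -> float:
    """Fraction of examples where the model's answer matches exactly."""
    hits = sum(
        1 for prompt, expected in examples
        if model(prompt).strip().lower() == expected
    )
    return hits / len(examples)


def cheapest_passing(
    models: list[tuple[str, float, Model]],
    examples: list[tuple[str, str]],
    threshold: float,
) -> Optional[str]:
    """Walk models from cheapest to priciest; return the first that passes."""
    for name, _price, model in sorted(models, key=lambda m: m[1]):
        if accuracy(model, examples) >= threshold:
            return name
    return None


def cheap_stub(prompt: str) -> str:
    # Keyword rules standing in for a budget model.
    text = prompt.lower()
    if "refund" in text or "charged" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "other"  # misses the how-to example


def strong_stub(prompt: str) -> str:
    # Stand-in for a stronger model: always returns the expected label.
    return dict(EXAMPLES)[prompt]


# (name, made-up price per million tokens, callable)
MODELS = [("strong", 10.0, strong_stub), ("cheap", 0.5, cheap_stub)]
```

With a pass threshold of 0.7 the cheap stub is selected; raising the threshold to 0.9 escalates to the strong one, which is the "escalate on evidence" rule expressed as a loop.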

Score more than correctness. A model can be right and still wrong for you if it's too slow, ignores your output format, refuses reasonable requests, or rambles when you need brevity. Track the things that actually break your product: format adherence (does it return valid JSON every time?), latency at your prompt size, refusal rate on legitimate inputs, and output length — that last one feeds straight back into cost. A model that passes correctness but fails format one time in twenty can be more expensive to ship than a slightly less capable one that never breaks the contract.
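
The extra metrics named above can be collected in the same pass as correctness. The following is a rough sketch, assuming each run has been recorded as raw output text plus a latency measured by the caller. The refusal check is a crude prefix heuristic and the word count is only a proxy for output tokens; both are assumptions of this example, as are the names `Run` and `scorecard`.

```python
import json
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    output: str       # raw model output
    latency_s: float  # wall-clock seconds, measured by the caller


# Crude refusal heuristic (an assumption); tune it for the model in use.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def is_valid_json(text: str) -> bool:
    """True if the whole output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def is_refusal(text: str) -> bool:
    """True if the output starts with one of the refusal markers."""
    return text.strip().lower().startswith(REFUSAL_MARKERS)


def scorecard(runs: list[Run]) -> dict[str, float]:
    """Summarize format adherence, refusals, latency and output length."""
    n = len(runs)
    return {
        "json_valid_rate": sum(is_valid_json(r.output) for r in runs) / n,
        "refusal_rate": sum(is_refusal(r.output) for r in runs) / n,
        "mean_latency_s": mean(r.latency_s for r in runs),
        # Word count is a rough stand-in for output tokens.
        "mean_output_words": mean(len(r.output.split()) for r in runs),
    }
```

An output such as `Sure! {"label": "bug"}` counts as a format failure here because the whole string must parse, which is usually what a downstream JSON consumer requires.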

Step 4 — Consider routing, not just picking

The best production answer is often "more than one model." Route the easy majority of requests to a cheap, fast model and the hard minority to a stronger one — by a lightweight classifier, or by escalating when the cheap model signals low confidence. This cascade pattern routinely cuts cost severalfold versus sending everything to the top model, because most requests are easy and only a few are genuinely hard.
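
A confidence-based cascade can be sketched in a few lines. The interface here is hypothetical: each model is assumed to return an answer together with a confidence in [0, 1], which real APIs rarely provide directly (it might come from log-probabilities, a self-check prompt, or a separate classifier). The stub models and the cost units are made up for illustration.

```python
from typing import Callable, Tuple

# Hypothetical interface: each model returns (answer, confidence in [0, 1]).
ScoredModel = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, cheap: ScoredModel, strong: ScoredModel,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate when it reports low confidence."""
    answer, confidence = cheap(prompt)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = strong(prompt)
    return answer, "strong"


def blended_cost(escalation_rate: float, cheap_cost: float,
                 strong_cost: float) -> float:
    """Average cost per request: every request pays for the cheap call,
    and escalated requests also pay for the strong call."""
    return cheap_cost + escalation_rate * strong_cost


# Stand-in models: the cheap one is only confident about short prompts.
def cheap_stub(prompt: str) -> Tuple[str, float]:
    confidence = 0.95 if len(prompt) < 40 else 0.3
    return "cheap answer", confidence


def strong_stub(prompt: str) -> Tuple[str, float]:
    return "strong answer", 0.99
```

With made-up costs of 1 unit for the cheap call and 20 for the strong one, a 10% escalation rate gives a blended cost of about 3 units per request against 20 for sending everything to the strong model. The saving shrinks as the escalation rate rises, so that rate is worth measuring on real traffic.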

Step 5 — Price it, then verify the real cost

With a shortlist that passes your eval, line the finalists up on price. The calculator gives the monthly figure for your token volumes, and the comparison table filters by context and provider. Then check the measured cost-per-task index: a model with a tempting list price can cost more per finished task if it's verbose or reasoning-heavy, and that only shows up once you count the tokens a real job consumes. List price gets you a shortlist; measured cost makes the final call.
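
The gap between list price and measured cost comes down to simple arithmetic. The sketch below assumes reasoning tokens are billed at the output rate (check the provider's pricing, since this varies) and uses made-up token counts and prices; `cost_per_task` and `monthly_cost` are hypothetical helper names.

```python
def cost_per_task(input_tokens: int, output_tokens: int, reasoning_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one finished task.

    Assumption: reasoning tokens are billed at the output rate.
    """
    billed_output = output_tokens + reasoning_tokens
    total = input_tokens * usd_per_m_input + billed_output * usd_per_m_output
    return total / 1_000_000


def monthly_cost(tasks_per_month: int, usd_per_task: float) -> float:
    """Scale a per-task cost to a monthly bill."""
    return tasks_per_month * usd_per_task


# Made-up numbers: "terse" has twice the list price of "verbose",
# but writes far fewer output tokens for the same job.
verbose = cost_per_task(2_000, 3_000, 0, usd_per_m_input=1.0, usd_per_m_output=4.0)
terse = cost_per_task(2_000, 500, 0, usd_per_m_input=2.0, usd_per_m_output=8.0)
```

In this illustration the model with the higher list price finishes the task for less (roughly $0.008 against $0.014), which is the effect a measured cost-per-task figure is meant to expose.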

Common mistakes

Choosing by benchmark leaderboard. Public benchmarks measure general capability, not your task. A model that tops a reasoning chart can underperform a cheaper one on your specific extraction format. Your eval beats their benchmark.
Over-provisioning "to be safe." Defaulting to the flagship because it "can't hurt" quietly multiplies your bill for quality you can't observe. It can hurt — on the invoice.
Ignoring output and reasoning tokens. Picking on input price alone, then being surprised by an output-bound or reasoning-heavy bill, is the most common costing error.
Never revisiting the choice. Prices fall and new models ship constantly; a decision made a year ago is probably no longer optimal. Recheck periodically against the price history.
Treating the decision as permanent. Abstract behind a thin interface so you can swap models — the right choice changes, and lock-in is expensive.
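
The thin interface mentioned in the last item could be as small as the sketch below. `TextModel`, `EchoModel`, `BACKENDS` and `summarize` are hypothetical names for this example; a real backend class would wrap a provider SDK or a local runtime behind the same `complete` method.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend that tags and echoes the prompt."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def complete(self, prompt: str) -> str:
        return f"[{self.tag}] {prompt}"


# Swapping models becomes a configuration change, not a code change.
BACKENDS: dict[str, TextModel] = {
    "budget": EchoModel("budget"),
    "flagship": EchoModel("flagship"),
}


def summarize(text: str, model: TextModel) -> str:
    """Application code sees only the TextModel interface."""
    return model.complete(f"Summarize: {text}")
```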

A default ladder

When in doubt, climb only as far as you need: a budget model for narrow, high-volume, format-bound tasks; a mid-tier model for general chat, drafting and most retrieval; a flagship only for hard reasoning, novel code, or work where a visible quality gap justifies a multiple on the bill. Most systems live on the bottom two rungs and visit the top one rarely.

Frequently asked questions

Should I default to the most capable model?

No — default to the cheapest model that passes your evaluation, and escalate only when it measurably fails. Frontier models cost 10–50× more than capable mid-tier ones, and for a large share of production tasks (classification, extraction, routing, straightforward chat) the cheaper model is indistinguishable on the work that matters. Reserve the expensive models for the tasks where you can actually see the difference.

How many models should a real system use?

Often more than one. A common, cost-effective pattern is to route by difficulty: a cheap, fast model handles the bulk of requests, and a stronger model handles the hard minority — either by a routing classifier or by escalating when the cheap model signals low confidence. This 'model cascade' can cut cost severalfold versus sending everything to the top model, with little quality loss.

When is a local model the right call?

When privacy or data residency is non-negotiable, when you have steady high volume that makes fixed GPU cost cheaper than per-token API fees, or when you need to run offline. For spiky or low volume, a hosted API is almost always cheaper and far less work than owning hardware. The break-even depends on your tokens per month — model it before buying a GPU.
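
The break-even can be modeled with one division. This sketch assumes the local setup has a fixed monthly cost (hardware amortization, power, operations), a negligible marginal cost per token, and enough throughput to serve the volume. All three are simplifications, and the dollar figures in the usage note are made up.

```python
def break_even_tokens_per_month(fixed_usd_per_month: float,
                                api_usd_per_m_tokens: float) -> float:
    """Monthly token volume at which the API bill equals the fixed local cost."""
    return fixed_usd_per_month / api_usd_per_m_tokens * 1_000_000


def local_is_cheaper(tokens_per_month: float, fixed_usd_per_month: float,
                     api_usd_per_m_tokens: float) -> bool:
    """Compare the fixed local cost with the API bill at a given volume."""
    api_bill = tokens_per_month / 1_000_000 * api_usd_per_m_tokens
    return fixed_usd_per_month < api_bill
```

For example, with an assumed fixed cost of $1,500 per month and a blended API price of $3 per million tokens, the break-even is 500 million tokens per month; below that volume the hosted API is the cheaper option under these assumptions.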

Related

Model comparison: filter every tracked model by price, context and provider
Compare models: head-to-head pricing and specs for the leading models
API cost calculator: cost of each candidate on your workload
Real cost-per-task: measured cost to finish representative tasks
