Quantization explained: Q4 vs Q5 vs Q8 vs FP16
Quantization is the single biggest lever on whether a model fits your GPU — and the one most likely to be misunderstood. Here's what each level costs you in quality, and which to actually use.
A large language model is billions of numbers: its weights. Unquantized, each weight is typically a 16-bit float (FP16, treated as full precision throughout this guide), so a 7-billion-parameter model is about 14 GB of numbers, and a 70B model is about 140 GB. Quantization stores those same weights using fewer bits each. Drop from 16 bits to roughly 4 and the model shrinks by about three-quarters, which is the difference between needing a data-center GPU and running on a card you already own.
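The arithmetic above can be checked in a few lines. This is a minimal sketch: `weight_size_gb` is a hypothetical helper name, 1 GB is taken as 10^9 bytes, and the 4-bit figure assumes exactly 4 bits per weight, which real quant formats slightly exceed.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for name, params in [("7B", 7), ("70B", 70)]:
    fp16 = weight_size_gb(params, 16)
    four_bit = weight_size_gb(params, 4)  # idealized: exactly 4 bits per weight
    saved = 1 - four_bit / fp16
    print(f"{name}: FP16 {fp16:.0f} GB, 4-bit {four_bit:.1f} GB, saved {saved:.0%}")
```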
The catch: fewer bits, less precision
Compressing each weight throws away detail. A 4-bit number can represent only 16 distinct values, so the model's parameters get rounded to a coarser grid. Modern methods make that rounding remarkably cheap in quality — but it is never free, and it gets more expensive the lower you go. The useful way to think about it: quantization trades a little accuracy for a lot of memory, and the exchange rate is excellent down to about 4 bits, then turns sharply worse.
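To make the "coarser grid" concrete, here is a toy round trip through 4 bits. It assumes a deliberately simple scheme (one shared scale, signed codes from -8 to 7) and is not the algorithm llama.cpp or any other library actually uses; the function names and sample weights are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to 4-bit signed codes (-8..7) that share a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0  # largest weight maps to code +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.88, 0.33, -0.02, 0.45]  # made-up values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# Each weight lands on the nearest grid point, so the error is at most half a step.
worst = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"grid step: {scale:.4f}, worst rounding error: {worst:.4f}")
```

Every restored value is a multiple of the grid step, which is the detail that gets thrown away. Real formats shrink that error by using a separate scale for each small block of weights.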
The levels, from biggest to smallest
The Q<n>_K_M names come from the GGUF format used by llama.cpp and Ollama. The number is the nominal bit width; _K means a modern "K-quant" that allocates bits unevenly (more to the weights that matter); _M is the medium-size variant. Effective storage per weight is a little above the nominal bit width because each block of weights also stores scaling factors alongside the quantized values.
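The gap between nominal and effective bits is easy to see with the simplest block layout. The sketch below assumes the layout of GGUF's legacy Q8_0 and Q4_0 types, where 32 weights share one 16-bit scale; K-quants use a more elaborate layout that this does not model, and `effective_bits` is a hypothetical helper.

```python
def effective_bits(block_size: int, bits_per_weight: int, scale_bits: int = 16) -> float:
    """Bits stored per weight once the per-block scale is counted."""
    payload_bits = block_size * bits_per_weight
    return (payload_bits + scale_bits) / block_size


for nominal in (8, 4):
    bits = effective_bits(block_size=32, bits_per_weight=nominal)
    print(f"nominal {nominal}-bit: {bits} bits = {bits / 8:.2f} bytes per weight")
```

For 8-bit blocks this gives 8.5 bits, or about 1.06 bytes per weight, which is the Q8_0 figure in the table below.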
| quant | bytes/weight | 70B size | quality |
|---|---|---|---|
| FP16 (full) | 2.0 | ~140 GB | Reference. No quantization loss. |
| Q8_0 | 1.06 | ~74 GB | Near-lossless. Rarely worth the space over Q6. |
| Q6_K | 0.82 | ~57 GB | Virtually indistinguishable from full precision. |
| Q5_K_M | 0.69 | ~48 GB | Minor loss. Excellent quality-vs-size balance. |
| Q4_K_M | 0.56 | ~39 GB | Small but measurable loss. The popular default. |
| Q3_K_M | 0.43 | ~30 GB | Noticeable degradation. Only when tight on VRAM. |
| Q2_K | 0.33 | ~23 GB | Often broken for real work. Avoid unless desperate. |
Sizes are weights only; add roughly 20% for the KV cache, activations and runtime to get the VRAM you actually need.
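The table and the 20% rule combine into a quick estimator for any model size. This is a rough sketch: the bytes-per-weight values are copied from the table above, the 20% overhead is the rule of thumb just stated (real overhead grows with context length), and `vram_needed_gb` is a hypothetical name.

```python
# Bytes per weight, copied from the table above.
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8_0": 1.06,
    "Q6_K": 0.82,
    "Q5_K_M": 0.69,
    "Q4_K_M": 0.56,
    "Q3_K_M": 0.43,
    "Q2_K": 0.33,
}


def vram_needed_gb(params_billion: float, quant: str, overhead: float = 0.20) -> float:
    """Weights plus a flat overhead allowance for KV cache, activations and runtime."""
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1 + overhead)


for quant in BYTES_PER_WEIGHT:
    print(f"{quant:8s} {vram_needed_gb(70, quant):6.1f} GB")
```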
Where the quality actually goes
The honest summary from the open research and community testing: down to Q5–Q6, the difference from full precision is hard to measure and usually impossible to feel. At Q4 it becomes measurable on hard benchmarks but stays small for everyday tasks. Below Q4 it falls off a cliff — Q3 is a compromise you make to fit a model at all, and Q2 frequently produces a model that looks fine until it confidently gets things wrong. The degradation is not uniform across tasks: chat tolerates quantization well; code, math, and exact structured output are far less forgiving, because a single wrong token breaks the result.
Bigger-model-lower-quant usually wins
A recurring finding is that, at a fixed VRAM budget, you're better off running a larger model quantized harder than a smaller model at high precision. A 70B at Q4 generally outperforms a 34B at Q8, despite using similar memory, because the larger model simply knows more and the Q4 rounding doesn't erase that. Quantize the model, in other words — don't downsize it — until you hit the Q3 floor.
A worked example
Take a 70B model on a 24 GB RTX 4090. At FP16 it needs ~168 GB including overhead, nowhere close. At Q4_K_M it's ~47 GB, still about double the 4090's ~23 GB of usable memory, so it won't fit on one card. You'd need ~48 GB of VRAM (an A6000, or two 4090s, where the fit is tight and leaves little room for context), or you rent a bigger GPU by the hour. A 32B model, by contrast, lands around 22 GB at Q4 and fits a single 4090 with a short context. The run-locally guides work this out for the common pairings, and the calculator does it for any combination.
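The same walk-through can be automated: try each quant from highest precision down and stop at the first one that fits. This is a sketch under the guide's own assumptions (table values for bytes per weight, a flat 20% overhead, ~23 GB usable on a 24 GB card); `best_fit` is a hypothetical helper, not part of any calculator mentioned here.

```python
# Ordered from highest precision to lowest; values from the table earlier in the guide.
BYTES_PER_WEIGHT = {
    "FP16": 2.0, "Q8_0": 1.06, "Q6_K": 0.82, "Q5_K_M": 0.69,
    "Q4_K_M": 0.56, "Q3_K_M": 0.43, "Q2_K": 0.33,
}


def best_fit(params_billion: float, usable_vram_gb: float, overhead: float = 0.20):
    """Return (quant, estimated GB) for the highest-precision quant that fits, else None."""
    for quant, bytes_per_weight in BYTES_PER_WEIGHT.items():
        needed = params_billion * bytes_per_weight * (1 + overhead)
        if needed <= usable_vram_gb:
            return quant, needed
    return None


for params, vram in [(70, 23), (70, 48), (32, 23)]:
    result = best_fit(params, vram)
    label = f"{result[0]} (~{result[1]:.0f} GB)" if result else "does not fit at any level"
    print(f"{params}B on {vram} GB usable: {label}")
```

Under these assumptions a 70B model does not fit a single 4090 even at Q2_K, while a 32B model first fits at Q4_K_M.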
Calibrated quants squeeze out more quality
Not all quants at the same bit width are equal. Importance-matrix (imatrix) quants, meaning the IQ variants and many modern K quants, are calibrated against a sample of real text so the rounding budget is spent where it matters most, preserving the weights that drive output and compressing the ones that don't. The practical upshot: an imatrix Q4 or a 4-bit IQ quant often matches what a naive 5-bit quant used to deliver, which is why the floor where quality collapses has crept downward over time. When you're choosing a file, prefer a recent K-quant or IQ build over an old Q4_0-style legacy quant at the same size: you get the smaller footprint without the older quality penalty. It's the closest thing to a free lunch in this whole tradeoff.
Formats you'll encounter
GGUF (llama.cpp, Ollama, LM Studio) is the default for local single-user inference and is where the K-quant names live; it runs on CPU, GPU, or a split of both. AWQ and GPTQ are GPU-oriented formats, most often used at 4 bits, that pair with high-throughput servers like vLLM; AWQ generally holds quality a little better at 4-bit. bitsandbytes NF4 is the on-the-fly option in the Hugging Face stack, convenient but usually slower than a prepared GGUF or AWQ. For most people reading this, a Q4_K_M or Q5_K_M GGUF is the right first choice.
Frequently asked questions
Is Q4 good enough for production?
For most chat, summarization and retrieval work, yes — Q4_K_M is the popular default precisely because the quality drop is small and often imperceptible in side-by-side use. Where it bites is precision-sensitive work: code generation, math, structured extraction and long multi-step reasoning, where small errors compound. If those matter, test Q5_K_M or Q6_K against Q4 on your own prompts before committing.
What's the difference between GGUF, GPTQ and AWQ?
They're different quantization formats. GGUF (used by llama.cpp and Ollama) is the most common for local CPU/GPU inference and is what the K-quant names like Q4_K_M refer to. GPTQ and AWQ are GPU-focused 4-bit schemes often used with vLLM and ExLlama; AWQ tends to preserve quality slightly better than naive GPTQ at the same bit width. For a single-user local setup, GGUF with a K-quant is the simplest path.
Does quantization make the model faster?
Usually yes, because token generation is bound by memory bandwidth: a smaller model means fewer bytes to read per token. A Q4_K_M model is roughly 3.5 times smaller than FP16 (0.56 versus 2.0 bytes per weight), so it can decode up to a few times faster on the same GPU, on top of fitting in less VRAM. The tradeoff is the quality cost, which is why the sweet spot for most people is Q4–Q6 rather than the smallest possible quant.
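A back-of-envelope ceiling follows from the bandwidth argument: if every weight is read once per generated token, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes a round 1000 GB/s of memory bandwidth purely for illustration, and it ignores KV cache reads and compute, so real throughput will be lower; `max_tokens_per_second` is a hypothetical helper.

```python
def max_tokens_per_second(params_billion: float, bytes_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if all weights are read once per token."""
    weights_gb = params_billion * bytes_per_weight
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 1000.0  # assumed round figure, not a measured value

fp16 = max_tokens_per_second(7, 2.0, BANDWIDTH_GB_S)
q4 = max_tokens_per_second(7, 0.56, BANDWIDTH_GB_S)
print(f"7B FP16 ceiling:   {fp16:.0f} tokens/s")
print(f"7B Q4_K_M ceiling: {q4:.0f} tokens/s ({q4 / fp16:.1f}x)")
```

The ratio between the two ceilings depends only on bytes per weight, so it stays the same whatever bandwidth figure you plug in.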
Related
- VRAM calculator — check whether a model fits at each quantization on your GPU
- Run models locally — worked-out fit guides for popular model-and-GPU pairs
- Running models locally — the broader hardware picture