Run Qwen2.5 32B on RTX 4070 Ti
32.5B parameters on a 12 GB card. It won't fit in this card's memory at any usable quantization — here's the math, and what will run it.
| Quantization | VRAM needed | Fits 12 GB? | Tokens/sec | Quality |
|---|---|---|---|---|
| FP16 (full) | 78.0 GB | ✗ no | — | Reference quality — no quantization loss. |
| Q8_0 | 41.3 GB | ✗ no | — | Near-lossless; rarely worth the extra space over Q6. |
| Q6_K | 32.0 GB | ✗ no | — | Virtually indistinguishable from full precision. |
| Q5_K_M | 26.9 GB | ✗ no | — | Minor loss; an excellent quality-vs-size balance. |
| Q4_K_M | 21.8 GB | ✗ no | — | Small but measurable loss; the popular default. |
| Q3_K_M | 16.8 GB | ✗ no | — | Noticeable degradation; only when you're tight on VRAM. |
VRAM needed = params × bytes/weight, plus 20% for KV cache and runtime; usable VRAM is taken as 95% of nameplate (11.4 GB on this card). Tokens/sec is a ceiling set by memory bandwidth (504 GB/s); real throughput is lower with long context.
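The sizing rule in the note can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual calculator: the function names are hypothetical, and only the FP16 case is computed from scratch because 16 bits per weight is exact, whereas the effective bits per weight of the K-quants vary slightly by model and are not stated here.

```python
# Sketch of the sizing rule from the table note. Names are illustrative.

OVERHEAD = 0.20         # +20% for KV cache and runtime, per the note
USABLE_FRACTION = 0.95  # usable VRAM as a share of nameplate, per the note


def estimate_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights (params x bytes/weight) plus the overhead allowance, in GB."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + OVERHEAD)


def fits(vram_needed_gb: float, nameplate_gb: float) -> bool:
    """True if the estimate is within the usable share of the card's VRAM."""
    return vram_needed_gb <= nameplate_gb * USABLE_FRACTION


fp16_gb = estimate_vram_gb(32.5, 16)         # FP16 stores 16 bits per weight
print(round(fp16_gb, 1), fits(fp16_gb, 12))  # 78.0 False
print(fits(16.8, 12))                        # Q3_K_M figure from the table: False
```

Feeding the table's own figures into `fits` reproduces the "fits 12 GB?" column: every row exceeds the 11.4 GB usable budget.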
Why Qwen2.5 32B won't fit on an RTX 4070 Ti
Qwen2.5 32B has 32.5 billion parameters. Even quantized hard to Q3_K_M, its weights plus KV-cache overhead come to roughly 17 GB, well past the RTX 4070 Ti's roughly 11 GB of usable VRAM. Spilling the overflow to system RAM (CPU offload) works but can cut throughput tenfold. To run it at the popular Q4_K_M default you'd want an RTX 3090 or larger, or you could rent one by the hour rather than buy.
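To put numbers on the shortfall, the sketch below works out how much of the model would spill to system RAM on this card, and which of the GPUs named on this page clear the Q4_K_M requirement. It assumes the same 95% usable-VRAM rule as the table; the helper names are hypothetical, and the VRAM sizes are the cards' nameplate capacities.

```python
USABLE_FRACTION = 0.95  # usable share of nameplate VRAM, as in the table note

# Nameplate VRAM in GB for the cards named on this page.
GPUS_GB = {
    "RTX 4070 Ti": 12,
    "RTX 3060 12GB": 12,
    "RTX 4080": 16,
    "RTX 3090": 24,
    "RTX 4090": 24,
    "NVIDIA L4": 24,
    "RTX A6000": 48,
}


def overflow_gb(needed_gb: float, nameplate_gb: float) -> float:
    """GB that would have to live in system RAM (0.0 if the model fits)."""
    return max(0.0, needed_gb - nameplate_gb * USABLE_FRACTION)


def cards_that_fit(needed_gb: float) -> list[str]:
    """Cards whose usable VRAM covers the requirement, in listing order."""
    return [name for name, gb in GPUS_GB.items()
            if needed_gb <= gb * USABLE_FRACTION]


print(round(overflow_gb(16.8, 12), 1))  # Q3_K_M on the RTX 4070 Ti: 5.4
print(cards_that_fit(21.8))             # Q4_K_M requirement from the table
```

Under these assumptions about a third of the Q3_K_M footprint would sit outside VRAM, and the 24 GB cards are the smallest that hold Q4_K_M entirely on the GPU.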
Quantization is the lever
Each step down in precision shrinks the model: FP16 needs about 78 GB, Q4_K_M about 22 GB — a 72% reduction for a small, usually acceptable quality cost. Q5_K_M and Q6_K are near-lossless if you have the headroom; drop to Q3 only when you're genuinely out of VRAM. The quantization guide covers the tradeoffs in detail.
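The reduction figure is simple arithmetic on the table's VRAM column. As a quick check, this sketch computes the saving of each quantization relative to FP16 from the figures on this page; the variable names are illustrative only.

```python
# VRAM needed (GB) per quantization, copied from the table on this page.
VRAM_GB = {
    "FP16": 78.0,
    "Q8_0": 41.3,
    "Q6_K": 32.0,
    "Q5_K_M": 26.9,
    "Q4_K_M": 21.8,
    "Q3_K_M": 16.8,
}

baseline = VRAM_GB["FP16"]

# Percentage saved versus FP16, rounded to the nearest whole percent.
reduction_pct = {
    quant: round(100 * (1 - gb / baseline))
    for quant, gb in VRAM_GB.items()
}

for quant, pct in reduction_pct.items():
    print(f"{quant:7s} {VRAM_GB[quant]:5.1f} GB  -{pct}%")
```

The Q4_K_M row comes out at 72%, matching the figure quoted above.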
Frequently asked questions
Can an RTX 4070 Ti run Qwen2.5 32B?
Not in VRAM. Even at the smallest practical quantization (Q3_K_M), Qwen2.5 32B needs about 17 GB versus the RTX 4070 Ti's roughly 11 GB usable. You'd need a larger GPU (an RTX 3090 fits it at Q4_K_M), or you could rent one by the hour.
How much VRAM does Qwen2.5 32B need?
Qwen2.5 32B is a 32.5B-parameter model. At FP16 that's about 78 GB; at Q4_K_M (the popular default) about 22 GB, including ~20% for the KV cache and runtime. Quantization is the main lever — see the per-quant table above.
Other combinations
Qwen2.5 32B on other GPUs: RTX 3060 12GB, RTX 4080, RTX 3090, RTX 4090, NVIDIA L4, RTX A6000
Other models on the RTX 4070 Ti: Llama 3.1 8B, Gemma 2 9B, Mistral 7B, Qwen2.5 7B, Phi-3 Medium 14B, Gemma 2 27B
Related
- VRAM calculator — any model, quant and GPU.
- Running models locally — the hardware reality.
- All run-locally combinations →