vram calculator
Can you run it locally?
Pick a model, a quantization, and your GPU. We'll tell you whether the weights fit in VRAM, roughly how many tokens per second to expect, and what you trade for it. If it won't fit, we'll show you what will.
Estimate: weights = params × bytes/weight, +20% for KV cache & runtime; usable VRAM is 95% of nameplate (72% for Apple unified memory). Long contexts need more. Quantization explained →
How the estimate works
Every weight has to live in memory. At full precision (FP16) that's 2 bytes per parameter, so a 70-billion-parameter model is ~140 GB of weights alone. Quantization shrinks each weight: Q8 is ~1 byte, Q4_K_M ~0.56 bytes, Q3 ~0.43. On top of the weights you need headroom for the KV cache (which grows with context length), activations, and the runtime — we add ~20%. The result is the VRAM you actually need; compare it to your GPU's usable memory and you have your answer.
Quantization is the main lever
Dropping from FP16 to Q4_K_M cuts VRAM by ~72% for a small, usually acceptable quality cost — the difference between a 70B model needing a data-center GPU and fitting on a single 48 GB card. Q5/Q6 are near-lossless if you have the room; Q3 is a last resort. See the quantization guide for the quality tradeoffs in detail.
Speed is bound by memory bandwidth
Token generation reads the whole model from memory once per token, so tokens/sec is roughly the GPU's memory bandwidth divided by the model's size in VRAM. That's why an H100 (3.35 TB/s) is far faster than a 4090 (1 TB/s) on the same model, and why Apple's unified memory — large but lower bandwidth — fits big models but runs them slower. Our estimate is a ceiling; real throughput depends on context length, batching, and offload.
Frequently asked questions
How much VRAM do I need to run a model?
Roughly: parameters (in billions) × bytes-per-weight for your quantization, plus about 20% for the KV cache, activations and runtime. A 70B model at Q4_K_M (≈0.56 bytes/weight) needs ~47 GB; at FP16 it needs ~168 GB. Long contexts add more for the KV cache. This calculator does that math for you.
Which quantization should I use?
Q4_K_M is the popular default — a small, usually acceptable quality hit for roughly a quarter of FP16's size. Q5_K_M or Q6_K give near-lossless quality if you have the headroom. Drop to Q3 only when you're genuinely out of VRAM; the degradation becomes noticeable. Q8 is near-lossless but rarely worth the space over Q6.
Why is my tokens/sec lower than the estimate?
The estimate is a memory-bandwidth ceiling (decode reads the weights once per token), scaled by ~0.85. Real throughput is lower with long contexts, small batch sizes, CPU offload, or thermal throttling, and higher with speculative decoding or batching. Treat it as an order-of-magnitude guide, not a benchmark.
Can I run a model that doesn't fit in VRAM?
Yes, but slowly. Layers that don't fit spill to system RAM (CPU offload), which can drop throughput by 10× or more because CPU memory bandwidth is far lower. For interactive use, fitting the whole model in VRAM — or renting a bigger GPU by the hour — is usually worth it.
Related
- Run models locally — worked-out fit guides for popular model-and-GPU pairs.
- Running models locally — the hardware reality.
- Quantization explained — Q4 vs Q5 vs Q8 quality.
- Model comparison — hosted models and their prices.