Question 1

How much VRAM do I need to run a model?

Accepted Answer

Roughly: parameters (in billions) × bytes-per-weight for your quantization gives the weight size in GB; add about 20% on top for the KV cache, activations and runtime. A 70B model at Q4_K_M (≈0.56 bytes/weight) needs ~47 GB (70 × 0.56 × 1.2); at FP16 (2 bytes/weight) it needs ~168 GB. Long contexts add more for the KV cache. This calculator does that math for you.
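As a rough illustration of that rule of thumb, here is a minimal Python sketch. The function name and the flat 20% `overhead` default are assumptions made for this example, not the calculator's actual code.

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_weight: float,
                     overhead: float = 0.20) -> float:
    """Estimate VRAM in GB: weight size plus a flat overhead fraction.

    The overhead stands in for KV cache, activations and runtime.
    Long contexts may need more than the default 20%.
    """
    weights_gb = params_billions * bytes_per_weight
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    # The two cases quoted in the answer above.
    print(f"70B @ Q4_K_M: {estimate_vram_gb(70, 0.56):.0f} GB")
    print(f"70B @ FP16:   {estimate_vram_gb(70, 2.0):.0f} GB")
```

Raising `overhead` is a crude way to account for a longer context; a more careful estimate would size the KV cache from the model's layer count and context length.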

Question 2

Which quantization should I use?

Accepted Answer

Q4_K_M is the popular default: a small, usually acceptable quality hit for roughly a quarter of FP16's size. Q5_K_M or Q6_K are near-lossless if you have the headroom. Drop to Q3 only when you're genuinely out of VRAM; at that level the degradation becomes noticeable. Q8 is also near-lossless but rarely worth the extra space over Q6.
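One way to turn that advice into a selection rule is to walk the quantizations from highest quality down and take the first one that fits. The sketch below does this under stated assumptions: only the Q4_K_M figure (0.56 bytes/weight) comes from this FAQ, the other bytes-per-weight values are approximate illustrative numbers, and `pick_quant` is a hypothetical helper.

```python
# Approximate bytes per weight. Only Q4_K_M (0.56) is taken from the FAQ;
# the other values are rough assumptions for illustration.
BYTES_PER_WEIGHT = {
    "Q6_K": 0.82,
    "Q5_K_M": 0.71,
    "Q4_K_M": 0.56,
    "Q3_K_M": 0.49,
}

# Highest quality first; Q3 is the last resort, as the answer suggests.
PREFERENCE = ["Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M"]


def pick_quant(params_billions: float, vram_gb: float, overhead: float = 0.20):
    """Return the highest-quality quantization that fits, or None if none do."""
    for name in PREFERENCE:
        needed_gb = params_billions * BYTES_PER_WEIGHT[name] * (1.0 + overhead)
        if needed_gb <= vram_gb:
            return name
    return None


if __name__ == "__main__":
    print(pick_quant(70, 48))  # a 70B model on a hypothetical 48 GB card
    print(pick_quant(8, 24))   # an 8B model on a hypothetical 24 GB card
```

Q8 is left out of the preference list because the answer considers it rarely worth the space over Q6.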

Question 3

Why is my tokens/sec lower than the estimate?

Accepted Answer

The estimate is a memory-bandwidth ceiling (decode reads every weight once per token, so tokens/sec cannot exceed bandwidth divided by model size), scaled by ~0.85. Real single-stream throughput is lower with long contexts, CPU offload, or thermal throttling. Batching, which amortizes each weight read over several sequences, and speculative decoding can push throughput higher. Treat the estimate as an order-of-magnitude guide, not a benchmark.
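The bandwidth ceiling can be sketched in a few lines of Python. The function name is an assumption for this example, and the 1000 GB/s bandwidth is a hypothetical round number rather than the spec of any particular GPU.

```python
def estimate_tokens_per_sec(bandwidth_gb_s: float,
                            model_size_gb: float,
                            efficiency: float = 0.85) -> float:
    """Decode-speed ceiling: each token requires one full read of the weights.

    bandwidth_gb_s: memory bandwidth in GB/s
    model_size_gb:  size of the quantized weights in GB
    efficiency:     fraction of peak bandwidth assumed achievable (~0.85)
    """
    return efficiency * bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # 70B at Q4_K_M is about 70 * 0.56 = 39.2 GB of weights.
    weights_gb = 70 * 0.56
    tps = estimate_tokens_per_sec(1000.0, weights_gb)
    print(f"Estimated ceiling: {tps:.1f} tokens/sec")
```

The model ignores KV-cache reads, which is one reason long contexts land below the estimate.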

Question 4

Can I run a model that doesn't fit in VRAM?

Accepted Answer

Yes, but slowly. Layers that don't fit spill to system RAM (CPU offload), which can drop throughput by 10× or more because system memory bandwidth is far lower than VRAM bandwidth. For interactive use, fitting the whole model in VRAM, or renting a bigger GPU by the hour, is usually worth it.
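To see why offload hurts so much, here is a simplified sketch that assumes layers run one after another, so the time per token is the time to read the VRAM-resident weights plus the time to read the RAM-resident weights. The function name and all bandwidth figures are hypothetical, and the model ignores KV cache, compute time and transfer overhead.

```python
def offload_tokens_per_sec(model_gb: float,
                           vram_gb: float,
                           gpu_bw_gb_s: float,
                           cpu_bw_gb_s: float) -> float:
    """Rough decode speed when part of the model spills to system RAM."""
    on_gpu_gb = min(model_gb, vram_gb)  # weights that fit in VRAM
    on_cpu_gb = model_gb - on_gpu_gb    # remainder offloaded to system RAM
    # Sequential layers: per-token read times add up.
    seconds_per_token = on_gpu_gb / gpu_bw_gb_s + on_cpu_gb / cpu_bw_gb_s
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Hypothetical: 40 GB of weights, 1000 GB/s VRAM, 50 GB/s system RAM.
    fits = offload_tokens_per_sec(40, 40, 1000, 50)
    spills = offload_tokens_per_sec(40, 20, 1000, 50)
    print(f"All in VRAM:  {fits:.1f} tokens/sec")
    print(f"Half offload: {spills:.1f} tokens/sec")
```

Under these assumptions the slow portion dominates: even with half the model still in VRAM, the per-token time is set almost entirely by the system-RAM reads.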
