Run Mixtral 8x22B on NVIDIA L40S

141B parameters on a 48 GB card. It won't fit in this card's memory at any usable quantization — here's the math, and what will run it.

verdict · Mixtral 8x22B on NVIDIA L40S

Won't fit

even Q3_K_M needs 72.8 GB · only 45.6 GB usable

too big for this card — run it in the cloud

Smallest GPU that fits at Q4_K_M: H200 141GB

Rent H200 141GB by the hour

VRAM needed and fit for Mixtral 8x22B on NVIDIA L40S by quantization.
quantization	vram needed	fits 48gb?	tokens/sec	quality
FP16 (full)	338.4 GB	✗ no	—	Reference quality — no quantization loss.
Q8_0	179.4 GB	✗ no	—	Near-lossless; rarely worth the extra space over Q6.
Q6_K	138.7 GB	✗ no	—	Virtually indistinguishable from full precision.
Q5_K_M	116.7 GB	✗ no	—	Minor loss; an excellent quality-vs-size balance.
Q4_K_M	94.8 GB	✗ no	—	Small but measurable loss; the popular default.
Q3_K_M	72.8 GB	✗ no	—	Noticeable degradation; only when you're tight on VRAM.

Weights = params × bytes/weight, +20% for KV cache & runtime; usable VRAM is 95% of nameplate. Tokens/sec is a bandwidth ceiling (864 GB/s) — real throughput is lower with long context. Try other combinations →

Why Mixtral 8x22B won't fit a NVIDIA L40S

Mixtral 8x22B has 141 billion parameters. Even quantized hard to Q3_K_M, its weights plus KV-cache overhead come to roughly 73 GB, well past the NVIDIA L40S's 46 GB of usable VRAM. Spilling the overflow to system RAM (CPU offload) works but can cut throughput tenfold. To run it at the popular Q4_K_M default you'd want a H200 141GB or larger — or rent one by the hour rather than buy.

Quantization is the lever

Each step down in precision shrinks the model: FP16 needs about 338 GB, Q4_K_M about 95 GB — a 72% reduction for a small, usually acceptable quality cost. Q5_K_M and Q6_K are near-lossless if you have the headroom; drop to Q3 only when you're genuinely out of VRAM. The quantization guide covers the tradeoffs in detail.

Frequently asked questions

Can a NVIDIA L40S run Mixtral 8x22B?

Not in VRAM. Even at the smallest practical quantization (Q3_K_M), Mixtral 8x22B needs about 73 GB versus the NVIDIA L40S's 46 GB usable. You'd need a larger GPU — a H200 141GB fits it at Q4_K_M, or to rent one by the hour.

How much VRAM does Mixtral 8x22B need?

Mixtral 8x22B is a 141B-parameter model. At FP16 that's about 338 GB; at Q4_K_M (the popular default) about 95 GB, including ~20% for the KV cache and runtime. Quantization is the main lever — see the per-quant table above.

Other combinations

Mixtral 8x22B on other GPUs: RTX A6000, A100 40GB, A100 80GB, H100 80GB, H200 141GB, Apple M3 Max 128GB

Other models on the NVIDIA L40S: Gemma 2 27B, Qwen2.5 32B, Command R 35B, Mixtral 8x7B, Llama 3.3 70B, Qwen2.5 72B

VRAM calculator — any model, quant and GPU.
Running models locally — the hardware reality.
All run-locally combinations →