Run Llama 3.1 405B on Apple M2 Ultra 192GB

405B parameters on a machine with 192 GB of unified memory. The model won't fit in that memory at any usable quantization. Here's the math, and what will run it.

Verdict: won't fit. Even Q3_K_M needs 209.0 GB, but only 138.2 GB is usable.

VRAM needed and fit for Llama 3.1 405B on Apple M2 Ultra 192GB by quantization.
Quantization	VRAM needed	Fits 192 GB?	Tokens/sec	Quality
FP16 (full)	972.0 GB	✗ no	—	Reference quality — no quantization loss.
Q8_0	515.2 GB	✗ no	—	Near-lossless; rarely worth the extra space over Q6.
Q6_K	398.5 GB	✗ no	—	Virtually indistinguishable from full precision.
Q5_K_M	335.3 GB	✗ no	—	Minor loss; an excellent quality-vs-size balance.
Q4_K_M	272.2 GB	✗ no	—	Small but measurable loss; the popular default.
Q3_K_M	209.0 GB	✗ no	—	Noticeable degradation; only when you're tight on VRAM.

VRAM needed = params × bytes/weight (the weights), plus 20% for KV cache and runtime. Usable VRAM is taken as 72% of nameplate for Apple unified memory, so 192 GB gives 138.2 GB. Tokens/sec is a bandwidth ceiling (800 GB/s); real throughput is lower with long context.
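The sizing rule in the note above can be sketched in a few lines of Python. The bits-per-weight values are not stated on this page; they are assumptions back-derived from the table (for example, 209.0 GB / 1.2 / 405B parameters gives about 3.44 bits for Q3_K_M), and real quantized files vary by model. The name `needed_gb` is illustrative, not part of any tool.

```python
# Sketch of the sizing rule: needed = params x bytes/weight, plus 20% overhead.
# BITS_PER_WEIGHT is an assumption back-derived from the table on this page,
# not an official figure for any quantization format.

PARAMS = 405e9            # Llama 3.1 405B parameter count
OVERHEAD = 1.20           # +20% for KV cache and runtime
NAMEPLATE_GB = 192.0      # Apple M2 Ultra unified memory
USABLE_FRACTION = 0.72    # share of unified memory the page treats as usable

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.48,
    "Q6_K": 6.56,
    "Q5_K_M": 5.52,
    "Q4_K_M": 4.48,
    "Q3_K_M": 3.44,
}


def needed_gb(params, bits_per_weight, overhead=OVERHEAD):
    """Estimated memory in GB (1 GB = 1e9 bytes) for weights plus overhead."""
    weights_gb = params * bits_per_weight / 8 / 1e9
    return weights_gb * overhead


usable_gb = NAMEPLATE_GB * USABLE_FRACTION

for quant, bits in BITS_PER_WEIGHT.items():
    need = needed_gb(PARAMS, bits)
    verdict = "fits" if need <= usable_gb else "won't fit"
    print(f"{quant:8s} {need:7.1f} GB  {verdict} in {usable_gb:.1f} GB usable")
```

Under these assumed bit widths the loop reproduces the table's figures, and every row exceeds the 138.2 GB usable budget.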

Why Llama 3.1 405B won't fit an Apple M2 Ultra 192GB

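Llama 3.1 405B has 405 billion parameters. Even quantized hard to Q3_K_M, its weights plus KV-cache and runtime overhead come to roughly 209 GB, well past the Apple M2 Ultra 192GB's 138 GB of usable VRAM and more than its full 192 GB of unified memory. Spilling the overflow out of fast memory can cut throughput tenfold, and on Apple silicon there is no separate pool of system RAM to spill into: the CPU and GPU share the same 192 GB, so the overflow would have to be paged from disk.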
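To get a feel for why overflow hurts throughput, here is a rough sketch of a bandwidth ceiling. It assumes the common rule of thumb that generating one token reads every weight once, so tokens/sec is at most bandwidth divided by weight size; the page does not spell out its own formula. The 800 GB/s figure comes from this page, while `SLOW_BW` is a purely hypothetical placeholder for whatever slower tier holds the overflow.

```python
# Rough bandwidth-ceiling sketch. Assumption: each generated token streams
# all weights once, so time per token = bytes read / bandwidth.

def ceiling_tokens_per_sec(weights_gb, bandwidth_gb_s):
    """Upper bound on decode speed when all weights sit in one memory tier."""
    return bandwidth_gb_s / weights_gb


def split_ceiling_tokens_per_sec(fast_gb, slow_gb, fast_bw_gb_s, slow_bw_gb_s):
    """Upper bound when part of the weights must be read from a slower tier."""
    seconds_per_token = fast_gb / fast_bw_gb_s + slow_gb / slow_bw_gb_s
    return 1.0 / seconds_per_token


FAST_BW = 800.0            # GB/s, memory bandwidth quoted on this page
SLOW_BW = 8.0              # GB/s, hypothetical slow tier (not a measured value)
WEIGHTS_GB = 209.0 / 1.2   # Q3_K_M weights without the 20% overhead
USABLE_GB = 138.2          # usable memory from the table above

# Optimistic: pretends all usable memory holds weights and ignores the KV cache.
all_fast = ceiling_tokens_per_sec(WEIGHTS_GB, FAST_BW)
spilled = split_ceiling_tokens_per_sec(
    USABLE_GB, WEIGHTS_GB - USABLE_GB, FAST_BW, SLOW_BW
)

print(f"hypothetical all-in-memory ceiling: {all_fast:.2f} tokens/sec")
print(f"ceiling with overflow on slow tier: {spilled:.2f} tokens/sec")
```

The slow tier dominates the time per token even though it holds the smaller share of the weights, which is why a modest overflow can cost far more than its size suggests.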

Quantization is the lever

Each step down in precision shrinks the model: FP16 needs about 972 GB, Q4_K_M about 272 GB — a 72% reduction for a small, usually acceptable quality cost. Q5_K_M and Q6_K are near-lossless if you have the headroom; drop to Q3 only when you're genuinely out of VRAM. The quantization guide covers the tradeoffs in detail.
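The 72% figure can be checked with a short sketch. Because the 20% overhead and the parameter count apply equally to every row, the saving relative to FP16 depends only on bits per weight. The bit widths below are assumptions back-derived from the table on this page, not official values.

```python
# Relative size reduction versus FP16. Overhead and parameter count cancel,
# leaving only the ratio of bits per weight.
# Bit widths are assumptions back-derived from this page's table.

FP16_BITS = 16.0

BITS_PER_WEIGHT = {
    "Q8_0": 8.48,
    "Q6_K": 6.56,
    "Q5_K_M": 5.52,
    "Q4_K_M": 4.48,
    "Q3_K_M": 3.44,
}


def reduction_vs_fp16(bits_per_weight):
    """Fraction of FP16 memory saved at the given bits per weight."""
    return 1.0 - bits_per_weight / FP16_BITS


for quant, bits in BITS_PER_WEIGHT.items():
    print(f"{quant:8s} {reduction_vs_fp16(bits):.0%} smaller than FP16")
```

The same ratio falls out of the table directly: 1 - 272.2 / 972.0 is about 0.72 for Q4_K_M.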

Frequently asked questions

Can an Apple M2 Ultra 192GB run Llama 3.1 405B?

Not in VRAM. Even at the smallest practical quantization (Q3_K_M), Llama 3.1 405B needs about 209 GB versus the Apple M2 Ultra 192GB's 138 GB usable. You'd need a larger GPU, or to rent one by the hour.

How much VRAM does Llama 3.1 405B need?

Llama 3.1 405B is a 405B-parameter model. At FP16 that's about 972 GB; at Q4_K_M (the popular default) about 272 GB, including ~20% for the KV cache and runtime. Quantization is the main lever — see the per-quant table above.

Other combinations

Llama 3.1 405B on other GPUs: H200 141GB, Apple M3 Max 128GB

Other models on the Apple M2 Ultra 192GB: Llama 3.3 70B, Qwen2.5 72B, Mixtral 8x22B
