Run LLMs locally
98 model-and-GPU combinations with the VRAM math worked out: what fits, what doesn't, the quantization to use, and the tokens/sec to expect. A green check means it fits at Q4_K_M, the popular default. Need a combination that isn't here? The VRAM calculator covers every pairing.
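- Llama 3.1 8B: 8B parameters
- Gemma 2 9B: 9.2B parameters
- Mistral 7B: 7.3B parameters
- Qwen2.5 7B: 7.6B parameters
- Phi-3 Medium 14B: 14B parameters
- Gemma 2 27B: 27.2B parameters
- Qwen2.5 32B: 32.5B parameters
- Command R 35B: 35B parameters
- Mixtral 8x7B: 46.7B parameters
- Llama 3.3 70B: 70B parameters
- Qwen2.5 72B: 72.7B parameters
- Mixtral 8x22B: 141B parameters
- Llama 3.1 405B: 405B parameters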
How to read these
Running a model locally means fitting its weights in GPU memory. The footprint is roughly the parameter count times the bytes per weight of your quantization, plus about 20% for the KV cache and runtime overhead. Each page works that out at every quantization, tells you whether the result fits the card's usable VRAM, and estimates speed from memory bandwidth. Where a model won't fit, we point to the smallest GPU that will, and to a way to rent it by the hour instead of buying. The green/amber dot above is a quick fit check at Q4_K_M.
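The arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual code. The bits-per-weight figures are approximate assumptions (real files vary by model), the function names are hypothetical, and the speed figure is only an upper bound that assumes every weight is read once per generated token.

```python
# Approximate bits per weight for common quantizations.
# ASSUMPTION: ballpark figures; actual file sizes vary by model and tool.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

# "About 20%" for the KV cache and runtime, as stated in the text.
OVERHEAD = 0.20


def weights_gb(params_billion, quant):
    """Size of the weights alone, in GB (billions of bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def footprint_gb(params_billion, quant):
    """Weights plus the ~20% allowance for KV cache and runtime."""
    return weights_gb(params_billion, quant) * (1 + OVERHEAD)


def fits(params_billion, quant, usable_vram_gb):
    """True if the estimated footprint fits in the card's usable VRAM."""
    return footprint_gb(params_billion, quant) <= usable_vram_gb


def est_tokens_per_sec(params_billion, quant, bandwidth_gb_s):
    """Bandwidth-bound ceiling for a dense model: one full read of the
    weights per token. Mixture-of-experts models read fewer weights per
    token, so this underestimates them."""
    return bandwidth_gb_s / weights_gb(params_billion, quant)


def smallest_fit(params_billion, quant, gpus):
    """Name of the smallest GPU that fits, or None.
    `gpus` maps a GPU name to its usable VRAM in GB (caller-supplied)."""
    candidates = [(vram, name) for name, vram in gpus.items()
                  if fits(params_billion, quant, vram)]
    return min(candidates)[1] if candidates else None
```

Under these assumptions, an 8B model at Q4_K_M comes to about 5.8 GB (`footprint_gb(8, "Q4_K_M")`), while a 70B model at the same quantization lands near 51 GB and so does not fit a 24 GB card.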
Related
- VRAM calculator — pick any model, quant and GPU.
- Running models locally — hardware reality and what actually runs.
- Quantization explained — Q4 vs Q5 vs Q8 quality.