Run LLMs locally

98 model-and-GPU combinations with the VRAM math worked out: what fits, what doesn't, the quantization to use, and the tokens/sec to expect. A green check means it fits at Q4_K_M, the popular default. Need a combination that isn't here? The VRAM calculator covers every pairing.

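Llama 3.1 8B: 8B parameters

Gemma 2 9B: 9.2B parameters

Mistral 7B: 7.3B parameters

Qwen2.5 7B: 7.6B parameters

Phi-3 Medium 14B: 14B parameters

Gemma 2 27B: 27.2B parameters

Qwen2.5 32B: 32.5B parameters

Command R 35B: 35B parameters

Mixtral 8x7B: 46.7B parameters

Llama 3.3 70B: 70B parameters

Qwen2.5 72B: 72.7B parameters

Mixtral 8x22B: 141B parameters

Llama 3.1 405B: 405B parameters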

How to read these

Running a model locally means fitting its weights in GPU memory. The footprint is the parameter count times the bytes per weight of your quantization, plus about 20% for the KV cache and runtime overhead. Each page works that out at every quantization, tells you whether the result fits in the card's usable VRAM, and estimates speed from memory bandwidth. Where a model won't fit, we point to the smallest GPU that will, and to a way to rent it by the hour instead of buying. The green/amber dot above is a quick fit check at Q4_K_M.
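As a rough illustration, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual code: the bytes-per-weight values are approximations we assume for illustration (Q4_K_M at about 0.6 bytes per weight, Q8_0 at about 1.06), the function names are hypothetical, and the speed estimate assumes every weight is read from memory once per generated token, which makes it an upper bound for dense models and an underestimate for mixture-of-experts models such as Mixtral.

```python
# Approximate bytes per weight for a few quantizations.
# These values are assumptions for illustration, not exact figures.
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8_0": 1.0625,
    "Q4_K_M": 0.6,
}

# About 20% on top of the weights for the KV cache and runtime overhead.
OVERHEAD = 0.20


def footprint_gb(params_billions: float, quant: str) -> float:
    """Estimated memory footprint in GB (billions of params x bytes = GB)."""
    weights_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1 + OVERHEAD)


def fits(params_billions: float, quant: str, usable_vram_gb: float) -> bool:
    """True if the estimated footprint fits in the card's usable VRAM."""
    return footprint_gb(params_billions, quant) <= usable_vram_gb


def est_tokens_per_sec(params_billions: float, quant: str,
                       bandwidth_gb_s: float) -> float:
    """Bandwidth-bound speed estimate: one full pass over the weights
    per generated token (an assumption; real throughput is lower)."""
    weights_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Hypothetical card: 24 GB usable VRAM, 1000 GB/s memory bandwidth.
    for name, params in [("Llama 3.1 8B", 8.0), ("Llama 3.3 70B", 70.0)]:
        size = footprint_gb(params, "Q4_K_M")
        ok = fits(params, "Q4_K_M", usable_vram_gb=24.0)
        speed = est_tokens_per_sec(params, "Q4_K_M", bandwidth_gb_s=1000.0)
        print(f"{name}: {size:.1f} GB at Q4_K_M, fits={ok}, ~{speed:.0f} tok/s")
```

Treat the output as a first-pass estimate: long context windows grow the KV cache beyond the flat 20% allowance, and usable VRAM is somewhat less than the figure printed on the box.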
