Run LLMs locally

98 model-and-GPU combinations with the VRAM math worked out: what fits, what doesn't, the quantization to use, and the tokens/sec to expect. A green check means it fits at Q4_K_M, the popular default. Need a combination that isn't here? The VRAM calculator covers every pairing.

Llama 3.1 8B (8B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080

Gemma 2 9B (9.2B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080, RTX 3090, RTX 4090, NVIDIA L4

Mistral 7B (7.3B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080

Qwen2.5 7B (7.6B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080

Phi-3 Medium 14B (14B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080, RTX 3090, RTX 4090, NVIDIA L4

Gemma 2 27B (27.2B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080, RTX 3090, RTX 4090, NVIDIA L4, RTX A6000, NVIDIA L40S, A100 40GB

Qwen2.5 32B (32.5B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080, RTX 3090, RTX 4090, NVIDIA L4, RTX A6000, NVIDIA L40S, A100 40GB, A100 80GB, H100 80GB

Command R 35B (35B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080, RTX 3090, RTX 4090, NVIDIA L4, RTX A6000, NVIDIA L40S, A100 40GB, A100 80GB, H100 80GB, Apple M3 Max 128GB

Mixtral 8x7B (46.7B parameters)

RTX 3060 12GB, RTX 4070 Ti, RTX 4080, RTX 3090, RTX 4090, NVIDIA L4, RTX A6000, NVIDIA L40S, A100 40GB, A100 80GB, H100 80GB, Apple M3 Max 128GB

Llama 3.3 70B (70B parameters)

RTX 3090, RTX 4090, NVIDIA L4, RTX A6000, NVIDIA L40S, A100 40GB, A100 80GB, H100 80GB, H200 141GB, Apple M3 Max 128GB, Apple M2 Ultra 192GB

Qwen2.5 72B (72.7B parameters)

RTX 3090, RTX 4090, NVIDIA L4, RTX A6000, NVIDIA L40S, A100 40GB, A100 80GB, H100 80GB, H200 141GB, Apple M3 Max 128GB, Apple M2 Ultra 192GB

Mixtral 8x22B (141B parameters)

RTX A6000, NVIDIA L40S, A100 40GB, A100 80GB, H100 80GB, H200 141GB, Apple M3 Max 128GB, Apple M2 Ultra 192GB

Llama 3.1 405B (405B parameters)

H200 141GB, Apple M3 Max 128GB, Apple M2 Ultra 192GB

How to read these

Running a model locally means fitting its weights in GPU memory. The footprint is roughly the parameter count times the bytes per weight of your quantization, plus about 20% for the KV cache and runtime. Each page works that out at every quantization, tells you whether it fits the card's usable VRAM, and estimates speed from memory bandwidth. Where a model won't fit, we point to the smallest GPU that will, along with a way to rent it by the hour instead of buying. The green/amber dot above is a quick fit check at Q4_K_M.
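The arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual implementation: the bits-per-weight figures are approximate values for common GGUF quantizations, the 90% usable-VRAM fraction is a hypothetical placeholder, and the function names are made up for this example. The speed figure assumes a dense model whose full weights are read once per generated token, so treat it as an upper bound.

```python
# Approximate bits per weight for common GGUF quantizations.
# Assumption for illustration: real files vary by model and tensor mix.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "FP16": 16.0}

OVERHEAD = 0.20         # "about 20%" for KV cache and runtime, as stated in the text
USABLE_FRACTION = 0.90  # hypothetical share of VRAM actually available to the model


def weights_gb(params_billion: float, quant: str) -> float:
    """Size of the weights alone, in GB (billions of params x bytes per weight)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def footprint_gb(params_billion: float, quant: str) -> float:
    """Weights plus the ~20% allowance for KV cache and runtime."""
    return weights_gb(params_billion, quant) * (1 + OVERHEAD)


def fits(params_billion: float, quant: str, vram_gb: float) -> bool:
    """Quick fit check against the card's usable VRAM."""
    return footprint_gb(params_billion, quant) <= vram_gb * USABLE_FRACTION


def est_tokens_per_sec(params_billion: float, quant: str, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate: each token requires reading all weights once."""
    return bandwidth_gb_s / weights_gb(params_billion, quant)


if __name__ == "__main__":
    # 12 GB VRAM and 360 GB/s are the published specs of the RTX 3060 12GB.
    for name, params in [("Llama 3.1 8B", 8.0), ("Llama 3.3 70B", 70.0)]:
        need = footprint_gb(params, "Q4_K_M")
        ok = fits(params, "Q4_K_M", 12.0)
        speed = est_tokens_per_sec(params, "Q4_K_M", 360.0)
        print(f"{name}: {need:.1f} GB at Q4_K_M, fits 12 GB: {ok}, "
              f"bandwidth-bound ceiling: {speed:.0f} tok/s")
```

Under these assumptions an 8B model at Q4_K_M needs a little under 6 GB and fits a 12 GB card comfortably, while a 70B model needs roughly 50 GB and does not fit a single 24 GB card. Mixture-of-experts models such as Mixtral read only their active experts per token, so the bandwidth estimate would need the active parameter count rather than the total.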

Related

VRAM calculator: pick any model, quant and GPU.
Running models locally: hardware reality and what actually runs.
Quantization explained: Q4 vs Q5 vs Q8 quality.
