
Running models locally: the hardware reality

Running an LLM on your own machine is mostly a memory problem. Get the VRAM math right and the rest follows. Here's what actually fits on the cards people own, and how fast to expect it to run.


Running a model locally means loading its weights into memory and generating tokens on your own hardware — no API, no per-token bill, no data leaving the building. The appeal is real: privacy and data residency, predictable cost at steady volume, offline capability, and total control over the model and its versions. The constraint is equally real: the whole model has to fit in memory, and that single fact decides almost everything.


The VRAM math, in one paragraph

A model's memory footprint is its parameter count times the bytes-per-weight of your quantization, plus about 20% for the KV cache, activations and runtime. At Q4_K_M (~0.56 bytes/weight) a 7B model needs ~5 GB, a 14B ~9 GB, a 34B ~22 GB, and a 70B ~47 GB. Your GPU's usable VRAM is about 95% of its nameplate (a bit less on Apple's shared memory). Compare the two and you have your answer. Quantization is the lever that moves a model between "needs a server" and "fits the card you have."
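
The rule of thumb above is easy to turn into a few lines of arithmetic. The following is a minimal sketch, not a substitute for measuring a real runtime: the constants are the approximations quoted in this guide (0.56 bytes per weight at Q4_K_M, 20% overhead, 95% usable VRAM), and the function names are made up for illustration.

```python
# Sketch of the footprint rule of thumb. All constants are this guide's
# approximations, not measured values.

BYTES_PER_WEIGHT_Q4_K_M = 0.56  # approximate bytes per weight at Q4_K_M
OVERHEAD = 0.20                 # ~20% for KV cache, activations, runtime
USABLE_FRACTION = 0.95          # share of nameplate VRAM assumed usable


def footprint_gb(params_billions, bytes_per_weight=BYTES_PER_WEIGHT_Q4_K_M):
    """Estimated memory footprint in GB (1 GB = 1e9 bytes)."""
    # Billions of parameters times bytes per weight gives GB directly.
    weights_gb = params_billions * bytes_per_weight
    return weights_gb * (1 + OVERHEAD)


def fits(params_billions, nameplate_vram_gb,
         bytes_per_weight=BYTES_PER_WEIGHT_Q4_K_M):
    """True if the estimated footprint fits in the usable share of VRAM."""
    usable_gb = nameplate_vram_gb * USABLE_FRACTION
    return footprint_gb(params_billions, bytes_per_weight) <= usable_gb


if __name__ == "__main__":
    for size in (7, 14, 34, 70):
        print(f"{size}B at Q4_K_M: ~{footprint_gb(size):.1f} GB")
    print("7B on an 8 GB card:", fits(7, 8))
    print("70B on a 24 GB card:", fits(70, 24))
```

Passing a different bytes_per_weight shows how far one quant level moves a model; pairings that land within a gigabyte or so of the limit are worth testing on real hardware rather than trusting the estimate.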

What fits on the cards people own

| VRAM | Example cards | Comfortable at Q4_K_M |
|---|---|---|
| 8 GB | RTX 3060 Ti, 4060 | 7B–8B models, short context |
| 12 GB | RTX 3060 12GB, 4070 | 7B–13B comfortably |
| 16 GB | RTX 4060 Ti 16GB, 4080 | Up to ~14B with room; tight 22B |
| 24 GB | RTX 3090, 4090 | Up to ~34B comfortably; 70B only at low quant/short context |
| 48 GB | RTX A6000, L40S | 70B at Q4 with usable context |
| 64–192 GB | Apple M-series unified | 70B and beyond; slower, but it fits |

"Comfortable" leaves headroom for context. Longer prompts grow the KV cache and push these down a notch. The run-locally pages work out specific pairings.

Speed is set by memory bandwidth, not raw compute

Generating each token requires reading every weight the model uses from memory once, so for a dense model token throughput is roughly the GPU's memory bandwidth divided by the model's size in VRAM. That's why an H100 (~3.35 TB/s) is far faster than a 4090 (~1 TB/s) on the same model, and why Apple's large-but-slower unified memory fits big models yet runs them at a gentler pace. It also explains the quantization bonus: a more heavily quantized model is fewer bytes to read per token, so it's both smaller and faster. Treat any tokens-per-second estimate as a bandwidth ceiling: real throughput drops with long context, small batches, and CPU offload.
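
As a rough illustration of the bandwidth rule, here is a small sketch that divides memory bandwidth by model size. The function name is hypothetical, the bandwidth figures are the approximate ones quoted above, and the 5 GB model size is this guide's estimate for a 7B model at Q4_K_M. The result is an upper bound, not a benchmark.

```python
def tokens_per_second_ceiling(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream decode speed.

    Assumes every token needs one full read of the model from memory
    and ignores compute, KV-cache reads, and runtime overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_gb = 5.0  # roughly a 7B model at Q4_K_M, per the estimate above
    # Approximate bandwidths in GB/s, as quoted in the text.
    for name, bandwidth in (("RTX 4090", 1000.0), ("H100", 3350.0)):
        ceiling = tokens_per_second_ceiling(bandwidth, model_gb)
        print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Halving the model size doubles the ceiling, which is the quantization bonus expressed as arithmetic.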

Don't offload to system RAM if you can help it. When a model doesn't fit, runtimes spill layers to CPU memory, whose bandwidth is a fraction of the GPU's. The model runs, but throughput can fall tenfold. For interactive use, fitting entirely in VRAM — or renting a bigger GPU — beats offloading.
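
To see why even a small spill hurts, the sketch below models time per token as the sum of the time to read the GPU-resident bytes and the time to read the CPU-resident bytes. This is a deliberately simplified model: it ignores PCIe transfers and compute, and the 50 GB/s system-RAM bandwidth is a placeholder assumption, not a measurement of any particular machine.

```python
def offload_tokens_per_second(model_size_gb, gpu_fraction,
                              gpu_bw_gb_s, cpu_bw_gb_s):
    """Decode-speed ceiling when part of the model lives in system RAM.

    gpu_fraction is the share of the model's bytes held in VRAM (0.0 to 1.0).
    Each token reads the GPU share at GPU bandwidth and the rest at CPU
    bandwidth, one after the other.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_gb = model_size_gb * gpu_fraction
    cpu_gb = model_size_gb - gpu_gb
    seconds_per_token = gpu_gb / gpu_bw_gb_s + cpu_gb / cpu_bw_gb_s
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    GPU_BW = 1000.0  # GB/s, roughly a 4090 as quoted above
    CPU_BW = 50.0    # GB/s, assumed placeholder for desktop system RAM
    for fraction in (1.0, 0.9, 0.75, 0.5):
        rate = offload_tokens_per_second(20.0, fraction, GPU_BW, CPU_BW)
        print(f"{fraction:.0%} in VRAM: ~{rate:.1f} tokens/s ceiling")
```

Under these assumptions, moving a quarter of a 20 GB model to system RAM cuts the ceiling from about 50 tokens/s to under 9, because the slow portion dominates the time per token.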

A note on mixture-of-experts

Mixture-of-experts (MoE) models like the Mixtral family or DeepSeek's larger models have a high total parameter count but activate only a fraction per token. That makes them fast for their size — but all the weights still have to live in VRAM, because any expert can be needed on any token. So an MoE is sized by its total parameters for the fit question and by its active parameters for the speed question: heavy to load, light to run.
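
The two-number rule for MoE models can be sketched the same way: total parameters go into the fit estimate, active parameters into the speed ceiling. The parameter counts below (about 46.7B total, 12.9B active per token) are approximately those published for Mixtral 8x7B; the function names and the 1,000 GB/s bandwidth are illustrative assumptions, and the constants are this guide's Q4_K_M approximations.

```python
BYTES_PER_WEIGHT = 0.56  # approximate bytes per weight at Q4_K_M
OVERHEAD = 0.20          # ~20% for KV cache, activations, runtime


def moe_fit_gb(total_params_billions):
    """Memory needed to load the model: every expert must be resident."""
    return total_params_billions * BYTES_PER_WEIGHT * (1 + OVERHEAD)


def moe_speed_ceiling(active_params_billions, bandwidth_gb_s):
    """Decode ceiling: only the active weights are read for each token."""
    active_gb = active_params_billions * BYTES_PER_WEIGHT
    return bandwidth_gb_s / active_gb


if __name__ == "__main__":
    total_b, active_b = 46.7, 12.9  # approximate Mixtral 8x7B figures
    bandwidth = 1000.0              # GB/s, assumed
    print(f"Fit: ~{moe_fit_gb(total_b):.1f} GB")
    print(f"MoE ceiling: ~{moe_speed_ceiling(active_b, bandwidth):.0f} tokens/s")
    # A dense model of the same total size reads all of its weights per token.
    print(f"Dense ceiling: ~{moe_speed_ceiling(total_b, bandwidth):.0f} tokens/s")
```

With these inputs the model needs roughly 31 GB to load yet has a speed ceiling several times higher than a dense model of the same total size.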

Context length eats memory too

The weights are only part of the footprint. The KV cache — the running memory of the conversation — grows with every token in the context window, and on a long context it can consume several gigabytes on top of the model. That's why a model that fits comfortably with a 4K prompt can run out of memory at 32K, and why our estimates add headroom rather than quoting weights alone. If you're tight, you have levers: cap the context length you allow, use a runtime with KV-cache quantization (storing the cache at lower precision), or drop the model one quant level to buy back the room. Plan for the context you'll actually use, not just the weights — it's the difference between a configuration that loads and one that loads and then crashes mid-conversation.
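
For a standard transformer, the KV cache stores one key and one value vector per layer, per KV head, per token, so its size can be estimated directly. The sketch below uses an illustrative configuration (32 layers, 8 KV heads, head dimension 128) typical of an 8B-class model with grouped-query attention; those numbers are assumptions, so substitute the values from your model's config.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim,
                   bytes_per_element=2):
    """Approximate KV-cache size in bytes for a single sequence.

    The leading 2 accounts for one key and one value tensor per layer.
    bytes_per_element is 2 for a 16-bit cache, 1 for an 8-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens


if __name__ == "__main__":
    # Assumed, illustrative config for an 8B-class model.
    config = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for n_tokens in (4096, 32768):
        fp16 = kv_cache_bytes(n_tokens, **config) / 1e9
        int8 = kv_cache_bytes(n_tokens, bytes_per_element=1, **config) / 1e9
        print(f"{n_tokens} tokens: ~{fp16:.2f} GB at 16-bit, "
              f"~{int8:.2f} GB at 8-bit")
```

Under these assumptions the cache grows from about half a gigabyte at 4K tokens to more than 4 GB at 32K, and storing it at 8 bits halves that. Models without grouped-query attention have more KV heads and a proportionally larger cache.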

The software stack

  • Ollama — the simplest start. One command to pull and run a model; sensible defaults; great for a single user. Built on llama.cpp.
  • llama.cpp — the engine underneath much of the ecosystem. Maximum control, GGUF quants, CPU+GPU splitting. Worth it when you want to tune.
  • LM Studio — a friendly desktop GUI over the same engine, with model browsing and a local API server. Good for experimenting without the terminal.
  • vLLM / TGI — high-throughput servers for many concurrent users, GPU-only, using AWQ or GPTQ quants. This is the production-serving tier, not the hobbyist one.

Local vs. cloud: do the break-even

Owning a GPU is fixed cost; an API is variable cost. Local wins when volume is steady and high enough that the hardware pays for itself, when privacy is mandatory, or when you need to run offline. A hosted API wins for spiky or low volume, for access to frontier models you can't run at home, and for not maintaining hardware. And there's a middle path: rent a cloud GPU by the hour for the occasional big-model run, so you pay for capability only when you use it. When a model doesn't fit your card, our run-locally pages point to the smallest GPU that does — and a way to rent it.
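
The break-even itself is one division: hardware cost over the monthly saving from not paying the API. The sketch below is a simplification with entirely hypothetical prices; it ignores resale value, your time, and any quality gap between the local and hosted model.

```python
def break_even_months(hardware_cost, monthly_api_cost, monthly_running_cost):
    """Months until owned hardware pays for itself, or None if it never does.

    monthly_api_cost is what the same workload would cost on a hosted API;
    monthly_running_cost covers electricity and upkeep for the local machine.
    """
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None  # at this volume the API is cheaper indefinitely
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # All figures are made-up examples in the same currency.
    print(break_even_months(1600, 200, 40))  # steady, high volume
    print(break_even_months(1600, 20, 40))   # low volume
```

With these example numbers the steady workload pays off the hardware in 10 months, while the low-volume one never does, which is the spiky-or-low-volume case where a hosted API wins.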

Frequently asked questions

What's the best GPU for running LLMs locally?

For most people it's a used RTX 3090 or a 4090 — both have 24 GB of VRAM, which is the practical sweet spot: it runs strong 7B–34B models comfortably and gets a 70B going at low quantization with a short context. The 3090 is far cheaper second-hand and nearly as fast for inference because both have similar memory bandwidth. Below that, 12–16 GB cards handle 7B–14B models well. Above it, you're into workstation cards or multiple GPUs.

Can I run a 70B model at home?

Yes, but with caveats. A 70B at Q4_K_M needs about 47 GB of VRAM including overhead — more than any single consumer card. You can run it across two 24 GB GPUs, on a 48 GB workstation card, on an Apple machine with 64 GB+ of unified memory (slower), or by offloading some layers to system RAM (much slower). For interactive speed, fitting it entirely in GPU memory — or renting a cloud GPU by the hour — is usually worth it.

Is Apple Silicon good for local LLMs?

Surprisingly yes, for capacity. A Mac with 64–192 GB of unified memory can hold models that no single consumer GPU can, because the memory is shared with the GPU. The tradeoff is bandwidth: Apple's memory is slower than a high-end NVIDIA card's, so large models run but generate tokens more slowly. It's excellent for running big models at usable-if-not-blazing speed on a quiet machine.
