How much VRAM do I need to run an 8B model?

About 6–8 GB at 4-bit, including room for context. An 8 GB card runs an 8B model comfortably for chat, coding help, and agents.

How much VRAM does a 70B model need?

Roughly 40–44 GB at 4-bit with usable context. That means a 48 GB setup, or a single 24 GB card only at very short context.

Can 24 GB of VRAM run a 70B model?

Just barely, at 4-bit and short context. A 24 GB card is the sweet spot for everything up to ~32B comfortably; a 70B fits only in a pinch.

Does context length affect VRAM usage?

Yes. The KV cache grows with context length, so long chats and large documents need several extra GB on top of the model weights.

How Much VRAM Do You Need to Run Llama, Qwen, and DeepSeek?

VRAM is the gatekeeper of local AI. Run out of it and the model either won’t load or spills into painfully slow system RAM. The good news: with 4-bit quantization (the standard for local use), the math is simple and forgiving. Get this one number right and you’ll never overspend on a card, or buy one that can’t run what you need.

The rule of thumb

VRAM (GB) ≈ parameters in billions ÷ 1.5 + 2–4 GB for context

A 4-bit weight is half a byte, so a 13B model is ~6.5 GB of weights, plus overhead for the KV cache (which grows with context length). Round up and leave headroom.

The VRAM table (4-bit, Q4_K_M)

Model size	Weights ≈	+ context	Fits on
3B	~2 GB	~3 GB	any modern GPU, even 6 GB
8B	~5 GB	~6–8 GB	8 GB card and up
14B	~8 GB	~10–12 GB	12 GB card and up
32B	~18 GB	~20–22 GB	24 GB card
70B	~38 GB	~40–44 GB	48 GB, or 24 GB at short context

Quantization: the dial that sets your VRAM bill

“4-bit” isn’t the only option: quantization is a quality-vs-size trade-off, and choosing the right level is how you fit a bigger, smarter model on the card you already own:

Quant	Bits/weight	VRAM vs FP16	Quality
Q8_0	8	~50%	near-lossless; rarely worth the size
Q6_K	6	~37%	excellent; the quality-conscious pick
Q4_K_M	~4.5	~28%	the sweet spot: minimal quality loss, big savings
Q3_K_M	~3.5	~22%	noticeable degradation; only when desperate

For almost everyone, Q4_K_M is the right default: you keep ~95%+ of the model’s quality at roughly a quarter of the memory. Step up to Q6 only if you have VRAM to spare and want the last few percent; drop to Q3 only to squeeze a model that otherwise won’t fit at all.

Don’t forget the KV cache (the hidden VRAM eater)

Weights are fixed, but the KV cache grows with how much text the model is holding in context, and it can be surprisingly large. A rough feel: a 7B model at 8K context adds ~1 GB of cache; push to 32K context and that can be several GB. This is why two people running “the same model” can have very different VRAM needs.

Practical levers to control it:

Cap your context to what you actually use. A 32K window you never fill just wastes VRAM.
Enable KV-cache quantization (most runtimes support 8-bit cache) to roughly halve it.
Budget for it: when sizing a card, add context headroom on top of the weights, never just the weights alone.

So what should you buy?

8 GB (e.g. RTX 3060): great for 8B models: plenty for chat, coding help, agents.
12–16 GB: comfortable up to 14B.
24 GB (RTX 3090/4090): the sweet spot: runs up to 32B easily and a 70B in a pinch.
48 GB (2× 24 GB): the home ceiling: runs a 70B at full context comfortably.

24 GB sweet spot

NVIDIA GeForce RTX 3090

24 GiB VRAM
350 W TDP
936 GB/s
2020

~$700 street price

Check price

24 GB is the single best capacity-per-dollar for local AI. It covers everything most people will ever run, including quantized 70B models at modest context. A used one is the card we’d buy first.

How to check what you can actually run

Don’t guess: measure. Two quick ways:

Before downloading: apply the rule of thumb (billions ÷ 1.5 + context). If the total is under your card’s VRAM with a couple of GB to spare, it’ll fit.
While running: watch nvidia-smi (or your runtime’s memory readout) as you load the model and grow the context. If usage creeps toward your card’s limit, lower the context window or step down a quant before you hit an out-of-memory error.

Tools like Ollama and LM Studio will also warn you (or silently offload to RAM) when a model won’t fit; see our tools comparison for how each handles it.

A few more gotchas

Higher quants need more. Q6 or Q8 give slightly better quality but use more memory.
Two cards add up. 2× 24 GB = 48 GB usable for the biggest models, if your software supports splitting, often a smarter buy than one faster card.
Offloading to RAM works, slowly. If a model almost fits, runtimes can push a few layers to system RAM. It runs, but expect a big speed hit; treat it as a stopgap.

Match the model to the card, not the hype. Most people are happiest running a fast 8B–14B on a card they already have than struggling to fit a 70B they don’t need. Once you know your number, the cheapest-way guide shows exactly what to buy.

How Much VRAM Do You Need to Run Llama, Qwen, and DeepSeek?

The rule of thumb

The VRAM table (4-bit, Q4_K_M)

Quantization: the dial that sets your VRAM bill

Don’t forget the KV cache (the hidden VRAM eater)

So what should you buy?

NVIDIA GeForce RTX 3090

How to check what you can actually run

A few more gotchas

Gear mentioned in this post

Frequently asked questions

Related reading

The Cheapest Way to Run a Local LLM in 2026

RTX 3090 vs RTX 4090 for Local AI: Which Should You Buy in 2026?

The rule of thumb

The VRAM table (4-bit, Q4_K_M)

Quantization: the dial that sets your VRAM bill

Don’t forget the KV cache (the hidden VRAM eater)

So what should you buy?

NVIDIA GeForce RTX 3090

How to check what you can actually run

A few more gotchas

Gear mentioned in this post

Frequently asked questions

Get tested, not hyped.

Related reading

The Cheapest Way to Run a Local LLM in 2026

RTX 3090 vs RTX 4090 for Local AI: Which Should You Buy in 2026?