The Memory Wall: A Field Guide to LLM Inference on Consumer Hardware

The Central Fact

Autoregressive token generation reads the entire model from memory for every single token it produces. Not once per request — once per token.

An 8B model at Q4_K_M quantization is roughly 4.5 GB of weight data. For each token your GPU generates, it streams those 4.5 GB through its memory hierarchy. At 1,000 GB/s bandwidth, that is 4.5 ms per token — a theoretical ceiling of about 220 tokens per second. The actual ceiling is lower because real memory utilization on LLM workloads lands at 60–85% of the nominal bandwidth figure.

This is the memory wall. It is not a bug. It is the arithmetic of autoregressive decoding.

The implication: on any consumer hardware, improving inference speed means either (1) reducing how much weight data is streamed per token, or (2) increasing memory bandwidth. Everything else — faster CPUs, more CUDA cores, larger GPUs — is secondary until you have addressed the memory bandwidth constraint.

The Memory Math

What Fits

The dominant term in memory consumption is model weights:


weight_bytes = num_parameters × bytes_per_weight


Bytes per weight by quantization format:
| Format | Bytes/param | Notes |
|--------|-------------|-------|
| FP32   | 4.0         | Training reference; rare for inference |
| BF16/FP16 | 2.0     | Default for full-precision inference |
| Q8_0   | 1.0         | 8-bit; near-lossless |
| Q6_K   | 0.75        | 6-bit; excellent quality |
| Q5_K_M | 0.625       | 5-bit; strong quality-size balance |
| Q4_K_M | 0.5         | 4-bit mixed; the practical standard |
| Q3_K_M | 0.375       | 3-bit; quality starts to degrade noticeably |
| Q2_K   | 0.25        | 2-bit; use only where nothing else fits |
The rule of thumb for Q4_K_M: model size in GB ≈ parameters in billions ÷ 2. A 7B model is ~4.1 GB. A 13B is ~8 GB. A 70B is ~40 GB. These numbers are for weights only.
Total VRAM required:


total_vram ≈ weight_bytes + kv_cache_bytes + 1–2 GB overhead


The KV cache is a secondary but non-trivial cost. At 4K context, budget 10–20% on top of weights. At 32K context, the KV cache can rival the weights for large models.
What Controls Speed
Decode tokens per second scales approximately as:


tok/s ≈ (memory_bandwidth_GBps × utilization) / weight_bytes_GB


Real GPU workloads achieve 60–75% utilization on llama.cpp, 70–85% on well-tuned MLX kernels.
This formula has two inputs you control: quantization (weight_bytes) and hardware selection (bandwidth). The formula has one input you cannot control: the bandwidth ceiling of your device.
Consumer Hardware Landscape (2026)
Apple Silicon
Apple's unified memory architecture is structurally different from discrete GPU systems. The CPU, GPU, and Neural Engine share a single memory pool with a single high-bandwidth bus. There is no VRAM separate from RAM — the full 16 GB, 32 GB, 64 GB, or 96 GB is available to the inference engine.
Bandwidth figures by chip:
| Chip | Memory bandwidth | Max unified RAM |
|------|-----------------|-----------------|
| M4 Pro | 273 GB/s | 64 GB |
| M4 Max | 546 GB/s | 128 GB |
| M3 Ultra | 819 GB/s | 192 GB |
| M2 Ultra | 800 GB/s | 192 GB |
Practical decode speeds on llama.cpp with Q4_K_M:
M4 Pro, 24 GB, Llama 3.1 8B: ~60–80 tok/s
M4 Max, 64 GB, Llama 3.1 70B: ~14–18 tok/s
M3 Ultra, 192 GB, Llama 3.1 8B: ~145 tok/s

The key advantage of unified memory is not bandwidth — it is capacity. An M2 Pro with 32 GB can run Llama 3.1 70B Q4_K_M at ~8 tok/s. No consumer NVIDIA GPU can do this without CPU offload (which collapses throughput to 2–4 tok/s due to PCIe bottlenecks).
Engine choice on Apple Silicon:
llama.cpp (Metal backend): broadest model and quantization support; best operational stability; recommended for single-user or sovereign deployments.
MLX / mlx-lm: Apple's own ML framework, purpose-built for the M-series; achieves 21–87% higher throughput than llama.cpp Metal on many models; better for multi-user serving.
Ollama: wraps llama.cpp; slightly lower throughput (roughly 10–20% behind bare llama.cpp); recommended when developer experience and API compatibility matter more than raw speed.
vLLM (experimental Metal/MLX backend): continuous batching for multi-user workloads; still maturing on Apple Silicon.

A May 2026 benchmark on M5 Max (128 GB) found llama.cpp outperforming MLX by 10–24% for specific coding models — the "MLX is always faster on Mac" claim is not universal and depends on model architecture.
NVIDIA Consumer GPUs
NVIDIA's consumer line uses GDDR6X or GDDR7 memory with discrete VRAM.
Key cards for LLM inference (2026):
| Card | VRAM | Bandwidth | Practical use |
|------|------|-----------|---------------|
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | 13B–34B Q4; competitive on bandwidth |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 13B–34B Q4; ~30 tok/s on 13B Q4 |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 34B Q4; ~161 tok/s on 8B 6-bit |
| RTX 4060 Ti 16 GB | 16 GB GDDR6 | 288 GB/s | 7B–8B Q4; budget single-card |
The RTX 3090 remains competitive in 2026 because bandwidth, not compute, is the bottleneck. It reaches 936 GB/s — nearly identical to the RTX 4090 — and frequently matches the newer card on tokens per second for memory-bound workloads. The upgrade case for 3090→4090 on inference alone is weak.
The hard performance discontinuity is not between card generations — it is the capacity cliff. A 12 GB card that runs a Q4_K_M 7B model fully in VRAM outperforms a 24 GB card that must offload to system RAM, even if both have identical bandwidth. PCIe 4.0 x16 delivers ~32 GB/s; GPU memory delivers 900+ GB/s. The moment a model overflows VRAM, decode throughput falls by 20–50× on large offloaded layers.
Engine choice on NVIDIA consumer hardware:
llama.cpp (CUDA backend): single-user workloads; excellent quantization support; partial CPU offload when models exceed VRAM.
Ollama: wraps llama.cpp CUDA; recommended for API-compatible local serving.
vLLM: designed for multi-user throughput; most effective at 24 GB+ VRAM; PagedAttention and continuous batching yield 20–30× higher throughput than naive single-request serving at scale.

CPU-Only Inference
CPU inference via llama.cpp is viable for 7B models at Q4_K_M on modern CPUs with sufficient RAM, but speed is typically 2–8 tok/s — usable for background tasks, not interactive sessions.
The bottleneck is memory bandwidth. A modern DDR5-6400 CPU system delivers ~100–150 GB/s. At that bandwidth, a 4.5 GB Q4_K_M 7B model runs at roughly 15–22 tok/s theoretical maximum; real-world is 5–10 tok/s due to NUMA effects, OS overhead, and memory controller inefficiency.
CPU inference is not a fallback strategy for large models. A 70B Q4_K_M model at ~40 GB weight data runs at 2–4 tok/s on the fastest consumer CPUs. That is the arithmetic, not a software limitation.
Practical Sizing Guide
Decision sequence:
1. Does the model fit entirely in VRAM (or unified memory)? If not, either quantize down or choose different hardware. Partial offload collapses throughput.
2. What quantization preserves acceptable quality for your use case? Q4_K_M is the de facto standard — ~1–3% quality degradation on benchmarks, 75% VRAM reduction versus FP16. Q5_K_M or Q6_K if quality is paramount. Q3_K_M only when nothing else fits.

3. What is the bandwidth ceiling of your hardware? tok/s ≈ (bandwidth × utilization) / weight_GB`. Estimate before you buy.

4. Single-user or multi-user? Single-user: llama.cpp or Ollama. Multi-user at scale: vLLM (NVIDIA) or MLX vLLM plugin (Apple Silicon).

Worked examples:

7B model, interactive single-user session:

RTX 4090 (24 GB, 1,008 GB/s): Q4_K_M, ~4.5 GB weights → ~150 tok/s theoretical, ~90–120 tok/s real-world. Excellent.
M4 Pro (24 GB, 273 GB/s): same quantization → ~40–60 tok/s. Comfortable for interactive use.
CPU only (DDR5-6400, ~130 GB/s): ~15–25 tok/s theoretical, ~8–12 tok/s real. Marginal.

70B model:

M2 Pro 32 GB: Q4_K_M fits (~40 GB weights + overhead exceeds 32 GB). Must use Q3_K_M (~26 GB). ~5–7 tok/s. Works but not fluent.
M4 Max 64 GB: Q4_K_M fits comfortably. ~14–18 tok/s. Usable for serious work.
RTX 4090 24 GB: does not fit. Must offload or use Q2_K (~18 GB weights, significant quality loss). Not recommended.
2× RTX 3090 48 GB: Q4_K_M fits across split. llama.cpp multi-GPU support enables this. ~10–15 tok/s combined.

The Quantization Decision

Quantization is not uniformly lossy. The degradation is model-specific, task-specific, and layer-specific. Q4_K_M uses mixed-precision: attention layers stay at a higher bit depth than FFN layers, because attention weights are more sensitive to precision loss.

The practical hierarchy:

Q8_0: use when VRAM is ample and you want the closest-to-FP16 experience.
Q6_K: the quality-size sweet spot for most use cases. ~0.75 bytes/param.
Q5_K_M or Q4_K_M: standard range for consumer hardware. Q4_K_M is the most widely tested.
Q3_K_M: use only when Q4_K_M does not fit. Measurable quality degradation on reasoning tasks.
Q2_K or IQ2: last resort. Suitable only for tasks where fluency matters more than precision.

The decision formula: start at Q4_K_M. Move up if you have VRAM headroom and quality is critical. Move down only if the model does not fit, and accept the quality penalty explicitly — do not move down hoping the degradation is invisible.

What This Means for Self-Hosted AI

The memory wall is not a temporary limitation. It is the architectural consequence of autoregressive decoding. It will not be solved by faster CUDA cores. It will be partially addressed by:

Speculative decoding: running a small draft model to propose multiple tokens in parallel, then verifying with the target model in a single forward pass. Reduces effective memory reads per output token by 2–4× when draft acceptance is high.
Better quantization formats: IQ (importance-aware quantization) schemes that allocate bits non-uniformly based on parameter sensitivity.
Architecture changes: MoE models activate a fraction of parameters per token, reducing effective weight streaming. Deepseek-R1-671B at Q4_K_M streams ~40 GB per token versus a dense 70B model's 40 GB — same bandwidth consumption with ten times the parameter count.

None of these eliminate the wall. They work within it.

Summary

| Question | Answer |

|----------|--------|

| What bottlenecks LLM decode speed? | Memory bandwidth — not compute |

| How much VRAM does a 7B Q4_K_M model need? | ~5–6 GB (weights + KV cache + overhead) |

| What happens when the model overflows VRAM? | Throughput drops 20–50× |

| Best consumer GPU for inference? | RTX 3090/4090 for NVIDIA; M4 Max for Apple |

| What quantization for daily use? | Q4_K_M as default; Q5_K_M or Q6_K if VRAM allows |

| When to use vLLM vs llama.cpp? | vLLM for multi-user throughput; llama.cpp for single-user or sovereign |

The memory wall is the first thing to understand about consumer inference. Everything else — engine choice, quantization strategy, hardware selection — is a derivation from that single fact.

References

1. Wallace, E. et al., 2024. "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions." arXiv:2404.13208. OpenAI. https://arxiv.org/html/2404.13208v1

2. Locara, 2026. "LLM Memory Math — Parameters, KV Cache, Bandwidth, and What Actually Fits." https://locara.dev/docs/notes/llm-memory-math

3. AI/TLDR, 2026. "LLM Hardware Guide: How Much RAM and VRAM You Need." https://ai-tldr.dev/learn/local-open-models/running-models-locally/local-llm-hardware-requirements/

4. Hardware Corner, 2026. "How Memory Chips Determine GPU Memory Bandwidth for Local LLM Inference." https://www.hardware-corner.net/gddr-chips-and-llm-bandwidth/

5. ai.rs, 2026. "The GPU Memory Wall: Why Inference Hardware Matters." https://ai.rs/ai-developer/gpu-memory-wall-inference-hardware

6. Hiesch, A., 2026. "llama.cpp on Apple Silicon: 29 GGUF Benchmarks and a 200 t/s Surprise." https://hiesch.eu/blog/llamacpp-benchmarks-speculative-decoding/

7. arXiv:2601.19139, 2026. "Native LLM and MLLM Inference at Scale on Apple Silicon." https://doi.org/10.48550/arxiv.2601.19139

8. arXiv:2605.00519, 2026. "Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference." https://arxiv.org/html/2605.00519

9. Khamdee, P., 2026. "How Much VRAM Does Your LLM Actually Need? A Field Guide to Sizing GPUs." https://pkhamdee.blog/2026/06/17/how-much-vram-does-your-llm-actually-need-a-field-guide-to-sizing-gpus/

10. RunPod, 2026. "GPU Memory Sizing Guide for LLM Inference." https://www.runpod.io/articles/guides/gpu-memory-sizing-guide-for-llm-inference

11. Contracollective, 2026. "llama.cpp vs MLX vs Ollama vs vLLM: Local AI Inference for Apple Silicon in 2026." https://contracollective.com/blog/llama-cpp-vs-mlx-ollama-vllm-apple-silicon-2026