The Memory Wall: A Field Guide to LLM Inference on Consumer Hardware

whitepaper | devinfo.dev | June 29, 2026 | devinfo.dev:2026.0049

LLM inference is not compute-bound. It is memory-bandwidth-bound. Understanding that single fact — and the arithmetic that follows from it — determines every sensible hardware and quantization decision you will make when running models on consumer devices.

The Central Fact

Autoregressive token generation reads the entire model from memory for every single token it produces. Not once per request — once per token.

An 8B model at Q4_K_M quantization is roughly 4.5 GB of weight data. For each token your GPU generates, it streams those 4.5 GB through its memory hierarchy. At 1,000 GB/s bandwidth, that is 4.5 ms per token — a theoretical ceiling of about 220 tokens per second. The actual ceiling is lower because real memory utilization on LLM workloads lands at 60–85% of the nominal bandwidth figure.

This is the memory wall. It is not a bug. It is the arithmetic of autoregressive decoding.

The implication: on any consumer hardware, improving inference speed means either (1) reducing how much weight data is streamed per token, or (2) increasing memory bandwidth. Everything else — faster CPUs, more CUDA cores, larger GPUs — is secondary until you have addressed the memory bandwidth constraint.

The Memory Math

What Fits

The dominant term in memory consumption is model weights:

```
weight_bytes = num_parameters × bytes_per_weight
```

Bytes per weight by quantization format:

| Format | Bytes/param | Notes |
|--------|-------------|-------|
| FP32 | 4.0 | Training reference; rare for inference |
| BF16/FP16 | 2.0 | Default for full-precision inference |
| Q8_0 | 1.0 | 8-bit; near-lossless |
| Q6_K | 0.75 | 6-bit; excellent quality |
| Q5_K_M | 0.625 | 5-bit; strong quality-size balance |
| Q4_K_M | 0.5 | 4-bit mixed; the practical standard |
| Q3_K_M | 0.375 | 3-bit; quality starts to degrade noticeably |
| Q2_K | 0.25 | 2-bit; use only where nothing else fits |

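The rule of thumb for Q4_K_M: model size in GB ≈ parameters in billions ÷ 2. Treat that as a floor. Real Q4_K_M files come out somewhat larger, because the format stores per-block scales and keeps some tensors at higher precision: a 7B model is ~4.1 GB, a 13B is ~8 GB, and a 70B is ~40 GB. These numbers are for weights only.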
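The weight formula and the table can be turned into a small lookup. This is a minimal sketch: `BYTES_PER_PARAM` and `weight_gb` are illustrative names, not part of any library, and the values are the nominal bytes per parameter from the table, so the results sit below the real file sizes quoted above.

```python
# Nominal bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16/FP16": 2.0,
    "Q8_0": 1.0,
    "Q6_K": 0.75,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Nominal weight size in GB (1 GB = 1e9 bytes) for a quantization format."""
    # Billions of parameters × bytes per parameter = billions of bytes = GB.
    return params_billion * BYTES_PER_PARAM[fmt]


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B at Q4_K_M: {weight_gb(size, 'Q4_K_M'):.1f} GB nominal")
```

For 7B, 13B, and 70B this gives the nominal floor for each model; budget roughly 10–20% more for an actual Q4_K_M file.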

Total VRAM required:

```
total_vram ≈ weight_bytes + kv_cache_bytes + 1–2 GB overhead
```

The KV cache is a secondary but non-trivial cost. At 4K context, budget 10–20% on top of weights. At 32K context, the KV cache can rival the weights for large models.
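The KV cache term can be estimated from the model shape. The sketch below uses the standard per-token accounting (one key and one value vector per layer per KV head). The function names are illustrative, and the example shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption standing in for an 8B-class model with grouped-query attention, not a figure from this paper.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a single sequence."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


def total_vram_gb(weight_gb: float, kv_gb: float, overhead_gb: float = 1.5) -> float:
    """Weights + KV cache + runtime overhead (the 1-2 GB term in the formula)."""
    return weight_gb + kv_gb + overhead_gb


if __name__ == "__main__":
    # Assumed 8B-class shape; adjust to the model you actually run.
    for ctx in (4096, 32768):
        kv = kv_cache_gb(32, 8, 128, ctx)
        print(f"context {ctx}: KV cache {kv:.2f} GB, "
              f"total {total_vram_gb(4.5, kv):.2f} GB with 4.5 GB of weights")
```

Under these assumptions the cache is about 0.54 GB at 4K context (roughly 12% of a 4.5 GB weight file) and about 4.3 GB at 32K, which is why long contexts change the sizing picture.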

What Controls Speed

Decode tokens per second scales approximately as:

```
tok/s ≈ (memory_bandwidth_GBps × utilization) / weight_bytes_GB
```

Real GPU workloads achieve 60–75% utilization on llama.cpp, 70–85% on well-tuned MLX kernels.

This formula has two inputs you control: quantization (weight_bytes) and hardware selection (bandwidth). Once the hardware is chosen, its bandwidth ceiling is fixed, and quantization is the only lever left.
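The decode formula is short enough to keep as a helper when comparing hardware. A minimal sketch, with `decode_tok_per_s` as an illustrative name; the example reuses the 1,000 GB/s and 4.5 GB figures from the opening section and the 60–75% utilization range quoted for llama.cpp.

```python
def decode_tok_per_s(bandwidth_gbps: float, weight_gb: float,
                     utilization: float = 1.0) -> float:
    """Estimated decode speed: effective bandwidth divided by bytes streamed per token."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return bandwidth_gbps * utilization / weight_gb


if __name__ == "__main__":
    ceiling = decode_tok_per_s(1000, 4.5)
    low = decode_tok_per_s(1000, 4.5, utilization=0.60)
    high = decode_tok_per_s(1000, 4.5, utilization=0.75)
    print(f"theoretical ceiling: {ceiling:.0f} tok/s")
    print(f"at 60-75% utilization: {low:.0f}-{high:.0f} tok/s")
```

The estimate ignores KV cache reads and prompt processing, so treat it as an upper bound on single-stream decode rather than a benchmark prediction.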

Consumer Hardware Landscape (2026)

Apple Silicon

Apple's unified memory architecture is structurally different from discrete GPU systems. The CPU, GPU, and Neural Engine share a single memory pool with a single high-bandwidth bus. There is no VRAM separate from RAM: most of the 16 GB, 32 GB, 64 GB, or 96 GB pool is available to the inference engine, minus what the operating system and other applications hold.

Bandwidth figures by chip:

| Chip | Memory bandwidth | Max unified RAM |
|------|-----------------|-----------------|
| M4 Pro | 273 GB/s | 64 GB |
| M4 Max | 546 GB/s | 128 GB |
| M3 Ultra | 819 GB/s | 512 GB |
| M2 Ultra | 800 GB/s | 192 GB |

Practical decode speeds on llama.cpp with Q4_K_M:

The key advantage of unified memory is not bandwidth. It is capacity. An M4 Max with 64 GB or more holds Llama 3.1 70B Q4_K_M (~40 GB of weights) entirely in unified memory, where the bandwidth formula puts decode at roughly 8–10 tok/s. No consumer NVIDIA GPU can do this without CPU offload (which collapses throughput to 2–4 tok/s due to PCIe bottlenecks).

Engine choice on Apple Silicon:

A May 2026 benchmark on M5 Max (128 GB) found llama.cpp outperforming MLX by 10–24% for specific coding models — the "MLX is always faster on Mac" claim is not universal and depends on model architecture.

NVIDIA Consumer GPUs

NVIDIA's consumer line uses GDDR6, GDDR6X, or GDDR7 memory with discrete VRAM.

Key cards for LLM inference (2026):

| Card | VRAM | Bandwidth | Practical use |
|------|------|-----------|---------------|
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | 13B–34B Q4; competitive on bandwidth |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 13B–34B Q4; ~30 tok/s on 13B Q4 |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 34B Q4; ~161 tok/s on 8B 6-bit |
| RTX 4060 Ti 16 GB | 16 GB GDDR6 | 288 GB/s | 7B–8B Q4; budget single-card |

The RTX 3090 remains competitive in 2026 because bandwidth, not compute, is the bottleneck. It reaches 936 GB/s — nearly identical to the RTX 4090 — and frequently matches the newer card on tokens per second for memory-bound workloads. The upgrade case for 3090→4090 on inference alone is weak.

The hard performance discontinuity is not between card generations. It is the capacity cliff. A 12 GB card running a model that fits fully in VRAM (a Q4_K_M 7B, for example) outperforms a 24 GB card running a model that must offload part of its weights to system RAM, even if both cards have identical bandwidth. PCIe 4.0 x16 delivers ~32 GB/s; GPU memory delivers 900+ GB/s. The moment a model overflows VRAM, decode throughput on the offloaded layers falls by 20–50×.
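The cliff shows up clearly in a two-tier model of per-token time. This is a deliberately simplified sketch, not how any particular engine schedules offloaded layers: it assumes the weights that fit in VRAM stream at GPU bandwidth and the remainder streams at the slower link bandwidth, and it ignores KV cache, overhead, and utilization. `offload_tok_per_s` is an illustrative name.

```python
def offload_tok_per_s(weight_gb: float, vram_gb: float,
                      gpu_bw_gbps: float, slow_bw_gbps: float) -> float:
    """Decode speed when weights beyond vram_gb are read over a slower path."""
    in_vram = min(weight_gb, vram_gb)
    spilled = weight_gb - in_vram
    # Per-token time is the sum of streaming each tier once.
    seconds_per_token = in_vram / gpu_bw_gbps + spilled / slow_bw_gbps
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Figures from the text: 936 GB/s GPU memory, ~32 GB/s PCIe 4.0 x16.
    fits = offload_tok_per_s(4.1, 12, 936, 32)    # 7B Q4_K_M on a 12 GB card
    spills = offload_tok_per_s(40, 24, 936, 32)   # 70B Q4_K_M on a 24 GB card
    print(f"fully in VRAM: {fits:.0f} tok/s (raw bandwidth bound)")
    print(f"16 GB spilled: {spills:.1f} tok/s")
```

In the second case the 16 GB that spills accounts for almost all of the per-token time, which is the point of the capacity cliff: the slow tier dominates as soon as it is non-empty.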

Engine choice on NVIDIA consumer hardware:

CPU-Only Inference

CPU inference via llama.cpp is viable for 7B models at Q4_K_M on modern CPUs with sufficient RAM, but speed is typically 2–8 tok/s — usable for background tasks, not interactive sessions.

The bottleneck is memory bandwidth. A dual-channel DDR5-6400 system peaks at roughly 100 GB/s. At that bandwidth, a ~4.1 GB Q4_K_M 7B model has a theoretical maximum of about 24 tok/s; real-world throughput lands in the single digits due to NUMA effects, OS overhead, and memory controller inefficiency.

CPU inference is not a fallback strategy for large models. A 70B Q4_K_M model at ~40 GB of weight data has a bandwidth ceiling of about 2.5 tok/s on the same system and runs slower than that in practice. That is the arithmetic, not a software limitation.

Practical Sizing Guide

Decision sequence:

1. Does the model fit entirely in VRAM (or unified memory)? If not, either quantize down or choose different hardware. Partial offload collapses throughput.

2. What quantization preserves acceptable quality for your use case? Q4_K_M is the de facto standard — ~1–3% quality degradation on benchmarks, 75% VRAM reduction versus FP16. Q5_K_M or Q6_K if quality is paramount. Q3_K_M only when nothing else fits.

3. What is the bandwidth ceiling of your hardware? `tok/s ≈ (bandwidth × utilization) / weight_GB`. Estimate before you buy.

4. Single-user or multi-user? Single-user: llama.cpp or Ollama. Multi-user at scale: vLLM (NVIDIA) or MLX vLLM plugin (Apple Silicon).
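Steps 1 and 3 of the sequence above can be run as a quick pre-purchase check. A minimal sketch: `Estimate` and `size_check` are illustrative names, the 0.7 utilization default is a midpoint of the ranges quoted earlier, and the 1.5 GB overhead default is the middle of the 1–2 GB budget.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    fits: bool
    required_gb: float
    tok_per_s: float  # 0.0 when the model does not fit


def size_check(weight_gb: float, kv_gb: float, memory_gb: float,
               bandwidth_gbps: float, utilization: float = 0.7,
               overhead_gb: float = 1.5) -> Estimate:
    """Step 1: does it fit? Step 3: what decode speed does the bandwidth allow?"""
    required = weight_gb + kv_gb + overhead_gb
    fits = required <= memory_gb
    # No speed estimate for a model that would have to offload.
    tok_per_s = bandwidth_gbps * utilization / weight_gb if fits else 0.0
    return Estimate(fits, required, tok_per_s)


if __name__ == "__main__":
    # 8B Q4_K_M (~4.5 GB) on a 16 GB, 288 GB/s card; KV budget assumed at 0.5 GB.
    print(size_check(4.5, 0.5, memory_gb=16, bandwidth_gbps=288))
    # 70B Q4_K_M (~40 GB) on a 24 GB, 936 GB/s card; KV budget assumed at 2 GB.
    print(size_check(40, 2, memory_gb=24, bandwidth_gbps=936))
```

Returning a zero speed for the non-fitting case is a design choice that mirrors step 1: partial offload is treated as a failed configuration rather than a slower one.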

Worked examples:

7B model, interactive single-user session:

70B model:

The Quantization Decision

Quantization is not uniformly lossy. The degradation is model-specific, task-specific, and layer-specific. Q4_K_M is a mixed-precision format: most tensors are stored at 4 bits, while a subset of the more precision-sensitive tensors (in llama.cpp, part of the attention value and feed-forward down projections) is kept at 6 bits.

The practical hierarchy:

The decision formula: start at Q4_K_M. Move up if you have VRAM headroom and quality is critical. Move down only if the model does not fit, and accept the quality penalty explicitly — do not move down hoping the degradation is invisible.
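The decision formula can be sketched as a walk along the quantization ladder. The names `LADDER` and `choose_quant` are illustrative; sizes use the nominal bytes per parameter from the table, and `budget_gb` is assumed to be the memory left for weights after the KV cache and overhead have been set aside.

```python
from typing import Optional

# Quantized formats from largest to smallest, with nominal bytes per parameter.
LADDER = [
    ("Q8_0", 1.0),
    ("Q6_K", 0.75),
    ("Q5_K_M", 0.625),
    ("Q4_K_M", 0.5),
    ("Q3_K_M", 0.375),
    ("Q2_K", 0.25),
]


def choose_quant(params_billion: float, budget_gb: float,
                 quality_critical: bool = False) -> Optional[str]:
    """Start at Q4_K_M; move up only for quality, move down only if it does not fit."""
    names = [name for name, _ in LADDER]
    size = {name: params_billion * bpp for name, bpp in LADDER}
    start = names.index("Q4_K_M")

    if size["Q4_K_M"] <= budget_gb:
        if quality_critical:
            # Take the largest format above Q4_K_M that still fits.
            for name in names[:start]:
                if size[name] <= budget_gb:
                    return name
        return "Q4_K_M"

    # Q4_K_M does not fit: step down, accepting the quality penalty explicitly.
    for name in names[start + 1:]:
        if size[name] <= budget_gb:
            return name
    return None  # Nothing fits; choose different hardware or a smaller model.


if __name__ == "__main__":
    print(choose_quant(8, budget_gb=20))
    print(choose_quant(8, budget_gb=20, quality_critical=True))
    print(choose_quant(70, budget_gb=30))
```

The function never moves up unless `quality_critical` is set, which keeps Q4_K_M as the default even when there is headroom that could instead go to a longer context.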

What This Means for Self-Hosted AI

The memory wall is not a temporary limitation. It is the architectural consequence of autoregressive decoding. It will not be solved by faster CUDA cores. It will be partially addressed by:

None of these eliminate the wall. They work within it.

Summary

| Question | Answer |
|----------|--------|
| What bottlenecks LLM decode speed? | Memory bandwidth — not compute |
| How much VRAM does a 7B Q4_K_M model need? | ~5–6 GB (weights + KV cache + overhead) |
| What happens when the model overflows VRAM? | Throughput drops 20–50× |
| Best consumer GPU for inference? | RTX 3090/4090 for NVIDIA; M4 Max for Apple |
| What quantization for daily use? | Q4_K_M as default; Q5_K_M or Q6_K if VRAM allows |
| When to use vLLM vs llama.cpp? | vLLM for multi-user throughput; llama.cpp for single-user or sovereign |

The memory wall is the first thing to understand about consumer inference. Everything else — engine choice, quantization strategy, hardware selection — is a derivation from that single fact.
