Autoregressive token generation reads the entire model from memory for every single token it produces. Not once per request — once per token.
An 8B model at Q4_K_M quantization is roughly 4.5 GB of weight data. For each token your GPU generates, it streams those 4.5 GB through its memory hierarchy. At 1,000 GB/s bandwidth, that is 4.5 ms per token — a theoretical ceiling of about 220 tokens per second. The actual ceiling is lower because real memory utilization on LLM workloads lands at 60–85% of the nominal bandwidth figure.
This is the memory wall. It is not a bug. It is the arithmetic of autoregressive decoding.
The implication: on any consumer hardware, improving inference speed means either (1) reducing how much weight data is streamed per token, or (2) increasing memory bandwidth. Everything else — faster CPUs, more CUDA cores, larger GPUs — is secondary until you have addressed the memory bandwidth constraint.
The dominant term in memory consumption is model weights:
```
weight_bytes = num_parameters × bytes_per_weight
```
Bytes per weight by quantization format:
| Format | Bytes/param | Notes |
|--------|-------------|-------|
| FP32 | 4.0 | Training reference; rare for inference |
| BF16/FP16 | 2.0 | Default for full-precision inference |
| Q8_0 | 1.0 | 8-bit; near-lossless |
| Q6_K | 0.75 | 6-bit; excellent quality |
| Q5_K_M | 0.625 | 5-bit; strong quality-size balance |
| Q4_K_M | 0.5 | 4-bit mixed; the practical standard |
| Q3_K_M | 0.375 | 3-bit; quality starts to degrade noticeably |
| Q2_K | 0.25 | 2-bit; use only where nothing else fits |
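The rule of thumb for Q4_K_M: model size in GB ≈ parameters in billions ÷ 2. Treat this as a lower bound: real Q4_K_M files come out somewhat larger, because the format keeps some tensors at higher precision and stores per-block scales. A 7B model is ~4.1 GB. A 13B is ~8 GB. A 70B is ~40 GB. These numbers are for weights only.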
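The weight formula and the table can be turned into a few lines of Python. This is a minimal sketch: the `BYTES_PER_PARAM` dictionary copies the nominal values from the table, and the function name `weight_gb` is chosen here for illustration. The nominal results come out below the ~4.1 GB, ~8 GB, and ~40 GB quoted above, so read them as a floor and not as exact file sizes.

```python
# Nominal bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q6_K": 0.75,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}


def weight_gb(params_billions: float, fmt: str) -> float:
    """Nominal weight size in GB (1 GB = 1e9 bytes) for a quantization format."""
    return params_billions * BYTES_PER_PARAM[fmt]


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B at Q4_K_M: about {weight_gb(size, 'Q4_K_M'):.1f} GB nominal")
```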
Total VRAM required:
```
total_vram ≈ weight_bytes + kv_cache_bytes + 1–2 GB overhead
```
The KV cache is a secondary but non-trivial cost. At 4K context, budget 10–20% on top of weights. At 32K context, the KV cache can rival the weights for large models.
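To put numbers on the KV cache term, the sketch below uses the usual per-token accounting for a transformer with grouped-query attention: one key and one value vector per layer per KV head. That formula is not stated in this document, so treat it as an assumption. The shape used in the example (32 layers, 8 KV heads, head dimension 128, FP16 cache) is assumed to match a Llama 3.1 8B-class model, and the 1.5 GB overhead is simply the midpoint of the 1–2 GB range above.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes for one sequence (assumed standard accounting)."""
    # Factor 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def total_vram_gb(weight_gb: float, kv_bytes: int, overhead_gb: float = 1.5) -> float:
    """total_vram ~ weights + KV cache + fixed overhead, all in GB (1e9 bytes)."""
    return weight_gb + kv_bytes / 1e9 + overhead_gb


if __name__ == "__main__":
    weights = 4.5  # GB, the 8B Q4_K_M figure used earlier in this document
    for ctx in (4096, 32768):
        kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
        print(f"context {ctx}: KV cache {kv / 1e9:.2f} GB, "
              f"total {total_vram_gb(weights, kv):.2f} GB")
```

Under these assumptions the cache is about 0.5 GB at 4K context and about 4.3 GB at 32K, which is where it starts to rival the 4.5 GB of weights.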
What Controls Speed
Decode tokens per second scales approximately as:
```
tok/s ≈ (memory_bandwidth_GBps × utilization) / weight_bytes_GB
```
Real GPU workloads achieve 60–75% utilization on llama.cpp, 70–85% on well-tuned MLX kernels.
This formula has two inputs you control: quantization (weight_bytes) and hardware selection (bandwidth). Once the hardware is chosen, its bandwidth ceiling is fixed, and quantization is the only lever left.
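The formula is simple enough to evaluate before buying hardware. The sketch below applies it to the 4.5 GB model and 1,000 GB/s figure used at the start of this document, at full utilization and at the two ends of the 60–85% range. The function name `decode_tok_per_s` is illustrative, not part of any library.

```python
def decode_tok_per_s(bandwidth_gbps: float, weight_gb: float,
                     utilization: float = 1.0) -> float:
    """Bandwidth-bound decode estimate: (bandwidth x utilization) / weight size."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return bandwidth_gbps * utilization / weight_gb


if __name__ == "__main__":
    for util in (1.0, 0.85, 0.60):
        rate = decode_tok_per_s(1000, 4.5, util)
        print(f"utilization {util:.0%}: about {rate:.0f} tok/s")
```

The estimate ignores KV cache reads and compute time, so it is an upper bound for a given utilization and not a prediction.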
Consumer Hardware Landscape (2026)
Apple Silicon
Apple's unified memory architecture is structurally different from discrete GPU systems. The CPU, GPU, and Neural Engine share a single memory pool with a single high-bandwidth bus. There is no VRAM separate from RAM: the inference engine draws on the same 16 GB, 32 GB, 64 GB, or 96 GB pool as the rest of the system, minus what macOS and other applications keep for themselves.
Bandwidth figures by chip:
| Chip | Memory bandwidth | Max unified RAM |
|------|-----------------|-----------------|
| M4 Pro | 273 GB/s | 64 GB |
| M4 Max | 546 GB/s | 128 GB |
| M3 Ultra | 819 GB/s | 512 GB |
| M2 Ultra | 800 GB/s | 192 GB |
Practical decode speeds on llama.cpp with Q4_K_M:
- M4 Pro, 24 GB, Llama 3.1 8B: ~60–80 tok/s
- M4 Max, 64 GB, Llama 3.1 70B: ~14–18 tok/s
- M3 Ultra, 192 GB, Llama 3.1 8B: ~145 tok/s
The key advantage of unified memory is not bandwidth; it is capacity. A Mac with 64 GB or more of unified memory can hold Llama 3.1 70B Q4_K_M (~40 GB of weights) entirely in memory; a 32 GB configuration cannot. No single consumer NVIDIA GPU can do this without CPU offload (which collapses throughput to 2–4 tok/s due to PCIe bottlenecks).
Engine choice on Apple Silicon:
- llama.cpp (Metal backend): broadest model and quantization support; best operational stability; recommended for single-user or sovereign deployments.
- MLX / mlx-lm: Apple's own ML framework, purpose-built for the M-series; achieves 21–87% higher throughput than llama.cpp Metal on many models; better for multi-user serving.
- Ollama: wraps llama.cpp; slightly lower throughput (roughly 10–20% behind bare llama.cpp); recommended when developer experience and API compatibility matter more than raw speed.
- vLLM (experimental Metal/MLX backend): continuous batching for multi-user workloads; still maturing on Apple Silicon.
A May 2026 benchmark on M5 Max (128 GB) found llama.cpp outperforming MLX by 10–24% for specific coding models — the "MLX is always faster on Mac" claim is not universal and depends on model architecture.
NVIDIA Consumer GPUs
NVIDIA's consumer line uses GDDR6, GDDR6X, or GDDR7 memory as discrete VRAM.
Key cards for LLM inference (2026):
| Card | VRAM | Bandwidth | Practical use |
|------|------|-----------|---------------|
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | 13B–34B Q4; competitive on bandwidth |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 13B–34B Q4; ~30 tok/s on 13B Q4 |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 34B Q4; ~161 tok/s on 8B 6-bit |
| RTX 4060 Ti 16 GB | 16 GB GDDR6 | 288 GB/s | 7B–8B Q4; budget single-card |
The RTX 3090 remains competitive in 2026 because bandwidth, not compute, is the bottleneck. It reaches 936 GB/s — nearly identical to the RTX 4090 — and frequently matches the newer card on tokens per second for memory-bound workloads. The upgrade case for 3090→4090 on inference alone is weak.
The hard performance discontinuity is not between card generations — it is the capacity cliff. A 12 GB card that runs a Q4_K_M 7B model fully in VRAM outperforms a 24 GB card that must offload to system RAM, even if both have identical bandwidth. PCIe 4.0 x16 delivers ~32 GB/s; GPU memory delivers 900+ GB/s. The moment a model overflows VRAM, decode throughput falls by 20–50× on large offloaded layers.
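The capacity cliff can be illustrated with a deliberately simplified two-tier model: weights that fit in VRAM are streamed at GPU bandwidth, and the remainder is streamed at the slow-path bandwidth. This is an assumption made for illustration. Real engines split work by layer and may run spilled layers on the CPU, so the sketch only shows why the slow term dominates. The 40 GB model, 24 GB card, 1,008 GB/s, and 32 GB/s figures are taken from this document.

```python
def seconds_per_token(weight_gb: float, vram_for_weights_gb: float,
                      gpu_bw_gbps: float, slow_bw_gbps: float) -> float:
    """Two-tier streaming model (simplified): fast part plus spilled part."""
    in_vram = min(weight_gb, vram_for_weights_gb)
    spilled = weight_gb - in_vram
    return in_vram / gpu_bw_gbps + spilled / slow_bw_gbps


if __name__ == "__main__":
    # 70B Q4_K_M (~40 GB) on a 24 GB card: 16 GB spills to the slow path.
    spill = seconds_per_token(40, 24, gpu_bw_gbps=1008, slow_bw_gbps=32)
    # Same weights if they hypothetically fit entirely in VRAM.
    fits = seconds_per_token(40, 40, gpu_bw_gbps=1008, slow_bw_gbps=32)
    print(f"with spill: {1 / spill:.1f} tok/s, fully resident: {1 / fits:.1f} tok/s")
```

In this model the 16 GB that spills costs about 0.5 s per token on its own, while the 24 GB held in VRAM costs about 0.024 s. The size of the slowdown depends on how much of the model spills.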
Engine choice on NVIDIA consumer hardware:
- llama.cpp (CUDA backend): single-user workloads; excellent quantization support; partial CPU offload when models exceed VRAM.
- Ollama: wraps llama.cpp CUDA; recommended for API-compatible local serving.
- vLLM: designed for multi-user throughput; most effective at 24 GB+ VRAM; PagedAttention and continuous batching yield 20–30× higher throughput than naive single-request serving at scale.
CPU-Only Inference
CPU inference via llama.cpp is viable for 7B models at Q4_K_M on modern CPUs with sufficient RAM, but speed is typically 2–8 tok/s — usable for background tasks, not interactive sessions.
The bottleneck is memory bandwidth. A dual-channel DDR5-6400 system has a theoretical peak of about 102 GB/s. At that bandwidth, a 4.5 GB Q4_K_M model has a theoretical ceiling of roughly 22 tok/s; real-world throughput lands in the single digits due to NUMA effects, OS overhead, and memory controller inefficiency.
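The CPU-side bandwidth figure can be derived from the memory spec. The sketch below assumes a typical desktop configuration: two memory channels, each 64 bits (8 bytes) wide. Workstation platforms with more channels would scale accordingly. The helper name `dram_peak_gbps` is hypothetical.

```python
def dram_peak_gbps(mega_transfers_per_s: float, channels: int = 2,
                   bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    peak = dram_peak_gbps(6400)  # DDR5-6400, dual channel (assumed)
    ceiling = peak / 4.5         # 4.5 GB of Q4_K_M weights read per token
    print(f"peak about {peak:.1f} GB/s, decode ceiling about {ceiling:.1f} tok/s")
```

The ceiling assumes 100% bandwidth utilization, which CPU inference does not reach in practice.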
CPU inference is not a fallback strategy for large models. A 70B Q4_K_M model at ~40 GB weight data runs at 2–4 tok/s on the fastest consumer CPUs. That is the arithmetic, not a software limitation.
Practical Sizing Guide
Decision sequence:
1. Does the model fit entirely in VRAM (or unified memory)? If not, either quantize down or choose different hardware. Partial offload collapses throughput.
2. What quantization preserves acceptable quality for your use case? Q4_K_M is the de facto standard — ~1–3% quality degradation on benchmarks, 75% VRAM reduction versus FP16. Q5_K_M or Q6_K if quality is paramount. Q3_K_M only when nothing else fits.
3. What is the bandwidth ceiling of your hardware? `tok/s ≈ (bandwidth × utilization) / weight_GB`. Estimate before you buy.
4. Single-user or multi-user? Single-user: llama.cpp or Ollama. Multi-user at scale: vLLM (NVIDIA) or MLX vLLM plugin (Apple Silicon).
Worked examples:
7B model, interactive single-user session:
70B model:
Quantization is not uniformly lossy. The degradation is model-specific, task-specific, and layer-specific. Q4_K_M uses mixed precision: in llama.cpp, most tensors are stored as Q4_K, while a subset of the more precision-sensitive tensors (part of the attention value and feed-forward down projections) is kept at Q6_K.
The practical hierarchy:
The decision formula: start at Q4_K_M. Move up if you have VRAM headroom and quality is critical. Move down only if the model does not fit, and accept the quality penalty explicitly — do not move down hoping the degradation is invisible.
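The decision formula can be written as a small selection routine. This is a sketch under stated assumptions: it uses the nominal bytes-per-parameter values from the table, a fixed KV cache allowance of 1 GB, and 1.5 GB of overhead, all of which are placeholders to adjust for a real model. The function `pick_quant` and its parameters are hypothetical names.

```python
# Nominal bytes per parameter, in ascending order of quality.
LADDER = ["Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K"]
BYTES = {"Q2_K": 0.25, "Q3_K_M": 0.375, "Q4_K_M": 0.5, "Q5_K_M": 0.625, "Q6_K": 0.75}


def pick_quant(params_billions, vram_gb, kv_gb=1.0, overhead_gb=1.5,
               quality_critical=False):
    """Start at Q4_K_M; move up only with headroom, down only if it does not fit."""

    def fits(fmt):
        return params_billions * BYTES[fmt] + kv_gb + overhead_gb <= vram_gb

    start = LADDER.index("Q4_K_M")
    if fits("Q4_K_M"):
        best = "Q4_K_M"
        if quality_critical:
            # Climb while the next format still fits.
            for fmt in LADDER[start + 1:]:
                if not fits(fmt):
                    break
                best = fmt
        return best
    # Q4_K_M does not fit: step down, accepting the quality penalty explicitly.
    for fmt in reversed(LADDER[:start]):
        if fits(fmt):
            return fmt
    return None  # nothing fits; choose different hardware or a smaller model


if __name__ == "__main__":
    print(pick_quant(8, 24))                         # default case
    print(pick_quant(8, 24, quality_critical=True))  # headroom available
    print(pick_quant(70, 32))                        # forced to step down
```

Returning `None` when nothing fits keeps the "choose different hardware" outcome explicit, so the caller does not silently fall back to partial offload.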
The memory wall is not a temporary limitation. It is the architectural consequence of autoregressive decoding. It will not be solved by faster CUDA cores. It will be partially addressed by:
None of these eliminate the wall. They work within it.
| Question | Answer |
|----------|--------|
| What bottlenecks LLM decode speed? | Memory bandwidth — not compute |
| How much VRAM does a 7B Q4_K_M model need? | ~5–6 GB (weights + KV cache + overhead) |
| What happens when the model overflows VRAM? | Throughput drops 20–50× |
| Best consumer GPU for inference? | RTX 3090/4090 for NVIDIA; M4 Max for Apple |
| What quantization for daily use? | Q4_K_M as default; Q5_K_M or Q6_K if VRAM allows |
| When to use vLLM vs llama.cpp? | vLLM for multi-user throughput; llama.cpp for single-user or sovereign |
The memory wall is the first thing to understand about consumer inference. Everything else — engine choice, quantization strategy, hardware selection — is a derivation from that single fact.
1. Wallace, E. et al., 2024. "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions." arXiv:2404.13208. OpenAI. https://arxiv.org/html/2404.13208v1
2. Locara, 2026. "LLM Memory Math — Parameters, KV Cache, Bandwidth, and What Actually Fits." https://locara.dev/docs/notes/llm-memory-math
3. AI/TLDR, 2026. "LLM Hardware Guide: How Much RAM and VRAM You Need." https://ai-tldr.dev/learn/local-open-models/running-models-locally/local-llm-hardware-requirements/
4. Hardware Corner, 2026. "How Memory Chips Determine GPU Memory Bandwidth for Local LLM Inference." https://www.hardware-corner.net/gddr-chips-and-llm-bandwidth/
5. ai.rs, 2026. "The GPU Memory Wall: Why Inference Hardware Matters." https://ai.rs/ai-developer/gpu-memory-wall-inference-hardware
6. Hiesch, A., 2026. "llama.cpp on Apple Silicon: 29 GGUF Benchmarks and a 200 t/s Surprise." https://hiesch.eu/blog/llamacpp-benchmarks-speculative-decoding/
7. arXiv:2601.19139, 2026. "Native LLM and MLLM Inference at Scale on Apple Silicon." https://doi.org/10.48550/arxiv.2601.19139
8. arXiv:2605.00519, 2026. "Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference." https://arxiv.org/html/2605.00519
9. Khamdee, P., 2026. "How Much VRAM Does Your LLM Actually Need? A Field Guide to Sizing GPUs." https://pkhamdee.blog/2026/06/17/how-much-vram-does-your-llm-actually-need-a-field-guide-to-sizing-gpus/
10. RunPod, 2026. "GPU Memory Sizing Guide for LLM Inference." https://www.runpod.io/articles/guides/gpu-memory-sizing-guide-for-llm-inference
11. Contracollective, 2026. "llama.cpp vs MLX vs Ollama vs vLLM: Local AI Inference for Apple Silicon in 2026." https://contracollective.com/blog/llama-cpp-vs-mlx-ollama-vllm-apple-silicon-2026