You load a 7B model quantized to 4 bits. It fits in about 4 GB of VRAM. You think you have headroom.
Then you try to serve four concurrent requests at 8K context, and everything stalls.
The model weights are not the bottleneck. The KV cache is.
---
Every transformer layer, on every forward pass, computes three matrices from the input: Query, Key, and Value. During autoregressive generation — where each new token depends on all previous tokens — recomputing the Keys and Values for every prior token on every step would be catastrophically expensive.
The KV cache stores those intermediate Key and Value tensors so they do not need to be recomputed. Generation becomes a rolling append rather than a full recompute.
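The rolling append can be sketched in a few lines of plain Python. This is a toy single-head attention, not how any real framework implements it: `project` is a hypothetical stand-in for the learned Key, Value, and Query projections, and the scale factors are arbitrary. It only shows that appending one Key/Value pair per step gives the same outputs as rebuilding every pair from scratch.

```python
import math


def project(token, scale):
    """Hypothetical stand-in for a learned linear projection: scales the vector."""
    return [scale * x for x in token]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    norm = 1.0 / math.sqrt(len(query))
    scores = [norm * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_with_cache(tokens):
    """Append one Key and one Value per step; never recompute old ones."""
    cached_keys, cached_values, outputs = [], [], []
    for token in tokens:
        cached_keys.append(project(token, 0.5))
        cached_values.append(project(token, 2.0))
        outputs.append(attend(project(token, 1.0), cached_keys, cached_values))
    return outputs


def decode_without_cache(tokens):
    """Rebuild every Key and Value for the whole prefix on every step."""
    outputs = []
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(t, 0.5) for t in prefix]
        values = [project(t, 2.0) for t in prefix]
        outputs.append(attend(project(prefix[-1], 1.0), keys, values))
    return outputs
```

For n tokens, the cached loop performs 2n Key/Value projections. The uncached loop performs n(n+1). The price of the saving is that `cached_keys` and `cached_values` stay resident for the life of the request.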
This is correct and necessary. It is also the reason your VRAM disappears.
---
For a single request, the KV cache size at a given context length is:
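```
KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element
```
The factor of 2 covers Keys and Values. `num_kv_heads` is the number of key/value heads, which equals the attention head count unless the model uses grouped-query attention.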
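For a Llama-style model at fp16 with 32 layers, 32 key/value heads, and head dimension 128 (the shape of Llama 2 7B; Llama 3 8B uses grouped-query attention with only 8 key/value heads, so its cache is a quarter of these figures):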
- At 2K context: ~1.0 GB
- At 8K context: ~4.0 GB
- At 32K context: ~16.0 GB
That is per request. Four concurrent requests at 8K context require 16 GB of KV cache alone, on top of the model weights.
The fp16 weights of an 8B-parameter model consume ~16 GB (2 bytes per parameter). On a 24 GB GPU, that leaves roughly 8 GB for KV cache. At ~4 GB per 8K-context request, that is two concurrent requests. Not four. Not eight.
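The arithmetic above is easy to script. The sketch below is a rough estimator, not a profiler: the function names are illustrative, sizes are in GiB (1024³ bytes), and it ignores activations, framework overhead, and allocator fragmentation, so real headroom will be lower.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   bytes_per_element=2):
    """KV cache for one request: Keys plus Values, every layer, every token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_length * bytes_per_element)


def max_concurrent_requests(vram_bytes, weight_bytes, per_request_bytes):
    """How many full-length requests fit in what the weights leave behind."""
    free_bytes = vram_bytes - weight_bytes
    return max(0, free_bytes // per_request_bytes)


# 32 layers, 32 key/value heads, head dimension 128, fp16, 8K context.
per_request = kv_cache_bytes(32, 32, 128, 8192)
slots = max_concurrent_requests(24 * GIB, 16 * GIB, per_request)
print(f"{per_request / GIB:.1f} GiB per request, {slots} concurrent requests")
```

Changing `num_kv_heads` from 32 to 8, as in models with grouped-query attention, cuts the per-request figure to a quarter. Check the model's config before trusting any number.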
---
Why Batching Collapses Without KV Management
Naive LLM serving allocates KV cache contiguously, one block per request, at maximum possible context length. This causes two problems.
First, internal fragmentation: a request allocated for 8K tokens but only using 1K wastes 7K worth of cache memory.
Second, external fragmentation: as requests finish and new ones start at different lengths, the available memory becomes a patchwork of unusable gaps.
PagedAttention (introduced with vLLM in 2023) solved this by managing the KV cache the way an OS manages physical memory — in non-contiguous pages, allocated on demand, freed on completion. The result: higher batch sizes from the same hardware, without accuracy loss.
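The paging idea fits in a small allocator. This is a minimal sketch of on-demand page allocation, not vLLM's implementation: the class name, method names, and the 16-token page size are all assumptions made for illustration, and it tracks only bookkeeping, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy page table: requests grow one page at a time from a shared pool."""

    def __init__(self, total_pages, page_size=16):
        self.page_size = page_size          # tokens per page (illustrative)
        self.free_pages = list(range(total_pages))
        self.page_tables = {}               # request id -> list of page ids
        self.lengths = {}                   # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token, taking a new page only when needed."""
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.page_size:   # no page yet, or last is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())     # pages need not be adjacent
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return every page of a finished request to the pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(total_pages=8)
for _ in range(20):
    allocator.append_token("request-a")   # 20 tokens occupy 2 pages, not 8K worth
```

Waste is now bounded by one partly filled page per request. Released pages can be reused by any request, in any order, so gaps between requests stop mattering.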
---
What This Means in Practice
Choosing quantization changes your KV budget. Loading weights in INT4 frees VRAM — but if your serving framework keeps KV cache in fp16, the cache still dominates. Some frameworks (llama.cpp, vLLM) support KV quantization independently of weight quantization. Use it.
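A rough budget calculation makes the interaction visible. The sketch below assumes a 7B-parameter model with 32 layers, 32 key/value heads, and head dimension 128, and treats quantized formats as exactly 4 or 8 bits per value. Real formats carry scale metadata, so actual sizes run slightly higher. The function names are illustrative.

```python
GIB = 1024 ** 3


def weight_bytes(num_params, bits_per_weight):
    """Approximate weight footprint, ignoring quantization metadata."""
    return num_params * bits_per_weight // 8


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_element):
    """Keys plus Values for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_element // 8


def tokens_that_fit(vram_bytes, num_params, weight_bits, kv_bits,
                    num_layers=32, num_kv_heads=32, head_dim=128):
    """Total tokens of active context the leftover VRAM can hold."""
    free_bytes = vram_bytes - weight_bytes(num_params, weight_bits)
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_bits)
    return max(0, free_bytes // per_token)


params = 7_000_000_000
for label, weight_bits, kv_bits in [
    ("fp16 weights, fp16 cache", 16, 16),
    ("INT4 weights, fp16 cache", 4, 16),
    ("INT4 weights, 8-bit cache", 4, 8),
]:
    budget = tokens_that_fit(24 * GIB, params, weight_bits, kv_bits)
    print(f"{label}: {budget} tokens")
```

Quantizing the weights raises the token budget once. Quantizing the cache doubles whatever budget is left, which is why the two settings are worth choosing separately.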
Context length is not free. Doubling the context length doubles the KV cache per request. If you are running Ollama locally and set `num_ctx 32768` by default, you are reserving 16 GB per request on a model that may not need it. Set context length to match your actual use case.
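With Ollama, the context length can also be set per request through the `options` field of the REST API. The sketch below only builds the request and does not send it. The model name `llama3` and the prompt are placeholders, and the host assumes Ollama's default local port.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_ctx,
                           host="http://localhost:11434"):
    """Build a POST to Ollama's /api/generate with an explicit context length."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},   # context window for this request
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder model name; 4096 tokens instead of a blanket 32768.
request = build_generate_request(
    "llama3", "Summarize the KV cache in one sentence.", num_ctx=4096)
# To send it, a local Ollama server must be running:
# response = urllib.request.urlopen(request)
```

Sizing `num_ctx` to the longest prompt plus the longest expected reply keeps the reservation close to what the workload actually uses.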
Concurrency is bounded by cache, not compute. On a 24 GB consumer GPU, the practical ceiling for concurrent 8K-context requests with a 7B model holding fp16 weights and an fp16 cache is two to three. Planning for more requires smaller context windows, KV quantization, or offloading.
Throughput and latency pull in opposite directions. Large batches improve throughput — more tokens generated per second across all users. But each additional request in a batch competes for KV cache memory and increases per-request latency. There is no free optimization here. Pick the metric that matters for your workload.
---
The KV cache is the real memory budget. Everything else — weight quantization, model selection, hardware sizing — is secondary to understanding how many tokens of active context your system can hold simultaneously.
Load the model. Then calculate your KV budget. Then decide how many concurrent users you can serve.
That order matters.
1. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 2023), pp. 611–626. https://arxiv.org/abs/2309.06180
2. Nawrot, P., Tworkowski, S., Tyber, M., et al. (2025). Inference-Time Hyper-Scaling with KV Cache Compression. arXiv:2506.05345. https://arxiv.org/abs/2506.05345
3. Ye, C., et al. (2025). Characterizing the Behavior and Impact of KV Caching on Transformer Inferences under Concurrency. Illinois Institute of Technology, Gnosis Research Center. https://grc.iit.edu/publications/ye-2025-characterizing-behavior-f631/
4. Tang, X., et al. (2024). XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference. arXiv:2412.05896. https://arxiv.org/abs/2412.05896