Use Ctrl+P (or Cmd+P) to save as PDF. Back to paper
Every GPU serving an LLM has the same problem: it does not know how long the output will be.
The KV cache — the stored key and value tensors for every token in a sequence — grows as generation proceeds. It starts small. It ends at some unknown length. Before vLLM, every serving system handled this the same way: reserve a contiguous block of memory sized to the maximum possible sequence length, upfront, for every active request.
That decision made sense for simplicity. It was catastrophic for efficiency.
A request that generates 100 tokens in a system configured for 2,048-token sequences holds a 2,048-token slab for its entire lifetime. The unused 1,948 tokens of space are stranded — the scheduler cannot give them to another request. Across a heterogeneous batch, this adds up. The vLLM authors measured that naive contiguous allocation achieves 20–38% GPU memory utilization. The rest is fragmentation.
The solution was not invented in 2023. It was invented in 1962.
Operating systems solved the same problem for RAM decades ago: virtual memory with paging. A program sees a contiguous address space. The OS maps logical pages to physical pages — non-contiguous, on-demand, dynamically allocated and reclaimed. The program does not need to know its maximum memory requirement in advance. The OS fills in physical pages as needed and reuses them as they are freed.
PagedAttention applies exactly this model to the KV cache.
Each sequence's KV cache is divided into fixed-size KV blocks — analogous to OS pages. A block holds the keys and values for a fixed number of tokens (typically 16). Blocks are mapped via a block table: logical block indices to physical block indices on GPU memory. Physical blocks do not need to be contiguous. They are allocated on demand as generation proceeds and freed immediately when a sequence completes.
The arithmetic is direct. With contiguous allocation, waste is proportional to the gap between actual and maximum sequence length. With PagedAttention, waste is bounded by the last block of each sequence — on average half a block. In practice: under 4% memory waste compared to 60–80% before.
That memory recovery is not a bookkeeping win. It translates directly into batch size. More sequences fit in memory simultaneously. More sequences in a batch means higher GPU utilization. Higher utilization means more throughput from the same hardware. The vLLM paper measured 2–4× throughput improvement over prior state-of-the-art systems.
PagedAttention also enables copy-on-write memory sharing. When parallel sampling generates N sequences from the same prompt, the prompt's KV blocks are shared — not duplicated — until the sequences diverge. The block table handles divergence via copy-on-write, exactly as OS page tables do for forked processes.
The insight is not exotic. It is that GPU memory management for LLM serving was, for years, stuck at a level of sophistication that operating systems left behind in the 1960s. The KV cache is dynamic, variable-length, and request-scoped. Contiguous pre-allocation is the wrong abstraction for that access pattern. Paging is the right one.
The hardware is new. The principle is not.
1. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. https://arxiv.org/abs/2309.06180
2. vLLM Team. (2023). vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. vLLM Blog. https://vllm.ai/blog/2023-06-20-vllm
3. vLLM Project. (2024). PagedAttention Design. vLLM Documentation. https://docs.vllm.ai/en/latest/design/paged_attention/
4. Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. ACM SOSP 2023. https://dl.acm.org/doi/10.1145/3600006.3613165