#vllm — devinfo.dev

inspiration
PagedAttention Is an OS Idea

Before PagedAttention, LLM serving systems wasted 60–80% of KV cache memory to fragmentation and over-reservation. The fix was not a new neural architecture: it was a 1960s operating systems concept, virtual memory paging, applied to an unexpected layer.
June 23, 2026
inspiration
Prefix Caching Is Free Throughput

Automatic Prefix Caching in vLLM reuses already-computed KV cache blocks across requests that share identical prefixes, delivering 30–50% throughput gains and up to 10x lower latency at zero engineering cost beyond a single configuration flag.
June 8, 2026
inspiration
Continuous Batching: The Throughput Multiplier

Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fixes this at the scheduler level — and the gains are not marginal.
June 7, 2026
whitepaper
Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM

llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.
May 25, 2026