Continuous Batching: The Throughput Multiplier

devinfo.dev — June 7, 2026

devinfo.dev:2026.0024

#inference #vllm #throughput #self-hosted

Traditional LLM serving uses static batching: it waits for every request in a batch to finish before starting the next batch. One long response holds everyone else back. The GPU sits idle. Throughput collapses.

Continuous batching solves this at the iteration level.

After every forward pass, the scheduler checks: did any sequence finish? If yes, evict it and slot in a new request immediately. The GPU never waits for stragglers. The batch stays full.
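The loop described above can be sketched as a small simulation. This is a toy model, not any real scheduler: `run_continuous_batching` and its parameters are hypothetical names, each request is reduced to a count of tokens it still has to generate, and one loop iteration stands in for one forward pass.

```python
from collections import deque


def run_continuous_batching(output_lengths, max_batch_size):
    """Simulate iteration-level scheduling.

    Returns (total_forward_passes, {request_id: pass_on_which_it_finished}).
    """
    waiting = deque(enumerate(output_lengths))  # (request_id, tokens_to_generate)
    running = {}      # request_id -> tokens still to generate
    finished_at = {}  # request_id -> forward pass on which it completed
    step = 0
    while waiting or running:
        # Fill any free slots before the next forward pass.
        while waiting and len(running) < max_batch_size:
            request_id, length = waiting.popleft()
            running[request_id] = length
        # One forward pass: every running sequence emits one token.
        step += 1
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] <= 0:
                # Evict right away; the slot is refilled on the next iteration.
                del running[request_id]
                finished_at[request_id] = step
    return step, finished_at


steps, finished_at = run_continuous_batching([2, 8, 2, 2, 2], max_batch_size=2)
print(steps, finished_at)
```

In this example the long request (8 tokens) keeps one slot for the whole run while the short requests cycle through the other slot. Static batching on the same arrival order would process the batches [2, 8], [2, 2], and [2] one after another, for 8 + 2 + 2 = 12 forward passes.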

This idea was formalized in Orca (Yu et al., OSDI 2022). The paper reports 36.9× throughput improvement over NVIDIA FasterTransformer on GPT-3 175B at the same latency level. The mechanism is called iteration-level scheduling — scheduling at each generation step, not per request.

One implementation detail matters: attention cannot be naively batched across sequences with different lengths. Orca introduced selective batching — batch everything (Linear, LayerNorm, Add, GeLU) except attention, which runs per-sequence. vLLM later solved this differently using PagedAttention, which handles variable-length KV caches in non-contiguous memory blocks.
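To make the non-contiguous block idea concrete, here is a minimal sketch of a block table that maps each sequence's token positions to fixed-size physical blocks. It illustrates the general paging idea only and is not vLLM's implementation; `PagedKVCache`, `append_token`, and `free` are hypothetical names, and no actual key or value tensors are stored.

```python
class PagedKVCache:
    """Toy block-table allocator: logical token positions -> physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # The last block is full (or the sequence has none yet).
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for seq_id in ["a", "b", "a", "a"]:  # interleaved decoding steps
    cache.append_token(seq_id)
table_a = list(cache.block_tables["a"])
table_b = list(cache.block_tables["b"])
cache.free("a")
```

Because sequences grow one block at a time, sequence "a" ends up in physical blocks 0 and 2 with "b" in between, and freeing "a" returns both blocks to the pool for whichever request is admitted next.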

What this means in practice:

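Static batching: GPU utilization often hovers around 30–40%, because finished sequences hold their slots until the longest one completes. Long tails kill throughput.
Continuous batching: GPU utilization can reach 80% or more. Throughput depends far less on how output lengths are distributed.
At 64+ concurrent requests, gains over static batching can exceed 30×.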
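The long-tail effect is easy to check with arithmetic. The sketch below counts batch slots rather than measuring real GPU utilization, so treat it as a rough proxy; `static_batch_utilization` is a hypothetical helper and the request lengths are made up for illustration.

```python
def static_batch_utilization(output_lengths, batch_size):
    """Fraction of batch slots doing useful work under static batching."""
    useful = sum(output_lengths)
    occupied = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        # Every slot is held until the longest sequence in the batch finishes.
        occupied += batch_size * max(batch)
    return useful / occupied


# Seven short responses and one long one in a single batch of 8.
lengths = [20] * 7 + [400]
utilization = static_batch_utilization(lengths, batch_size=8)
print(f"slot utilization: {utilization:.3f}")
```

Here 540 useful token slots are spread over 8 × 400 = 3200 occupied slots, so under 17% of the batch capacity does real work. With iteration-level scheduling the seven freed slots would be handed to waiting requests instead.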

vLLM implements continuous batching as its default scheduler. The scheduler runs ahead of each engine step, deciding which requests enter the next forward pass. This is a core reason vLLM outperforms naive HuggingFace Transformers pipelines by roughly 10–24× on the same hardware.

If you are self-hosting a model and not using a scheduler that implements iteration-level batching, you are leaving most of your hardware idle.

The upgrade is not a config change. It is a choice of serving stack.

References

1. Yu, G., Jeong, J. S., Kim, G., Kim, S., & Chun, B. G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. USENIX OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu

2. Anyscale. (2023). Continuous Batching: Achieve 23x LLM Inference Throughput & Reduce p50 Latency. https://www.anyscale.com/blog/continuous-batching-llm-inference

3. vLLM Project. (2025). vLLM GitHub Repository — Core Features: Continuous Batching, PagedAttention. https://github.com/vllm-project/vLLM

4. vLLM Team. (2025). Large Scale Serving: DeepSeek @ 2.2k tok/s/H200. vLLM Blog. https://vllm.ai/blog/2025-12-17-large-scale-serving

5. RunPod. (2024). vLLM Explained: PagedAttention and Continuous Batching. https://www.runpod.io/articles/guides/vllm-pagedattention-continuous-batching

Cite as

devinfo.dev. (2026). "Continuous Batching: The Throughput Multiplier." devinfo.dev:2026.0024. https://devinfo.dev/d/2026.0024

devinfo.dev | https://devinfo.dev/d/2026.0024
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
https://devinfo.dev
