
Continuous Batching: The Throughput Multiplier

inspiration | devinfo.dev | June 7, 2026 | devinfo.dev:2026.0024

Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fixes this at the scheduler level — and the gains are not marginal.

Traditional LLM serving uses static batching: it waits for every request in a batch to finish before starting the next batch. One long response holds back every other request. Slots whose sequences have already finished sit idle on the GPU, and throughput drops.
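A back-of-the-envelope sketch of that waste, assuming decode steps dominate and every sequence costs one batch slot per step. The function name and the example lengths are hypothetical, chosen only to illustrate the arithmetic.

```python
def static_batch_utilization(output_lengths):
    """Fraction of batch slots doing useful work under static batching.

    The batch runs for max(output_lengths) decode steps; a sequence that
    finishes early leaves its slot empty for the remaining steps.
    """
    steps = max(output_lengths)
    useful = sum(output_lengths)
    return useful / (steps * len(output_lengths))


# Hypothetical batch: three short replies and one long one (tokens generated).
lengths = [20, 30, 50, 500]
utilization = static_batch_utilization(lengths)
print(f"slot utilization: {utilization:.0%}")  # prints "slot utilization: 30%"
```

In this made-up batch, a single 500-token response keeps three slots empty for most of the run, so 70% of the slot-steps do no useful work.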

Continuous batching solves this at the iteration level.

After every forward pass, the scheduler checks: did any sequence finish? If yes, evict it and slot in a new request immediately. The GPU never waits for stragglers. The batch stays full.
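The loop described above can be sketched as a toy simulation. It is not Orca's or vLLM's scheduler: it ignores prefill cost and memory limits, treats every forward pass as producing one token per active sequence, and the function names are hypothetical.

```python
from collections import deque


def static_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_steps(lengths, batch_size):
    """Decode steps when finished sequences are replaced after every step."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running = []               # tokens still to generate, per active sequence
    steps = 0
    while waiting or running:
        # Fill any free slots before the next forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass: every active sequence emits one token.
        steps += 1
        # Evict sequences that just emitted their last token.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: one long request followed by seven short ones.
lengths = [100, 10, 10, 10, 10, 10, 10, 10]
print(static_steps(lengths, batch_size=2))      # 130
print(continuous_steps(lengths, batch_size=2))  # 100
```

With two slots, the static schedule pays for the long request and then three more full batches, while the continuous schedule streams all seven short requests through the second slot while the long one is still running.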

This idea was formalized in Orca (Yu et al., OSDI 2022). The paper reports a 36.9× throughput improvement over NVIDIA FasterTransformer on GPT-3 175B at the same latency level. The mechanism is called iteration-level scheduling: the scheduler makes decisions at each generation step, not once per request.

One implementation detail matters: attention cannot be naively batched across sequences with different lengths. Orca introduced selective batching, which batches the other operations (Linear, LayerNorm, Add, GeLU) across sequences and runs attention per sequence. vLLM later took a different approach with PagedAttention, which stores each sequence's variable-length KV cache in fixed-size blocks that do not need to be contiguous in memory.
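To make the non-contiguous block idea concrete, here is a minimal sketch of a per-sequence block table. It is a toy allocator under stated assumptions, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the tiny block counts are all hypothetical, and no actual key or value tensors are stored.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence's tokens to fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.num_tokens = {}     # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:   # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.num_tokens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=3, block_size=2)
for _ in range(3):
    cache.append_token("a")    # "a" takes blocks 0 and 1
cache.append_token("b")        # "b" takes block 2
cache.free("a")                # "a" finishes; blocks 0 and 1 are reusable
for _ in range(3):
    cache.append_token("b")    # "b" grows into a freed, non-adjacent block
print(cache.block_tables["b"])  # [2, 0]
```

Because a sequence only needs a table of block ids, a new request can reuse whatever blocks an evicted sequence released, without requiring one contiguous region sized for the longest possible output.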

What this means in practice:

vLLM implements continuous batching as its default scheduler. The scheduler runs before each engine step and decides which requests enter the next forward pass. Combined with PagedAttention, this is a core reason vLLM has been reported to deliver up to 24× the throughput of naive HuggingFace Transformers pipelines on the same hardware.

If you are self-hosting a model and not using a scheduler that implements iteration-level batching, you are likely leaving much of your GPU capacity unused under concurrent load.

The upgrade is not a config change. It is a choice of serving stack.

References

1. Yu, G., Jeong, J. S., Kim, G., Kim, S., & Chun, B. G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. USENIX OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu

2. Anyscale. (2023). Continuous Batching: Achieve 23x LLM Inference Throughput & Reduce p50 Latency. https://www.anyscale.com/blog/continuous-batching-llm-inference

3. vLLM Project. (2025). vLLM GitHub Repository — Core Features: Continuous Batching, PagedAttention. https://github.com/vllm-project/vLLM

4. vLLM Team. (2025). Large Scale Serving: DeepSeek @ 2.2k tok/s/H200. vLLM Blog. https://vllm.ai/blog/2025-12-17-large-scale-serving

5. RunPod. (2024). vLLM Explained: PagedAttention and Continuous Batching. https://www.runpod.io/articles/guides/vllm-pagedattention-continuous-batching