The Scheduler Is Not the Model

You swap in a faster model. Responses get slower.

This is not a paradox. It is a scheduling problem.

---

Every LLM inference engine has two jobs. First, compute — the arithmetic of matrix multiplications through transformer layers. Second, scheduling — the decision about which requests run, in what order, using how many tokens, right now.

Most engineers think about the first job. The second one determines your actual latency.

What the scheduler decides

At each iteration, the scheduler answers three questions:

1. Which requests are in the batch?

2. How many tokens does each request contribute?

3. What happens to requests that don't fit?

These decisions happen on every single forward pass. The answers compound across every request your system serves.

The prefill problem

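Prefill is compute-bound: it processes every prompt token in parallel. Decode is memory-bandwidth-bound: each step emits one token per request but still has to read the weights and the KV cache. They contend for the same hardware inside every batch.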

In the naive case — process one prefill completely before starting decode — a single long prompt blocks every other request. A user submitting a 16,000-token document freezes all shorter requests behind it. This is head-of-line blocking, and it is entirely a scheduler artifact.
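
To make the blocking concrete, here is a minimal sketch under a deliberately crude assumption: prefill cost grows linearly with prompt length at a fixed rate, and each prompt is prefilled whole, in arrival order. The function name and the `tokens_per_ms` rate are hypothetical, chosen for illustration rather than measured.

```python
def fcfs_first_token_times(prompt_lengths, tokens_per_ms=10.0):
    """Time (ms) at which each request finishes prefill when prompts are
    processed whole, one after another, in arrival order."""
    clock = 0.0
    finish_times = []
    for n_tokens in prompt_lengths:
        clock += n_tokens / tokens_per_ms  # assumed linear prefill cost
        finish_times.append(clock)
    return finish_times


# A 16,000-token document arrives just ahead of two 200-token prompts.
blocked = fcfs_first_token_times([16_000, 200, 200])
# The same two short prompts with nothing in front of them.
unblocked = fcfs_first_token_times([200, 200])

print(blocked)    # [1600.0, 1620.0, 1640.0]
print(unblocked)  # [20.0, 40.0]
```

Under these assumptions the short requests do the same amount of work in both cases; only their position in the queue changes.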

vLLM's chunked prefill removes the blocking. Instead of processing the full prompt in one pass, the scheduler splits it across iterations. Each iteration advances all pending decode requests first, then applies whatever token budget remains to the next prefill chunk. With 512 tokens of budget left for prefill, a 16,000-token prompt becomes 32 iterations (31 full chunks plus a final 128-token chunk), interleaved with decode rather than blocking it.
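
The chunk arithmetic for the 16,000-token example can be checked with a short sketch. `chunk_prompt` is a hypothetical helper, not a vLLM function, and it assumes a fixed 512 tokens of budget are available for prefill on every iteration.

```python
def chunk_prompt(prompt_tokens, chunk_size):
    """Split a prompt into per-iteration prefill chunks of at most chunk_size tokens."""
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        take = min(chunk_size, remaining)
        chunks.append(take)
        remaining -= take
    return chunks


chunks = chunk_prompt(16_000, 512)
print(len(chunks))   # 32
print(chunks[-1])    # 128
```

In a real batch the prefill share shrinks as more decode requests claim budget first, so the iteration count would be somewhat higher than this fixed-chunk figure.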

The result is measurable. vLLM's documentation reports P99 inter-token latency dropping from ~50ms to ~15ms under mixed load when chunked prefill is enabled. That is not a model improvement. The weights did not change.

Decode starvation is a scheduling failure

Without chunked prefill, a large prefill occupies the entire batch for a forward pass far longer than a normal decode step, and a burst of long prompts can do so for several passes in a row. Decode requests, the users waiting for their next token, receive nothing in the meantime. Inter-token latency spikes. The model is producing tokens correctly. The scheduler is distributing them incorrectly.

The vLLM V1 scheduler addresses this directly: it schedules decode requests first, then fills the remaining max_num_batched_tokens budget with prefill. An incoming prefill, however large, cannot starve decoding.
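
A decode-first iteration can be sketched in a few lines. This is a toy model of the policy described above, not vLLM's actual scheduler code: `schedule_step` and its data structures are hypothetical, and it omits KV-cache accounting and the hand-off of a finished prefill into the decode set.

```python
from collections import deque


def schedule_step(decoding, prefilling, max_num_batched_tokens):
    """Plan one iteration of a toy decode-first scheduler.

    decoding:   list of request ids, each needing exactly one token
    prefilling: deque of [request_id, remaining_prompt_tokens], arrival order
    Returns {request_id: tokens scheduled this iteration}.
    """
    budget = max_num_batched_tokens
    batch = {}
    # 1. Decode requests claim one token each before any prefill is considered.
    for rid in decoding:
        if budget == 0:
            break
        batch[rid] = 1
        budget -= 1
    # 2. Whatever budget is left goes to prefill chunks, oldest prompt first.
    while budget > 0 and prefilling:
        entry = prefilling[0]
        take = min(budget, entry[1])
        batch[entry[0]] = take
        entry[1] -= take
        budget -= take
        if entry[1] == 0:
            prefilling.popleft()  # prompt fully prefilled
    return batch


decoding = ["a", "b", "c"]
prefilling = deque([["doc", 16_000]])
batch = schedule_step(decoding, prefilling, max_num_batched_tokens=512)
print(batch)             # {'a': 1, 'b': 1, 'c': 1, 'doc': 509}
print(prefilling[0][1])  # 15491
```

The returned batch answers the first two scheduler questions (who runs, and with how many tokens); anything left in `prefilling` is the answer to the third.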

Preemption is the escape valve

When memory pressure forces a decision (keep a running request or admit a new one), the scheduler may have to preempt. Preemption means evicting a request's KV cache from GPU memory, either by swapping it out to CPU memory or by discarding it and recomputing it when the request resumes.
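
The recomputation flavor of preemption can be sketched as simple block bookkeeping. Everything here is an assumption made for illustration: the function name, the block counts, and the victim policy (evict the most recently admitted request first) are not taken from any particular engine.

```python
def free_blocks_by_preemption(running, free_blocks, needed_blocks):
    """Preempt running requests until needed_blocks KV-cache blocks are free.

    running: list of (request_id, blocks_held), oldest admission first
    Returns (still_running, preempted_ids, free_blocks_after).
    """
    running = list(running)  # work on a copy; the caller's list is untouched
    preempted = []
    while free_blocks < needed_blocks and running:
        rid, held = running.pop()  # assumed policy: newest request is the victim
        free_blocks += held        # recompute-style: its KV blocks are dropped
        preempted.append(rid)
    return running, preempted, free_blocks


running = [("r1", 40), ("r2", 30), ("r3", 20)]
still_running, preempted, free_after = free_blocks_by_preemption(
    running, free_blocks=5, needed_blocks=30
)
print(still_running)  # [('r1', 40)]
print(preempted)      # ['r3', 'r2']
print(free_after)     # 55
```

A preempted request in this sketch would return to the waiting queue and pay for its prefill again later, which is the cost that makes preemption an escape valve rather than a routine operation.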

This is the same idea operating systems use when they context-switch processes. The insight behind FastServe (NSDI '26) is that iteration-level preemption in LLM serving enables a variant of shortest-remaining-processing-time (SRPT) scheduling, which cuts average job completion time by prioritizing the requests closest to finishing rather than the earliest to arrive.

FCFS (first-come, first-served) is the default in most systems. It is simple and fair. It is also provably suboptimal for average completion time when request lengths vary widely: one long job at the head of the queue delays every short job behind it.
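
A small worked comparison shows the gap. The sketch assumes all jobs arrive at the same instant with known service times, in which case SRPT reduces to shortest-job-first; the job lengths are made up, and a real scheduler does not know output lengths in advance.

```python
def mean_completion_time(service_times):
    """Average completion time when jobs run back to back in the given order."""
    clock = 0
    total = 0
    for s in service_times:
        clock += s      # this job finishes at the running clock
        total += clock
    return total / len(service_times)


jobs = [100, 5, 5]                          # arrival order: one long job, two short
fcfs = mean_completion_time(jobs)           # run in arrival order
sjf = mean_completion_time(sorted(jobs))    # run shortest first

print(fcfs)           # 105.0
print(round(sjf, 2))  # 41.67
```

Total work is identical in both orderings; only the order changes, and the average wait drops by more than half.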

What this means in practice

The max_num_batched_tokens parameter is a scheduler knob, not a model parameter. Tuning it changes your latency-throughput tradeoff without touching a single weight:

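Smaller values (e.g., 2048): lower inter-token latency, because less prefill work shares each iteration with decode; prefill takes more iterations
Larger values (e.g., 16384 and up): better time-to-first-token, because more prefill tokens fit in each batch
For throughput-optimized workloads: set it above 8192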

The model does not know this parameter exists. The scheduler owns it entirely.
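
The tradeoff can be sketched by sweeping the budget over a fixed workload. This is a toy model, not a benchmark: `prefill_profile` is a hypothetical helper, and the 16,000-token prompt and 32 concurrent decode requests are assumed values.

```python
import math


def prefill_profile(prompt_tokens, max_num_batched_tokens, num_decoding):
    """Return (iterations to finish prefill, largest prefill chunk per iteration)
    when each decode request reserves one token of budget first."""
    per_iter = max_num_batched_tokens - num_decoding
    if per_iter <= 0:
        raise ValueError("budget leaves no room for prefill")
    iterations = math.ceil(prompt_tokens / per_iter)
    return iterations, min(per_iter, prompt_tokens)


profiles = {
    budget: prefill_profile(16_000, budget, num_decoding=32)
    for budget in (2048, 8192, 16384)
}
print(profiles)
# {2048: (8, 2016), 8192: (2, 8160), 16384: (1, 16000)}
```

Fewer iterations means the first token arrives sooner; a larger chunk means each of those iterations is heavier, so the decode requests sharing it wait longer between tokens.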

The principle

When LLM serving performance falls short of spec, engineers reach for a better model. Sometimes that is correct. More often, the model is fine and the scheduler is wrong.

The scheduler is not a configuration detail. It is the control plane of your inference system. It determines which users wait, for how long, and why.

Understand it before you upgrade the model.

References

vLLM Project. "Optimization and Tuning — Chunked Prefill." vLLM Documentation. https://docs.vllm.ai/en/stable/configuration/optimization/
Wang, Audrey. "Understanding vLLM Scheduling: Token Budgets, Chunked Prefill, and Policies." Medium, March 2026. https://audreywongkg.medium.com/understanding-vllm-scheduling-token-budgets-chunked-prefill-and-policies-2c879e3980e3
Wu, Bingyang et al. "FastServe: Iteration-Level Preemptive Scheduling for Large Language Model Inference." NSDI '26. https://www.usenix.org/system/files/conference/nsdi26/nsdi26spring_wu-bingyang_prepub.pdf
SqueezeBits. "vLLM vs TensorRT-LLM #4: Which Scheduler Wins?" Blog, October 2024. https://blog.squeezebits.com/vllm-vs-tensorrtllm-4-which-scheduler-wins--33083
vLLM Project. "Scheduler API — PartialPrefillMetadata." vLLM v0.9.1 API Docs. https://docs.vllm.ai/en/v0.9.1/api/vllm/core/scheduler.html