LLM inference has two phases. Most engineers focus on one.
Decode is what you see: tokens streaming out, one after another. It is memory-bandwidth-bound, sequential by nature, and well-understood. Every inference optimization article talks about decode.
Prefill is what happens first. You send a prompt — possibly thousands of tokens long — and the model processes all of them in a single forward pass before it outputs anything. Prefill is compute-bound, not memory-bound. It is fast per-token but blocks everything.
The first token you receive is not fast. It is delayed by the entire prefill computation.
TTFT — time to first token — is the interval between sending the prompt and receiving the first output token. It is dominated by prefill.
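As a rough illustration, TTFT can be measured on the client side by timing any token stream. The sketch below is not tied to a particular serving API: `measure_ttft` accepts any iterable of tokens, and `fake_stream` is a hypothetical stand-in that sleeps to mimic a prefill delay followed by decode steps.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, mean_inter_token_seconds, token_count)."""
    start = time.perf_counter()
    first = None
    last = start
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now  # arrival of the first token ends the TTFT interval
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    itl = (last - first) / (count - 1) if count > 1 else 0.0
    return ttft, itl, count


def fake_stream(prefill_s: float, decode_s: float, n: int) -> Iterator[str]:
    """Hypothetical stream: one long wait (prefill), then evenly spaced tokens."""
    time.sleep(prefill_s)
    for i in range(n):
        if i > 0:
            time.sleep(decode_s)
        yield f"tok{i}"


ttft, itl, n = measure_ttft(fake_stream(prefill_s=0.05, decode_s=0.005, n=5))
print(f"TTFT {ttft * 1000:.0f} ms, mean ITL {itl * 1000:.1f} ms over {n} tokens")
```

With a real client, the same function would wrap whatever streaming iterator the client library returns.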
On a long system prompt plus user message (say, 4,000 tokens), TTFT on a single A100 can run 200–500 ms, depending on model size and precision. The model is not slow. It is busy. It is running a full forward pass over every one of those tokens before it can say anything.
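A back-of-envelope estimate shows where a figure in that range can come from. The sketch below uses the common approximation of about 2 FLOPs per parameter per token and ignores the attention term that grows with sequence length. The 7B parameter count and the 50% utilization are assumptions chosen for illustration; 312 TFLOP/s is the A100's dense BF16 peak.

```python
def estimate_prefill_seconds(n_params: float, prompt_tokens: int,
                             peak_flops: float, utilization: float) -> float:
    """Approximate prefill time as (2 * params * tokens) / achieved FLOP/s."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    flops = 2.0 * n_params * prompt_tokens
    return flops / (peak_flops * utilization)


A100_BF16_PEAK = 312e12  # dense BF16 tensor-core peak, FLOP/s

# Assumed workload: 7B-parameter model, 4,000-token prompt, 50% utilization.
t = estimate_prefill_seconds(7e9, 4000, A100_BF16_PEAK, utilization=0.5)
print(f"estimated prefill: {t * 1000:.0f} ms")  # about 359 ms
```

Larger models or lower achieved utilization push the estimate up proportionally.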
The problem compounds under load. If one request is prefilling a long prompt, shorter requests queue behind it and wait — even though they could have started decoding milliseconds ago. This is called head-of-line blocking, and it is the first thing that breaks a responsive inference system at scale.
The standard scheduler processes prefill in full before moving to decode. Chunked prefill breaks that assumption.
Instead of processing a 4,000-token prompt in one shot, the scheduler splits it into chunks — say, 512 tokens per iteration — and interleaves those chunks with decode steps from other requests. The prefill still takes the same total compute. But it no longer monopolizes the GPU for the full duration.
The effect: inter-token latency (ITL) drops by 10–20% at moderate load and is nearly halved at high QPS, because decode requests are no longer fully blocked behind prefill. Throughput also improves, by up to 50% in some production deployments, because the GPU utilization profile flattens out.
The tradeoff: TTFT for the chunked request itself increases slightly, because the prefill is now spread across more iterations. You are trading one request's first-token latency for better latency across all other concurrent requests. At scale, this is almost always the right trade.
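The trade described above can be sketched with a toy scheduler model. This is not how any particular engine is implemented: the cost model (a fixed per-iteration overhead plus a linear cost per token) and its constants are assumptions picked only to make the shape of the trade visible.

```python
from typing import List, Tuple


def plan_iterations(prompt_tokens: int, chunk_size: int,
                    active_decodes: int) -> List[Tuple[int, int]]:
    """Split one prompt into prefill chunks; every iteration also runs one
    decode step for each already-active request.
    Returns (prefill_tokens, decode_tokens) per iteration."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    plan = []
    remaining = prompt_tokens
    while remaining > 0:
        take = min(chunk_size, remaining)
        plan.append((take, active_decodes))
        remaining -= take
    return plan


def iteration_ms(prefill_tokens: int, decode_tokens: int,
                 ms_per_token: float = 0.09, overhead_ms: float = 5.0) -> float:
    """Toy cost model (assumption): fixed overhead plus linear per-token cost."""
    return overhead_ms + ms_per_token * (prefill_tokens + decode_tokens)


unchunked = plan_iterations(4000, chunk_size=4000, active_decodes=8)
chunked = plan_iterations(4000, chunk_size=512, active_decodes=8)

for name, plan in (("unchunked", unchunked), ("chunked", chunked)):
    times = [iteration_ms(p, d) for p, d in plan]
    print(f"{name}: {len(plan)} iterations, "
          f"worst decode gap {max(times):.1f} ms, "
          f"prefill done after {sum(times):.1f} ms")
```

Under these assumed constants, the worst gap seen by the decoding requests falls from roughly 366 ms to roughly 52 ms, while the long request finishes prefill at about 406 ms instead of 366 ms, because it pays the per-iteration overhead eight times instead of once.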
vLLM exposes this as --enable-chunked-prefill, with --max-num-batched-tokens setting the per-iteration token budget that bounds the chunk size. NVIDIA TensorRT-LLM implements dynamic chunk sizing that auto-tunes based on GPU utilization metrics.
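For a launcher script, the two vLLM flags named above can be assembled as follows. This is a minimal sketch: `vllm_serve_args` is a hypothetical helper, the model name is a placeholder rather than a real checkpoint, and the 512-token budget is just the example value used earlier, not a recommendation.

```python
import shlex
from typing import List


def vllm_serve_args(model: str, chunk_tokens: int = 512) -> List[str]:
    """Build an argv list for `vllm serve` with chunked prefill enabled."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return [
        "vllm", "serve", model,
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", str(chunk_tokens),
    ]


# "my-org/my-model" is a hypothetical placeholder, not a real checkpoint.
print(shlex.join(vllm_serve_args("my-org/my-model")))
```

The resulting list can be passed to `subprocess.run` unchanged. Flag behavior and defaults vary between vLLM releases, so it is worth checking them against the installed version.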
Chunked prefill mitigates head-of-line blocking. It does not eliminate the fundamental tension: prefill is compute-bound, decode is memory-bandwidth-bound. They are different workloads running on the same hardware.
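The compute-bound versus memory-bound split can be made concrete with a roofline-style estimate. The sketch below considers only the weight matrix multiplies: each parameter is read once per forward pass and does about 2 FLOPs per token in the batch. KV-cache and activation traffic are ignored, and the A100 numbers are approximate (312 TFLOP/s dense BF16, roughly 2 TB/s on the 80 GB variant).

```python
def weight_matmul_intensity(tokens_in_pass: int,
                            bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass (16-bit weights)."""
    return 2.0 * tokens_in_pass / bytes_per_param


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity at which a device shifts from memory- to compute-bound."""
    return peak_flops / mem_bandwidth


A100_PEAK_FLOPS = 312e12   # dense BF16, FLOP/s
A100_BANDWIDTH = 2.0e12    # approximate HBM bandwidth, bytes/s


def classify(tokens_in_pass: int) -> str:
    """Label one forward pass by comparing its intensity to the ridge point."""
    ridge = ridge_point(A100_PEAK_FLOPS, A100_BANDWIDTH)
    if weight_matmul_intensity(tokens_in_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"


print("prefill, 4000 tokens:", classify(4000))
print("decode, 1 token:", classify(1))
print("decode, batch of 32:", classify(32))
```

Under this simplified model a 4,000-token prefill sits far above the ridge point of about 156 FLOPs per byte, while decode stays below it even with 32 requests batched together.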
Prefill-decode disaggregation separates them entirely. Dedicated prefill nodes process incoming prompts. Dedicated decode nodes generate tokens. The KV cache is transferred between them after prefill completes.
DistServe (OSDI '24) reported up to 4.48x higher goodput (the request rate served within latency SLOs) than vLLM under tight latency SLOs by disaggregating prefill and decode. The cost: you need more hardware, and you need fast interconnect to transfer KV caches between nodes.
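To see why the interconnect matters, it helps to estimate how much KV cache one prompt produces. The sketch below assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, 16-bit values). The two link speeds are hypothetical round numbers, not measurements of any specific fabric.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence: a key and a value per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


def transfer_ms(n_bytes: int, link_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return n_bytes / (link_gb_per_s * 1e9) * 1e3


# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, fp16.
kv = kv_cache_bytes(tokens=4000, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"KV cache for 4,000 tokens: {kv / 1e9:.2f} GB")

for label, gbps in (("25 GB/s link", 25.0), ("300 GB/s link", 300.0)):
    print(f"{label}: {transfer_ms(kv, gbps):.1f} ms")
```

Under these assumptions the prompt yields about 2.1 GB of KV cache. That takes roughly 84 ms to move at 25 GB/s but only about 7 ms at 300 GB/s, so a slow link can add a noticeable fraction of the prefill time itself.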
For most self-hosted deployments, chunked prefill is the practical answer. For high-traffic production systems, disaggregation is the principled one.
If your system has high TTFT, the cause is almost certainly prefill. Check:
1. How long is your system prompt? Every token in it is prefilled on every request.
2. Are you batching requests together? A long-prompt request will stall all short ones behind it.
3. Have you enabled prefix caching? A static system prompt only needs to be prefilled once if the KV cache is retained.
4. Have you enabled chunked prefill? On older vLLM releases it is off by default; newer releases that use the V1 engine enable it by default, so check your version.
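The prefix-caching item in the checklist can be illustrated with a toy model of how a retained KV cache shrinks the prefill work. `ToyPrefixCache` is a hypothetical class written for this example and is not any engine's implementation: it tracks which block-aligned prefixes have been seen and reports how many tokens of a new prompt still need prefill.

```python
from typing import List, Sequence, Set


class ToyPrefixCache:
    """Tracks block-aligned token prefixes that already have KV entries."""

    def __init__(self, block_size: int = 16) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.block_size = block_size
        self._seen: Set[int] = set()

    def _block_keys(self, tokens: Sequence[int]) -> List[int]:
        # Chain hashes so each key identifies the whole prefix up to that block.
        keys, parent = [], 0
        bs = self.block_size
        for start in range(0, len(tokens) - bs + 1, bs):
            parent = hash((parent, tuple(tokens[start:start + bs])))
            keys.append(parent)
        return keys

    def tokens_to_prefill(self, tokens: Sequence[int]) -> int:
        """Return how many tokens need prefill, then record this prompt's blocks."""
        keys = self._block_keys(tokens)
        cached_blocks = 0
        for key in keys:
            if key not in self._seen:
                break
            cached_blocks += 1
        self._seen.update(keys)
        return len(tokens) - cached_blocks * self.block_size


# Assumed workload: a 1,000-token static system prompt plus short user messages.
system_prompt = list(range(1000))
user_a = [5000 + i for i in range(40)]
user_b = [6000 + i for i in range(30)]

cache = ToyPrefixCache(block_size=16)
print(cache.tokens_to_prefill(system_prompt + user_a))  # 1040: nothing cached yet
print(cache.tokens_to_prefill(system_prompt + user_b))  # 38: 62 full blocks reused
```

In this toy model the second request reuses 992 of the 1,000 system-prompt tokens. The last 8 fall in a block that also contains user tokens, so that block is recomputed.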
The model is rarely the bottleneck. The scheduler is.