Prefill Is the Stall

devinfo.dev — June 21, 2026

devinfo.dev:2026.0040

LLM inference has two phases. Most engineers focus on one.

Decode is what you see: tokens streaming out, one after another. It is memory-bandwidth-bound, sequential by nature, and well understood. Most inference optimization articles talk about decode.

Prefill is what happens first. You send a prompt, possibly thousands of tokens long, and the model processes all of them in a single forward pass before it outputs anything. Because the prompt tokens are processed in parallel, prefill is compute-bound, not memory-bandwidth-bound. It is fast per token, but nothing comes out until it finishes.
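A rough roofline check makes the compute-bound claim concrete. The sketch below is a back-of-envelope model, not a measurement. It assumes about 2 FLOPs per parameter per token, weights read from memory once per forward pass, and A100-class figures of roughly 312 TFLOP/s (dense BF16) and 2 TB/s of memory bandwidth. It ignores attention and KV cache traffic, and the function names are invented for illustration.

```python
# Assumed A100-class hardware figures (approximate, for illustration only).
PEAK_FLOPS = 312e12        # dense BF16 FLOP/s
MEM_BANDWIDTH = 2.0e12     # bytes/s of HBM bandwidth
RIDGE_POINT = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs per byte needed to be compute-bound


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that each weight is read
    once per pass, so the parameter count cancels out of the ratio.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound(tokens_per_pass: int) -> str:
    """Classify a forward pass as compute-bound or memory-bound."""
    if arithmetic_intensity(tokens_per_pass) > RIDGE_POINT:
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    print("prefill, 4000 tokens:", bound(4000))
    print("decode, 1 token:     ", bound(1))
```

Under these assumptions a 4,000-token prefill sits far above the ridge point and a single-token decode step sits far below it.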

The first token you receive is not fast. It is delayed by the entire prefill computation.

The Metric That Exposes It

TTFT — time to first token — is the interval between sending the prompt and receiving the first output token. It is dominated by prefill.
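Measuring TTFT needs nothing more than a timer around a streaming response. The sketch below is one way to do it, assuming the client exposes tokens as a Python iterable. `fake_stream` is a hypothetical stand-in for a real streaming client: it sleeps to imitate prefill, then yields tokens.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], List[str]]:
    """Return (seconds until the first token, all tokens) for a token stream."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            # The first token has arrived: everything before it was prefill and queueing.
            ttft = time.perf_counter() - start
        tokens.append(token)
    return ttft, tokens


def fake_stream(prefill_s: float, n_tokens: int, itl_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client (not a real API)."""
    time.sleep(prefill_s)          # simulated prefill delay
    for i in range(n_tokens):
        if i:
            time.sleep(itl_s)      # simulated inter-token latency
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, tokens = measure_ttft(fake_stream(prefill_s=0.05, n_tokens=5, itl_s=0.005))
    print(f"TTFT: {ttft * 1000:.1f} ms over {len(tokens)} tokens")
```

Start the timer when the request is sent, not when the connection opens, or queueing time disappears from the number.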

On a long system prompt plus user message (say, 4,000 tokens), TTFT on a single A100 can run 200–500 ms, depending on model size and precision. The model is not slow. It is busy. It is running a full forward pass over every one of those tokens before it can say anything.
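The range above can be sanity-checked with simple arithmetic. This is a minimal sketch under stated assumptions: a hypothetical 7B-parameter model (the article does not name one), about 2 FLOPs per parameter per token, an A100 peak of roughly 312 TFLOP/s in dense BF16, and 50% achieved utilization. It ignores the attention term, which grows with prompt length.

```python
def estimate_prefill_seconds(
    params: float,
    prompt_tokens: int,
    peak_flops: float,
    utilization: float,
) -> float:
    """Estimate prefill time as total FLOPs divided by achieved FLOP/s."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    total_flops = 2.0 * params * prompt_tokens   # ~2 FLOPs per parameter per token
    return total_flops / (peak_flops * utilization)


if __name__ == "__main__":
    seconds = estimate_prefill_seconds(
        params=7e9,            # assumed model size (hypothetical)
        prompt_tokens=4000,
        peak_flops=312e12,     # assumed A100 dense BF16 peak
        utilization=0.5,       # assumed achieved fraction of peak
    )
    print(f"estimated prefill: {seconds * 1000:.0f} ms")
```

With these inputs the estimate comes out near 360 ms, inside the quoted range. A larger model or lower utilization pushes it up proportionally.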

The problem compounds under load. If one request is prefilling a long prompt, shorter requests queue behind it and wait — even though they could have started decoding milliseconds ago. This is called head-of-line blocking, and it is the first thing that breaks a responsive inference system at scale.
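A toy first-come-first-served model shows how bad this gets. The sketch assumes one prefill runs at a time and that prefill cost is linear in prompt length at 0.1 ms per token. Both are simplifications, and the `Request` type and numbers are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    name: str
    arrival_ms: float
    prompt_tokens: int


def fifo_ttft(requests: List[Request], ms_per_token: float = 0.1) -> Dict[str, float]:
    """TTFT per request when prefills run one at a time in arrival order."""
    clock = 0.0
    ttft = {}
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        start = max(clock, req.arrival_ms)               # wait for the GPU to free up
        clock = start + req.prompt_tokens * ms_per_token  # prefill runs to completion
        ttft[req.name] = clock - req.arrival_ms
    return ttft


if __name__ == "__main__":
    shared = fifo_ttft([Request("long", 0.0, 4000), Request("short", 1.0, 50)])
    alone = fifo_ttft([Request("short", 1.0, 50)])
    print("short behind long:", round(shared["short"], 1), "ms")
    print("short alone:      ", round(alone["short"], 1), "ms")
```

In this model the 50-token request needs about 5 ms of prefill on its own but waits roughly 400 ms when it arrives one millisecond after the long one.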

What Chunked Prefill Does

The standard scheduler processes prefill in full before moving to decode. Chunked prefill breaks that assumption.

Instead of processing a 4,000-token prompt in one shot, the scheduler splits it into chunks — say, 512 tokens per iteration — and interleaves those chunks with decode steps from other requests. The prefill still takes the same total compute. But it no longer monopolizes the GPU for the full duration.
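The splitting itself is simple. The sketch below plans the iterations for one long prompt, assuming a decode-first policy: each iteration reserves one token slot per active decode request and gives the rest of the token budget to the next prefill chunk. The function name and the fixed decode count are illustrative simplifications, not any scheduler's actual code.

```python
from typing import List, Tuple


def plan_chunked_prefill(
    prompt_tokens: int,
    token_budget: int,
    active_decodes: int,
) -> List[Tuple[int, int]]:
    """Return (prefill_tokens, decode_tokens) for each scheduler iteration."""
    if token_budget <= active_decodes:
        raise ValueError("token budget leaves no room for prefill")
    plan = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are scheduled first; prefill fills whatever budget is left.
        chunk = min(remaining, token_budget - active_decodes)
        plan.append((chunk, active_decodes))
        remaining -= chunk
    return plan


if __name__ == "__main__":
    plan = plan_chunked_prefill(prompt_tokens=4000, token_budget=512, active_decodes=4)
    for i, (prefill, decode) in enumerate(plan):
        print(f"iteration {i}: {prefill} prefill tokens + {decode} decode tokens")
```

With a 512-token budget and four active decodes, the 4,000-token prompt takes eight iterations, and every one of them also advances the four decoding requests by a token.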

The reported effect: inter-token latency (ITL) improves by 10–20% at moderate load and by nearly 2x at high QPS, because decode requests are no longer fully blocked behind prefill. Throughput can also improve, by up to 50% in some production deployments, because the GPU utilization profile flattens out. Treat these as workload-dependent figures, not guarantees.

The tradeoff: TTFT for the chunked request itself increases slightly, because the prefill is now spread across more iterations. You are trading one request's first-token latency for better latency across all other concurrent requests. At scale, this is almost always the right trade.
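The trade can be put in numbers with a toy cost model. The sketch assumes each scheduler iteration costs a fixed overhead plus a per-token cost. The 5 ms overhead and 0.1 ms per token are made-up values, and the cost of the decode tokens sharing each iteration is ignored.

```python
import math
from typing import Dict


def compare_chunking(
    prompt_tokens: int,
    chunk_size: int,
    overhead_ms: float,
    ms_per_token: float,
) -> Dict[str, float]:
    """Compare the long request's TTFT and the worst decode stall, with and without chunking."""
    compute_ms = ms_per_token * prompt_tokens
    n_chunks = math.ceil(prompt_tokens / chunk_size)
    return {
        # Unchunked: one big iteration; concurrent decodes wait for all of it.
        "unchunked_ttft_ms": overhead_ms + compute_ms,
        "unchunked_decode_stall_ms": overhead_ms + compute_ms,
        # Chunked: same compute, but the fixed overhead is paid once per chunk.
        "chunked_ttft_ms": n_chunks * overhead_ms + compute_ms,
        # Concurrent decodes wait for at most one chunk-sized iteration.
        "chunked_decode_stall_ms": overhead_ms + ms_per_token * min(chunk_size, prompt_tokens),
    }


if __name__ == "__main__":
    for key, value in compare_chunking(4000, 512, overhead_ms=5.0, ms_per_token=0.1).items():
        print(f"{key}: {value:.1f}")
```

Under these assumptions the long request's TTFT rises from 405 ms to 440 ms, while the worst stall seen by a concurrent decode falls from 405 ms to about 56 ms.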

vLLM exposes this as --enable-chunked-prefill, with --max-num-batched-tokens setting the per-iteration token budget that bounds the chunk size. NVIDIA TensorRT-LLM offers dynamic chunk sizing, which picks the chunk size based on GPU utilization metrics.
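As a small illustration, the helper below assembles a `vllm serve` command line using the two flags named above. It only builds and prints the command, it does not launch a server. The helper name is invented, the model name is a placeholder, and 512 is the example chunk size from earlier, not a recommendation. Check the flags against the documentation for your installed vLLM version.

```python
import shlex
from typing import List


def chunked_prefill_command(model: str, max_batched_tokens: int) -> List[str]:
    """Build an argv list for `vllm serve` with chunked prefill enabled."""
    if max_batched_tokens <= 0:
        raise ValueError("max_batched_tokens must be positive")
    return [
        "vllm", "serve", model,
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", str(max_batched_tokens),
    ]


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder, not a real model name.
    print(shlex.join(chunked_prefill_command("your-org/your-model", 512)))
```

A smaller budget protects inter-token latency for concurrent requests. A larger one gets long prompts through prefill in fewer iterations.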

The Deeper Problem: Disaggregation

Chunked prefill mitigates head-of-line blocking. It does not eliminate the fundamental tension: prefill is compute-bound, decode is memory-bandwidth-bound. They are different workloads running on the same hardware.

Prefill-decode disaggregation separates them entirely. Dedicated prefill nodes process incoming prompts. Dedicated decode nodes generate tokens. The KV cache is transferred between them after prefill completes.

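DistServe (OSDI '24) reported large goodput gains over vLLM under tight latency SLOs by disaggregating prefill and decode, where goodput means requests per second served within the SLO. The arXiv preprint cites 4.48x more requests served; the published OSDI version cites up to 7.4x. The cost: you need more hardware, and you need a fast interconnect to transfer KV caches between nodes.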
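The interconnect requirement follows from the size of the KV cache. The sketch below estimates it for an assumed model shape: 32 layers, 32 KV heads, head dimension 128, and 16-bit values. These are typical of a 7B-class model without grouped-query attention and were chosen only for illustration. The link speeds are also assumptions.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Bytes of KV cache for one sequence: a key and a value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_ms(num_bytes: int, link_gb_per_s: float) -> float:
    """Time to move num_bytes over a link of the given bandwidth, ignoring latency."""
    return num_bytes / (link_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    size = kv_cache_bytes(tokens=4000, layers=32, kv_heads=32, head_dim=128)
    print(f"KV cache for 4000 tokens: {size / 1e9:.2f} GB")
    for name, gb_per_s in [("25 GB/s link", 25.0), ("300 GB/s link", 300.0)]:
        print(f"{name}: {transfer_ms(size, gb_per_s):.1f} ms")
```

For this shape a 4,000-token prompt produces about 2.1 GB of KV cache. On the slower assumed link, moving it takes on the order of 80 ms, which is a large slice of the TTFT budget. Grouped-query attention or a quantized cache shrinks the figure proportionally.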

For most self-hosted deployments, chunked prefill is the practical answer. For high-traffic production systems, disaggregation is the principled one.

What This Means in Practice

If your system has high TTFT, the cause is almost certainly prefill. Check:

1. How long is your system prompt? Every token in it is prefilled on every request.

2. Are long and short prompts sharing the same queue? Without chunking, a long-prompt request will stall the short ones behind it.

3. Have you enabled prefix caching? A static system prompt only needs to be prefilled once if the KV cache is retained.

4. Have you enabled chunked prefill? On older vLLM releases it is off by default; recent releases built on the V1 engine enable it by default, so check your version.
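Item 3 in the checklist is often the cheapest win. The sketch below shows the idea behind prefix caching in its simplest form: find the longest cached token prefix that matches the new prompt and prefill only what follows it. Real servers hash fixed-size blocks of tokens instead of storing whole prefixes, so treat this as an illustration of the bookkeeping, not of any library's implementation.

```python
from typing import Iterable, Sequence, Tuple


def uncached_tokens(prompt: Sequence[int], cached_prefixes: Iterable[Tuple[int, ...]]) -> int:
    """Number of prompt tokens that still need prefill, given cached prefixes."""
    best = 0
    for prefix in cached_prefixes:
        n = len(prefix)
        # A cached prefix helps only if the prompt starts with exactly those tokens.
        if n > best and tuple(prompt[:n]) == prefix:
            best = n
    return len(prompt) - best


if __name__ == "__main__":
    system_prompt = list(range(3000))          # stand-in token IDs for a static system prompt
    user_message = list(range(9000, 9040))     # stand-in token IDs for a 40-token user message
    prompt = system_prompt + user_message

    cold = uncached_tokens(prompt, [])
    warm = uncached_tokens(prompt, [tuple(system_prompt)])
    print(f"cold cache: {cold} tokens to prefill")
    print(f"warm cache: {warm} tokens to prefill")
```

With the system prompt cached, the request prefills 40 tokens instead of 3,040. The match must be exact from the first token, so anything variable (timestamps, user names) belongs after the static part of the prompt.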

The model is rarely the bottleneck. The scheduler is.

References

  • Agrawal, A. et al. (2024). Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2403.02310. https://arxiv.org/abs/2403.02310
  • Zhong, Y. et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. OSDI '24. https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf
  • vLLM Project. Disaggregated Prefilling (experimental). vLLM Documentation. https://docs.vllm.ai/en/v0.10.1.1/features/disagg_prefill.html
  • vLLM Project. Performance and Tuning — Chunked Prefill. vLLM Documentation. https://docs.vllm.ai/en/v0.4.2/models/performance.html
  • NVIDIA. (2024). Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill. NVIDIA Technical Blog. https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/
  • TNG Technology Consulting. (2025). Prefill and Decode for Concurrent Requests — Optimizing LLM Performance. Hugging Face Blog. https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests

Cite as

devinfo.dev. (2026). "Prefill Is the Stall." devinfo.dev:2026.0040. https://devinfo.dev/d/2026.0040

Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.