Sampling Is a Filter

devinfo.dev — June 18, 2026

devinfo.dev:2026.0037

#inference #sampling #llm-engineering #decoding

Every token an LLM generates is drawn from a probability distribution over the entire vocabulary. Before that draw happens, almost every production system applies a filter to that distribution. Most engineers set these filters by cargo-culting defaults. That is a mistake.

There are three filters in common use. They are not interchangeable.

Top-k

Top-k keeps the k highest-probability tokens and discards everything else. Simple. Predictable. And structurally broken for one reason: the value of k is fixed, but the shape of the distribution is not.

When the model is confident — one token dominates with 0.95 probability — top-k=50 floods the candidate pool with 49 near-zero noise tokens. When the model is uncertain — 200 tokens each carry ~0.005 probability — top-k=50 discards 150 plausible candidates. The filter does not adapt. Fan et al. (2018) introduced top-k as a truncation heuristic. It was never claimed to be optimal.
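
The two cases above can be reproduced with a minimal sketch of a top-k filter. This is plain Python for illustration, not any library's implementation; the function name top_k_filter is hypothetical, and the confident case assumes the remaining 0.05 of mass is spread evenly over 199 tail tokens.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; all others get 0."""
    # Indices of the k largest probabilities (stable sort: ties keep index order).
    keep = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered, kept_mass

# Confident: one token at 0.95, the rest spread evenly (assumed tail shape).
confident = [0.95] + [0.05 / 199] * 199
# Uncertain: 200 tokens at 0.005 each.
uncertain = [0.005] * 200

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    filtered, kept_mass = top_k_filter(probs, k=50)
    pool = sum(1 for q in filtered if q > 0)
    print(f"{name}: pool={pool}, mass kept={kept_mass:.3f}")
```

Both cases end with a pool of 50. In the confident case 49 of those tokens are tail noise; in the uncertain case the pool holds only a quarter of the probability mass.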

Top-p (Nucleus Sampling)

Top-p fixes the shape problem. Instead of a fixed count, it keeps the smallest set of tokens whose cumulative probability reaches or exceeds p. When the model is confident, the nucleus is small. When the model is uncertain, the nucleus expands.
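
A minimal sketch of that rule, in plain Python with hypothetical names (top_p_filter, pool_size). It stops as soon as the running mass reaches p, which is one common reading of the definition; the example distributions are illustrative assumptions, not measurements.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # the nucleus now holds at least mass p
            break
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / cumulative  # renormalize over the nucleus
    return filtered

def pool_size(filtered):
    """Number of tokens that survive the filter."""
    return sum(1 for q in filtered if q > 0)

confident = [0.95] + [0.05 / 199] * 199  # one dominant token, assumed even tail
uncertain = [0.005] * 200                # 200 equally plausible tokens

print("confident nucleus:", pool_size(top_p_filter(confident, p=0.9)))
print("uncertain nucleus:", pool_size(top_p_filter(uncertain, p=0.9)))
```

With p=0.9 the confident distribution yields a nucleus of one token, while the flat distribution needs roughly 180 tokens to reach the same mass. The exact count at the boundary depends on floating-point rounding.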

Holtzman et al. (2020) introduced nucleus sampling specifically because top-k's fixed pool size was causing degenerate outputs — either incoherent text from noise tokens, or repetitive text from over-truncation. Top-p adapts to distribution entropy. That is its strength.

Its weakness: at high temperatures, the distribution flattens. A flattened distribution with p=0.9 still includes a large number of tokens, including low-probability ones that entered the nucleus only because the warm head of the distribution no longer covers 90% of the mass on its own. Top-p is relative to cumulative mass, not to the top token's confidence.
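
The effect is easy to see on a toy example. The sketch below assumes temperature is applied to the logits before the top-p cut, and uses made-up logits (one strong token, three decent ones, a long weak tail of 996); both are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Size of the smallest top-ranked set whose cumulative mass reaches p."""
    cumulative = 0.0
    for n, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return n
    return len(probs)

# Hypothetical logits, not taken from any real model.
logits = [8.0, 6.0, 5.5, 5.0] + [-2.0] * 996

for t in (0.7, 1.0, 1.5, 2.0):
    print(f"T={t}: nucleus size = {nucleus_size(softmax(logits, t), p=0.9)}")
```

On these logits the nucleus at p=0.9 is one token at T=0.7 and three tokens at T=1.0, then jumps to several hundred once the tail's combined mass becomes significant at T=1.5 and above. The value of p never changed; only the shape of the distribution did.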

Min-p

Min-p addresses this. Instead of a cumulative mass threshold, min-p sets a floor: any token whose probability is below (min_p × P_max) is discarded. The threshold scales with the model's confidence in its top prediction.

When the model is very confident (P_max is high), the floor is high and the candidate pool is tight. When the model is uncertain (P_max is low), the floor drops and the pool widens. The filter is proportional to the model's own signal.
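
As a rough sketch of the rule just described (plain Python, hypothetical function name min_p_filter, illustrative distributions with an assumed even tail):

```python
def min_p_filter(probs, min_p):
    """Drop tokens below min_p * max(probs), then renormalize the survivors."""
    floor = min_p * max(probs)  # the cutoff scales with the top token's probability
    kept = [q if q >= floor else 0.0 for q in probs]
    total = sum(kept)
    return [q / total for q in kept], floor

confident = [0.95] + [0.05 / 199] * 199  # P_max = 0.95
uncertain = [0.005] * 200                # P_max = 0.005

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    filtered, floor = min_p_filter(probs, min_p=0.1)
    pool = sum(1 for q in filtered if q > 0)
    print(f"{name}: floor={floor:.4f}, pool={pool}")
# confident: floor=0.0950, pool=1
# uncertain: floor=0.0005, pool=200
```

The same min_p=0.1 produces a floor of 0.095 in the confident case and 0.0005 in the uncertain one. Unlike top-k, the pool goes from 1 token to all 200 without any retuning.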

Nguyen et al. (2024) report that min-p outperforms top-p and top-k on both quality and diversity benchmarks, particularly at high temperatures. It is now available natively in Hugging Face Transformers and vLLM, exposed in both as the min_p sampling parameter.

The Failure Modes Are Different

Top-k fails when distribution entropy does not match your fixed k. It over-includes at low entropy, over-excludes at high entropy.
Top-p fails at high temperatures. A warm distribution makes the nucleus too large. You get diversity, but coherence degrades.
Min-p fails if min_p is set too high on genuinely uncertain distributions — you discard valid candidates because you calibrated for a more confident model.
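
These failure modes can be put side by side by counting how many tokens each filter keeps. The sketch below is illustrative only: pool_sizes is a hypothetical helper, and both distributions are invented (a peaked one, and a 31-token one with a mild leader at 0.10 followed by 30 plausible tokens at 0.03 each).

```python
def pool_sizes(probs, k, p, min_p):
    """Count the tokens each filter would keep for one distribution."""
    ranked = sorted(probs, reverse=True)
    # top-p: smallest prefix whose cumulative mass reaches p
    cumulative, top_p_n = 0.0, len(ranked)
    for n, q in enumerate(ranked, start=1):
        cumulative += q
        if cumulative >= p:
            top_p_n = n
            break
    floor = min_p * ranked[0]  # min-p cutoff relative to the top token
    return {
        "top_k": min(k, len(ranked)),
        "top_p": top_p_n,
        "min_p": sum(1 for q in ranked if q >= floor),
    }

peaked = [0.95] + [0.05 / 199] * 199  # low entropy
mild_leader = [0.10] + [0.03] * 30    # high entropy, many plausible tokens

print(pool_sizes(peaked, k=10, p=0.9, min_p=0.1))
print(pool_sizes(mild_leader, k=10, p=0.9, min_p=0.1))
print(pool_sizes(mild_leader, k=10, p=0.9, min_p=0.5))
```

On the peaked distribution top-k=10 admits nine noise tokens, where top-p and min-p keep one. On the high-entropy distribution top-k=10 drops 21 of the 31 plausible tokens. Raising min_p from 0.1 to 0.5 there collapses the min-p pool from 31 tokens to the single leader, which is the miscalibration described above.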

Each filter embeds an assumption about the distribution's shape. Applying the wrong filter to the wrong distribution is not a style choice. It produces structurally different outputs — and those outputs fail in structurally different ways.

What This Means in Practice

If you are running inference at high temperature for creative tasks, top-p is the wrong filter. The nucleus expands with the temperature, and you lose the coherence benefit you were trying to preserve. Min-p is the right tool: its cutoff stays tied to the top token's probability at whatever temperature you run, so the long tail is cut no matter how much cumulative mass it has picked up.
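
A short derivation makes the temperature behavior concrete. It assumes temperature T > 0 is applied to the logits z_i before the filter, and writes m for min_p; this notation is introduced here for illustration and does not come from the cited papers.

```latex
p_i(T) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}},
\qquad
\frac{p_i(T)}{p_{\max}(T)} = e^{(z_i - z_{\max})/T}

p_i(T) \ge m \, p_{\max}(T)
\iff
z_i \ge z_{\max} - T \ln\frac{1}{m}
```

Under that assumption, min-p keeps exactly the tokens whose logit lies within T ln(1/m) of the top logit. The window widens linearly with T, but it is always measured from the top token and does not depend on how many tail tokens exist. The top-p nucleus depends on the normalizing sum, which a long tail can dominate once the distribution is warm.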

If you are running inference at low temperature for factual tasks, the differences narrow. The distribution is already tight. Top-p and min-p will produce similar candidate pools, and whatever extra tokens top-k admits carry almost no probability mass.

The decision is not about creativity vs. precision. It is about what assumption the filter makes about the distribution — and whether that assumption matches your model's actual behavior at your chosen temperature.

Know what you are applying. Know why.

References

Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical Neural Story Generation. ACL 2018. https://arxiv.org/abs/1805.04833
Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020. https://arxiv.org/abs/1904.09751
Nguyen, M., et al. (2024). Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. arXiv:2407.01082. https://arxiv.org/abs/2407.01082
Huyen, C. (2024). Generation configurations: temperature, top-k, top-p, and test time compute. huyenchip.com. https://huyenchip.com/2024/01/16/sampling.html

Cite as

devinfo.dev. (2026). "Sampling Is a Filter." devinfo.dev:2026.0037. https://devinfo.dev/d/2026.0037

devinfo.dev | https://devinfo.dev/d/2026.0037
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
https://devinfo.dev
