Sampling Is a Filter

inspiration | devinfo.dev | June 18, 2026 | devinfo.dev:2026.0037

Top-k, top-p, and min-p are not interchangeable dials. Each one cuts the probability distribution at a different seam — and each has a failure mode that the others do not. Knowing which filter you are applying is a prerequisite to reasoning about your outputs.

Every token an LLM generates is drawn from a probability distribution over the entire vocabulary. Before that draw happens, almost every production system applies a filter to that distribution. Most engineers set these filters by cargo-culting defaults. That is a mistake.

There are three filters in common use. They are not interchangeable.

Top-k

Top-k keeps the k highest-probability tokens and discards everything else. Simple. Predictable. And structurally broken for one reason: the value of k is fixed, but the shape of the distribution is not.

When the model is confident, with one token at 0.95 probability, top-k=50 pads the candidate pool with 49 near-zero noise tokens. When the model is uncertain, with 200 tokens at roughly 0.005 each, top-k=50 discards 150 plausible candidates that together hold about 75% of the probability mass. The filter does not adapt. Fan et al. (2018) introduced top-k as a truncation heuristic. It was never claimed to be optimal.
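
The two cases above can be checked with a minimal sketch in plain Python. The 1,000-token vocabulary and the even spread of the leftover 0.05 mass are assumptions made for illustration, and top_k_filter is a hypothetical helper, not a library function.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, zero the rest, renormalize."""
    # Indices of the k largest probabilities (ties keep their original order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# Confident case (assumed shape): one token at 0.95, the remaining 0.05
# spread evenly over 999 other tokens.
confident = [0.95] + [0.05 / 999] * 999
# Uncertain case: 200 tokens at 0.005 each.
uncertain = [0.005] * 200

confident_kept = top_k_filter(confident, k=50)
uncertain_kept = top_k_filter(uncertain, k=50)

# The pool size is the same in both cases, whatever the shape.
confident_pool = sum(1 for p in confident_kept if p > 0)
uncertain_pool = sum(1 for p in uncertain_kept if p > 0)

# Original mass thrown away in the uncertain case: 150 tokens * 0.005 = 0.75.
uncertain_mass_dropped = 1.0 - sum(sorted(uncertain, reverse=True)[:50])

print(confident_pool, uncertain_pool, round(uncertain_mass_dropped, 3))
```

Both pools come out at exactly k tokens: the filter never looks at how the mass is distributed.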

Top-p (Nucleus Sampling)

Top-p fixes the shape problem. Instead of a fixed count, it keeps the smallest set of highest-probability tokens whose cumulative probability reaches or exceeds p. When the model is confident, the nucleus is small. When the model is uncertain, the nucleus expands.

Holtzman et al. (2020) introduced nucleus sampling specifically because top-k's fixed pool size was causing degenerate outputs — either incoherent text from noise tokens, or repetitive text from over-truncation. Top-p adapts to distribution entropy. That is its strength.

Its weakness: at high temperatures, the distribution flattens. A flattened distribution needs many more tokens to accumulate 0.9 of the mass, so with p=0.9 the nucleus admits low-probability tokens that would have been cut at a lower temperature. The top-p cutoff depends on cumulative mass, not on the top token's confidence.
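
A rough sketch of that effect, assuming temperature is applied to the logits before the filter runs. The logits are hypothetical (three plausible tokens and 997 noise tokens), and softmax and top_p_filter are illustrative helpers rather than any library's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

# Hypothetical logits: 3 plausible tokens followed by 997 noise tokens.
logits = [10.0, 8.0, 7.0] + [0.0] * 997

nucleus_sizes = {}
for temperature in (0.7, 1.0, 2.0):
    probs = softmax(logits, temperature)
    nucleus = top_p_filter(probs, p=0.9)
    nucleus_sizes[temperature] = sum(1 for q in nucleus if q > 0)

print(nucleus_sizes)
```

With these assumed logits, the nucleus stays at one or two tokens up to temperature 1.0 and grows to most of the vocabulary at temperature 2.0, even though p never changed.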

Min-p

Min-p addresses this. Instead of a cumulative mass threshold, min-p sets a floor: any token whose probability is below (min_p × P_max) is discarded. The threshold scales with the model's confidence in its top prediction.

When the model is very confident (P_max is high), the floor is high and the candidate pool is tight. When the model is uncertain (P_max is low), the floor drops and the pool widens. The filter is proportional to the model's own signal.
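
The floor rule can be sketched in a few lines. This is a minimal illustration under assumed inputs: min_p_filter is a hypothetical helper, min_p=0.1 is an arbitrary setting, and the two toy distributions reuse the confident and uncertain shapes from the top-k discussion.

```python
def min_p_filter(probs, min_p):
    """Discard tokens whose probability is below min_p * P_max, then renormalize."""
    floor = min_p * max(probs)
    filtered = [q if q >= floor else 0.0 for q in probs]
    total = sum(filtered)
    return [q / total for q in filtered]

# Confident case (assumed shape): P_max = 0.95, so the floor is 0.095.
confident = [0.95] + [0.05 / 999] * 999
# Uncertain case: P_max = 0.005, so the floor drops to 0.0005.
uncertain = [0.005] * 200

confident_pool = sum(1 for q in min_p_filter(confident, min_p=0.1) if q > 0)
uncertain_pool = sum(1 for q in min_p_filter(uncertain, min_p=0.1) if q > 0)

print(confident_pool, uncertain_pool)
```

The same min_p value yields a pool of one token in the confident case and keeps all 200 candidates in the uncertain case.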

Nguyen et al. (2024) reported that min-p outperforms top-p and top-k on both quality and diversity benchmarks, particularly at high temperatures. It is now available natively in Hugging Face Transformers and vLLM, exposed in both as the min_p sampling parameter.

The Failure Modes Are Different

Each filter embeds an assumption about the distribution's shape. Applying the wrong filter to the wrong distribution is not a style choice. It produces structurally different outputs — and those outputs fail in structurally different ways.

What This Means in Practice

If you are running inference at high temperature for creative tasks, top-p is the wrong filter. The nucleus expands with the temperature, and you lose the coherence benefit you were trying to preserve. Min-p is the right tool: it keeps the threshold proportional to the model's confidence regardless of temperature.

If you are running inference at low temperature for factual tasks, the differences narrow. The distribution is already tight. Top-p, min-p, and even top-k will produce similar candidate pools.
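
The contrast between the high-temperature and low-temperature cases can be sketched side by side. Everything here is an assumption for illustration: the logits (three plausible tokens, 997 noise tokens), the settings k=50, p=0.9, and min_p=0.1, and the choice to apply temperature before filtering. The measure is the share of renormalized mass that lands on the noise tokens.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kept_counts(probs, k=50, top_p=0.9, min_p=0.1):
    """How many tokens each filter keeps for one distribution."""
    ranked = sorted(probs, reverse=True)
    cumulative, nucleus = 0.0, 0
    for q in ranked:
        cumulative += q
        nucleus += 1
        if cumulative >= top_p:
            break
    floor = min_p * ranked[0]
    return {
        "top_k": min(k, len(ranked)),
        "top_p": nucleus,
        "min_p": sum(1 for q in ranked if q >= floor),
    }

def tail_share(probs, kept, head=3):
    """Share of the renormalized mass outside the `head` strongest tokens."""
    ranked = sorted(probs, reverse=True)[:kept]
    return sum(ranked[head:]) / sum(ranked)

# Hypothetical logits: 3 plausible tokens followed by 997 noise tokens.
logits = [10.0, 8.0, 7.0] + [0.0] * 997

report = {}
for temperature in (0.5, 2.0):
    probs = softmax(logits, temperature)
    counts = kept_counts(probs)
    report[temperature] = {name: tail_share(probs, n) for name, n in counts.items()}
    print(temperature, counts, report[temperature])
```

Under these assumptions, all three filters leave a negligible share on the noise tokens at temperature 0.5. At temperature 2.0, top-p hands most of the mass to the noise tokens, top-k hands over a smaller but still visible share, and min-p keeps none of it.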

The decision is not about creativity vs. precision. It is about what assumption the filter makes about the distribution — and whether that assumption matches your model's actual behavior at your chosen temperature.

Know what you are applying. Know why.

References