Distillation Is Not Compression
Quantization and distillation are both called "model compression." That label obscures what they actually do.
Quantization modifies an existing model. You take its weights, stored as 32-bit or 16-bit floats, and represent them at lower precision: INT8, INT4, sometimes lower. The architecture does not change. The parameter count does not change. The operation is lossy: rounding discards precision that dequantizing cannot restore, although you can always re-quantize from the original weights if you kept them. Post-training quantization requires no training run and, at most, a small calibration set. It takes minutes to hours.
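As a rough illustration of what "represent them at lower precision" means, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The function names, the toy weight list, and the choice of a single scale per tensor are assumptions made for this example; production toolchains typically use per-channel scales, calibration data, and packed integer storage.

```python
import random

def quantize_int8(weights):
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

random.seed(0)
# Hypothetical toy "weight tensor"; a real model has millions of these.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("parameter count unchanged:", len(codes) == len(weights))
print("max round-trip error:", max_error)
print("half a quantization step:", scale / 2)
```

The parameter count is identical before and after; only the representation of each weight changes. The round-trip error is bounded by half a quantization step but is not zero, which is the precision the model gives up.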
Distillation trains a new model. A small model — the student — is trained to reproduce the output distribution of a large model — the teacher. Not just the final prediction: the full probability distribution across tokens, including the near-misses. That distribution encodes more than a hard label ever could. It encodes the teacher's uncertainty, its ranked alternatives, its implicit knowledge of what was almost right.
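To make the difference between a hard label and a full distribution concrete, the sketch below scores two hypothetical students against the same teacher. The three-class probabilities are invented for illustration, and KL divergence is used here as one common choice of distillation objective, not necessarily the exact loss of any particular system.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; zero-probability targets are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def kl_divergence(target, predicted):
    """KL(target || predicted) in nats."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

# Hypothetical three-class example, ordered [cat, dog, horse].
hard_label = [1.0, 0.0, 0.0]
teacher = [0.70, 0.25, 0.05]

student_a = [0.70, 0.25, 0.05]  # agrees with the teacher on the near-misses
student_b = [0.70, 0.05, 0.25]  # same top answer, near-misses swapped

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = kl_divergence(teacher, student_a)
soft_b = kl_divergence(teacher, student_b)

print("hard-label loss, A vs B:", hard_a, hard_b)
print("soft-target loss, A vs B:", soft_a, soft_b)
```

Against the hard label the two students are indistinguishable, because only the probability assigned to "cat" matters. Against the teacher's distribution, student B is penalized for ranking "horse" above "dog". That extra signal is what the student trains on.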
The mechanism was formalized in 2015 by Hinton, Vinyals, and Dean. The key insight: a softmax output at high temperature reveals structure that a one-hot label hides. At temperature T=1, the model says "cat: 97%, dog: 2%, horse: 1%." At T=5, the same logits give roughly "cat: 54%, dog: 25%, horse: 22%" (rounded). The second distribution is more useful for training a student because the relative sizes of the wrong answers become visible, and those encode similarity structure. The student learns from that structure: not just from what was right, but from how wrong the wrong answers were.
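The effect of temperature can be checked directly. The sketch below is a plain-Python softmax with a temperature parameter; the logits are an assumption of this example, chosen so that T=1 reproduces the 97/2/1 split, and the same logits are then re-evaluated at T=5.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher T flattens the output."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed logits for [cat, dog, horse], picked so that T=1 yields 97% / 2% / 1%.
logits = [math.log(0.97), math.log(0.02), math.log(0.01)]

p1 = softmax(logits, temperature=1.0)
p5 = softmax(logits, temperature=5.0)

for name, a, b in zip(["cat", "dog", "horse"], p1, p5):
    print(f"{name:>5}: T=1 {a:.3f}   T=5 {b:.3f}")
```

Raising the temperature never changes the ranking of the classes; it only moves probability mass from the top answer to the alternatives, so the gap between "dog" and "horse" becomes large enough for a student to learn from.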
DistilBERT (Sanh et al., 2019) showed this at scale: 40% fewer parameters, 60% faster inference, 97% of BERT's language understanding performance on GLUE. Not a compressed BERT — a new, smaller model that learned from BERT's behavior.
The tradeoff is cost. Distillation requires a training run. You need data, GPU hours, and a teacher model, either serving inference throughout training or run once in advance to precompute its outputs. Post-training quantization requires none of that beyond, at most, a small calibration set. It is fast, cheap, and applies to any model you already have.
In production you often do both: distill first to a smaller architecture, then quantize for deployment. The distilled model sets the capability ceiling. Quantization then lowers the serving cost; it may give up a little accuracy, but it cannot raise capability above that ceiling.
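A back-of-the-envelope calculation shows why the two steps stack. The sketch below estimates weight storage only, using approximate published parameter counts for BERT-base (about 110 million) and DistilBERT (about 66 million). It ignores activations, quantization metadata such as scales, and runtime overhead, so treat the numbers as rough.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_megabytes(num_params, precision):
    """Approximate weight storage in MB (10^6 bytes), ignoring any metadata."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6

teacher_params = 110_000_000  # approximate BERT-base parameter count
student_params = 66_000_000   # approximate DistilBERT parameter count

teacher_fp32 = weight_megabytes(teacher_params, "fp32")
student_fp32 = weight_megabytes(student_params, "fp32")
student_int8 = weight_megabytes(student_params, "int8")

print("teacher, fp32:", teacher_fp32, "MB")
print("student, fp32:", student_fp32, "MB")
print("student, int8:", student_int8, "MB")
```

Distillation changes the parameter count; quantization changes the bytes per parameter. The two factors multiply, which is why the combined pipeline shrinks storage further than either step alone.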
The engineering decision is not "which compression method." It is a question about what resource you are optimizing against:
- Time and training budget unavailable? Quantize.
- Latency target that quantization alone cannot reach? Distill.
- Training budget available and serving cost still too high? Distill to size, then quantize to deployment precision.
Treating distillation as compression gets the mental model wrong from the start. Compression reduces what you already have. Distillation builds something new that has been trained to know what the large model knows.
Those are not the same operation.
References
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531. https://arxiv.org/abs/1503.02531
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108. https://arxiv.org/abs/1910.01108
- Gu, Y., et al. (2024). MiniLLM: Knowledge Distillation of Large Language Models. ICLR 2024. https://arxiv.org/abs/2306.08543
- Xu, X., et al. (2024). A Survey on Knowledge Distillation of Large Language Models. arXiv:2402.13116. https://arxiv.org/abs/2402.13116
- Tian, P. (2026). The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features. https://tianpan.co/blog/2026-04-17-model-compression-quantization-distillation-on-device
Cite as
devinfo.dev. (2026). "Distillation Is Not Compression." devinfo.dev:2026.0044. https://devinfo.dev/d/2026.0044
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.