Quantization is often treated as a deployment detail. Download the Q4_K_M GGUF, run it, move on.
That framing is wrong.
When you quantize a model, you are making a precision tradeoff. You are saying: I accept some loss in representational accuracy in exchange for lower memory and faster inference. That tradeoff has consequences — for output quality, for task suitability, for hardware fit.
Those consequences belong in your design decisions, not buried in a filename.
What the quantization level actually controls
An unquantized (F16 or BF16) model stores each weight as a 16-bit float. A Q4 model stores weights in roughly 4 bits each, plus a small overhead for per-block scale factors. That is close to a 4x reduction in weight memory. It is also a reduction in the model's ability to represent subtle distinctions, and those distinctions matter more for some tasks than others.
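The memory side of this tradeoff is simple arithmetic, sketched below. The bits-per-weight figures are approximate assumptions for illustration (block formats store scale factors alongside the weights, so Q4_0 lands nearer 4.5 bits than 4), and the estimate covers weights only, not the KV cache or runtime buffers. The function name `weight_memory_gib` is hypothetical.

```python
# Approximate bits per weight, including per-block scale overhead.
# These are illustrative assumptions, not measured file sizes.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_0": 4.5,
}


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Estimate weight storage in GiB (weights only, no KV cache)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Compare formats for a hypothetical 8-billion-parameter model.
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: {weight_memory_gib(8e9, bpw):5.1f} GiB")
```

Under these assumptions the F16-to-Q4_0 ratio is 16 / 4.5, about 3.6x rather than a clean 4x.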
Coding tasks tend to tolerate lower-bit quantization well. The model is pattern-matching against highly structured syntax; the signal is strong. Open-ended reasoning, nuanced instruction following, and multilingual tasks degrade earlier. The loss shows up as subtle flattening: outputs that are almost right but miss the edge case.
Choosing well
The rule of thumb: use the highest-precision quantization (the most bits per weight) you can fit, not the lowest you can get away with. A Q5_K_M or Q6_K model on the same hardware often outperforms a Q4 in ways that matter, and the memory difference is a few gigabytes, not a factor of two.
If you are running Ollama or llama.cpp locally, know what fits in VRAM. A model that partially offloads to system RAM is usually slower than a smaller model that runs fully on the GPU. Sometimes Q4 of a 13B model beats Q6 of a 7B model for your use case. Benchmark on your task.
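One way to apply that rule of thumb is to estimate which quantization levels fit a VRAM budget and take the highest-precision one. The sketch below is a rough first pass, not a substitute for benchmarking: the bits-per-weight values are approximate, `overhead_gib` is an assumed placeholder for the KV cache and runtime buffers, and `pick_quant` is a hypothetical helper, not part of Ollama or llama.cpp.

```python
from typing import Optional

# Approximate bits per weight (illustrative assumptions).
QUANTS = {"Q4_0": 4.5, "Q6_K": 6.5625, "Q8_0": 8.5, "F16": 16.0}


def pick_quant(n_params: float, vram_gib: float,
               overhead_gib: float = 1.5) -> Optional[str]:
    """Return the highest-precision quant whose weights fit in VRAM.

    overhead_gib is an assumed reserve for KV cache and buffers;
    real usage depends on context length and runtime.
    """
    budget_bytes = (vram_gib - overhead_gib) * 2**30
    fitting = [
        (bpw, name)
        for name, bpw in QUANTS.items()
        if n_params * bpw / 8 <= budget_bytes
    ]
    # max() on (bits, name) tuples picks the most bits per weight.
    return max(fitting)[1] if fitting else None


print(pick_quant(8e9, vram_gib=8.0))   # hypothetical 8B model, 8 GiB card
print(pick_quant(13e9, vram_gib=4.0))  # nothing fits fully: returns None
```

A `None` result is the signal to consider a smaller model rather than partial offload, then benchmark the candidates on the actual task.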
The architecture implication
When you write down the model you are using in a system, write down the quantization too. "Llama 3.1 8B" is not a complete specification. "Llama 3.1 8B Q5_K_M, CPU offload 0 layers" is.
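In code or configuration, this can be as simple as making quantization a required field of whatever record names the model. The sketch below is one possible shape; the `ModelSpec` class and its field names are hypothetical, not drawn from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """A model identity that cannot be written down without its quantization."""
    family: str                  # e.g. "Llama 3.1"
    size: str                    # e.g. "8B"
    quant: str                   # e.g. "Q5_K_M"; required, no default
    cpu_offload_layers: int = 0  # layers kept on the CPU

    def label(self) -> str:
        return (f"{self.family} {self.size} {self.quant}, "
                f"CPU offload {self.cpu_offload_layers} layers")


spec = ModelSpec(family="Llama 3.1", size="8B", quant="Q5_K_M")
print(spec.label())  # Llama 3.1 8B Q5_K_M, CPU offload 0 layers
```

Leaving `quant` without a default means an incomplete specification fails at construction time instead of passing silently.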
Quantization is a coordinate in your design space. Treat it like one.