Use Ctrl+P (or Cmd+P) to save as PDF. Back to paper

Quantization Is a Design Decision

inspiration | devinfo.dev | May 24, 2026 | devinfo.dev:2026.0004

Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.

Quantization is often treated as a deployment detail. Download the Q4_K_M GGUF, run it, move on.

That framing is wrong.

When you quantize a model, you are making a precision tradeoff. You are saying: I accept some loss in representational accuracy in exchange for lower memory and faster inference. That tradeoff has consequences — for output quality, for task suitability, for hardware fit.

Those consequences belong in your design decisions, not buried in a filename.

What the quantization level actually controls

A full-precision (F16 or BF16) model represents each weight as a 16-bit float. A Q4 model represents weights in roughly 4 bits. That is a 4x reduction in memory. It is also a reduction in the model's ability to represent subtle distinctions — distinctions that matter more for some tasks than others.

Coding tasks tolerate lower quantization well. The model is pattern-matching against highly structured syntax; the signal is strong. Open-ended reasoning, nuanced instruction following, and multilingual tasks degrade earlier. The loss shows up as subtle flattening — outputs that are almost right but miss the edge case.

Choosing well

The rule of thumb: use the highest quantization you can fit, not the lowest you can get away with. A Q5_K_M or Q6_K model on the same hardware often outperforms a Q4 in ways that matter — and the memory difference is a few gigabytes, not a factor of two.

If you are running Ollama or llama.cpp locally, know what fits in VRAM. A model that partially offloads to RAM is slower than a smaller model that runs fully on-device. Sometimes Q4 of a 13B model beats Q6 of a 7B model for your use case. Benchmark on your task.

The architecture implication

When you write down the model you are using in a system, write down the quantization too. "Llama 3.1 8B" is not a complete specification. "Llama 3.1 8B Q5_K_M, CPU offload 0 layers" is.

Quantization is a coordinate in your design space. Treat it like one.

References