You download a .gguf file. You point llama.cpp or Ollama at it. It works.
What you downloaded is not a weight dump. It is a binary container format — and the design decisions inside it explain most of why self-hosted inference became practical.
GGUF (GGML Universal File Format) replaced GGML and its short-lived successor GGJT in August 2023. The core problem with the old formats: they stored model hyperparameters as a fixed, untyped list. Add a new field, break every existing reader. The format was fragile by design.
GGUF fixed this with a typed key-value metadata structure. New metadata can be appended without touching existing entries. Readers that do not recognize a key skip it safely. The format is forward-compatible by construction.
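To make the forward-compatibility point concrete, here is a minimal sketch of typed key-value pairs, assuming the encoding described in the GGUF specification (a length-prefixed UTF-8 key, a uint32 type tag, then the value, all little-endian). It covers only four of the value types, and the key `example.future_flag` is hypothetical, standing in for metadata added after the reader was written.

```python
import io
import struct

# GGUF metadata value type IDs (a subset), as listed in the GGUF spec.
UINT32, FLOAT32, BOOL, STRING = 4, 6, 7, 8
SCALAR_FORMATS = {UINT32: "<I", FLOAT32: "<f", BOOL: "<?"}


def write_string(buf, text):
    data = text.encode("utf-8")
    buf.write(struct.pack("<Q", len(data)))  # uint64 length prefix
    buf.write(data)


def write_kv(buf, key, value_type, value):
    write_string(buf, key)
    buf.write(struct.pack("<I", value_type))  # uint32 type tag
    if value_type == STRING:
        write_string(buf, value)
    else:
        buf.write(struct.pack(SCALAR_FORMATS[value_type], value))


def read_string(buf):
    (length,) = struct.unpack("<Q", buf.read(8))
    return buf.read(length).decode("utf-8")


def read_kv(buf):
    key = read_string(buf)
    (value_type,) = struct.unpack("<I", buf.read(4))
    if value_type == STRING:
        return key, read_string(buf)
    fmt = SCALAR_FORMATS[value_type]
    (value,) = struct.unpack(fmt, buf.read(struct.calcsize(fmt)))
    return key, value


def read_known(buf, count, wanted):
    """Decode `count` pairs, keeping only the keys this reader understands."""
    found = {}
    for _ in range(count):
        key, value = read_kv(buf)  # the type tag tells us how many bytes to consume
        if key in wanted:
            found[key] = value
    return found


buf = io.BytesIO()
write_kv(buf, "general.architecture", STRING, "llama")
write_kv(buf, "llama.context_length", UINT32, 4096)
write_kv(buf, "example.future_flag", BOOL, True)  # hypothetical key added later
buf.seek(0)

old_reader = read_known(buf, 3, {"general.architecture", "llama.context_length"})
leftover = buf.read()
print(old_reader)
```

Because every value carries its own type tag, the older reader can step over the unfamiliar pair and still end up at the correct position in the stream.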
A GGUF file has four sections: a header (the magic bytes GGUF, a version number, and the tensor and metadata counts), a metadata block, a tensor info table, and the raw tensor data.
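The fixed-size start of the file can be read with a single `struct` call. The sketch below assumes the version 2 and later layout from the GGUF specification: 4 magic bytes, a uint32 version, then a uint64 tensor count and a uint64 metadata key-value count, all little-endian. The function name `parse_header` and the sample counts are illustrative, not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic (4 bytes), version (uint32), tensor_count (uint64), metadata_kv_count (uint64)
HEADER = struct.Struct("<4sIQQ")


def parse_header(data):
    """Parse the first 24 bytes of a GGUF file (assumes the v2+ layout)."""
    magic, version, tensor_count, kv_count = HEADER.unpack_from(data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# A synthetic header stands in for the start of a real file here.
sample = HEADER.pack(GGUF_MAGIC, 3, 291, 24)
header = parse_header(sample)
print(header)
```

For a real file, the same function would be given the first `HEADER.size` bytes read from disk; the metadata block and tensor info table follow immediately after.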
The metadata block is what changes everything:
1. The model architecture. Context length, attention heads, feed-forward dimensions, RoPE frequency base — all embedded. No external config file required.
2. The tokenizer. Vocabulary, merge rules, special tokens, chat template — all present. One file, no tokenizer.json sidecar.
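3. The quantization scheme, per tensor. Each entry in the tensor info table records that tensor's own type, so embedding layers can be F16 while attention layers are Q4_K and feed-forward layers are Q6_K. Presets such as Q4_K_M are names for this kind of mix. Mixed precision is first-class, not a hack.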
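What per-tensor quantization means for file size can be worked out directly. The sketch below assumes the ggml block layouts for three tensor types: F16 stores 2 bytes per weight, Q4_K packs 256 weights into 144 bytes (4.5 bits per weight), and Q6_K packs 256 weights into 210 bytes (6.5625 bits per weight). The tensor shapes are illustrative, roughly those of a 7B Llama-style model, and are not read from any real file.

```python
# (weights per block, bytes per block) for a few ggml tensor types.
BLOCK_LAYOUT = {
    "F16": (1, 2),
    "Q4_K": (256, 144),
    "Q6_K": (256, 210),
}


def tensor_bytes(shape, ggml_type):
    """Storage size of one tensor given its shape and quantization type."""
    block_weights, block_bytes = BLOCK_LAYOUT[ggml_type]
    n_weights = 1
    for dim in shape:
        n_weights *= dim
    if n_weights % block_weights:
        raise ValueError("weight count must be a multiple of the block size")
    return n_weights // block_weights * block_bytes


# Illustrative tensors: each one carries its own type.
tensors = [
    ("token_embd.weight", (4096, 32000), "F16"),
    ("blk.0.attn_q.weight", (4096, 4096), "Q4_K"),
    ("blk.0.ffn_down.weight", (11008, 4096), "Q6_K"),
]

sizes = {name: tensor_bytes(shape, t) for name, shape, t in tensors}
for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.1f} MiB")
```

Keeping the type next to each tensor lets a quantizer spend more bits where precision matters most and fewer elsewhere, without the reader needing any special case.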
The tensor data section is aligned to a fixed byte boundary (default 32 bytes). That alignment is not cosmetic: it enables memory-mapping. The OS pages in only what is accessed, so nothing has to be parsed or copied up front. A 7B model can be ready in under a second when the file is already in the page cache. A 70B model can be partially loaded across CPU and GPU without copying the full file into RAM.
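The alignment rule and the memory-mapped read can be sketched in a few lines. This is a toy stand-in, not a GGUF parser: it writes 45 bytes of fake metadata to a temporary file, pads to the next 32-byte boundary, appends a small payload, and then maps the file read-only. The helper name `align_offset` is an assumption for illustration.

```python
import mmap
import os
import tempfile

DEFAULT_ALIGNMENT = 32  # GGUF's default when no other alignment is specified


def align_offset(offset, alignment=DEFAULT_ALIGNMENT):
    """Round `offset` up to the next multiple of `alignment`."""
    return offset + (alignment - offset % alignment) % alignment


# Stand-in file: unaligned "metadata", zero padding, then the "tensor data".
meta = b"M" * 45
data_start = align_offset(len(meta))
payload = bytes(range(16))

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(meta)
    f.write(b"\x00" * (data_start - len(meta)))
    f.write(payload)
    path = f.name

with open(path, "rb") as f:
    # Mapping the file does not read it; pages are loaded when first touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first_bytes = mapped[data_start:data_start + 4]

os.remove(path)
print(data_start, first_bytes)
```

Only the slice that is actually indexed has to be brought into memory, which is why a loader can open a very large file and touch just the tensors it needs.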
GGML was built for Llama. When Falcon, Mistral, Phi, and Qwen arrived, each needed custom parsing code. GGUF's generic key-value metadata handles every architecture with the same reader. Over 100 architectures are now supported.
The other reason: GGML had no embedded tokenizer. Every model needed external Python scripts at load time. GGUF eliminated that dependency entirely. The file is the model.
When Hugging Face standardized on GGUF for quantized model distribution, self-hosted inference crossed a threshold. The format made the problem of "getting a model running" trivially solvable. The hard problem shifted upstream — to quantization quality, context length, and throughput — where it belongs.
GGUF did not make models better. It made distribution correct. That is enough.