
Attention Heads Are Not Equal

inspiration | devinfo.dev | June 26, 2026 | devinfo.dev:2026.0046

Multi-head attention gives every query head its own key head and value head. That is thorough, and expensive. Grouped-Query Attention exploits the redundancy: Llama 3 70B serves 64 query heads from 8 KV heads, cuts its KV cache by 8x relative to an equivalent multi-head design, and loses almost nothing in quality.

In standard multi-head attention (MHA), every query head has its own key head and value head. A model with 64 query heads stores 64 sets of K and V tensors per layer, per token. That is the full cost — paid on every decode step, for every token generated.

It is also largely wasteful.

The bandwidth problem

Autoregressive decoding is memory-bandwidth-bound, not compute-bound. At each step you generate one token. To do that, you load the entire KV cache from GPU memory — one K vector and one V vector per head, per layer, per prior token. The arithmetic is trivial. The loading is not.
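A rough way to see why the loading dominates is to count floating-point operations per byte of KV cache read for a single decode step. The sketch below is a back-of-the-envelope estimate, not a profiler: the function name is hypothetical, it assumes 16-bit cache entries, and it ignores the softmax, the query and output vectors, and the weight reads.

```python
def decode_attention_intensity(n_q_heads: int, n_kv_heads: int, head_dim: int,
                               context_len: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of KV cache read, for one layer and one decode step."""
    # Each query head takes a dot product with every cached key (2 * head_dim
    # FLOPs per position) and a weighted sum over every cached value (another
    # 2 * head_dim FLOPs per position).
    flops = n_q_heads * context_len * 4 * head_dim
    # Each KV head holds one K vector and one V vector per cached position.
    bytes_read = n_kv_heads * context_len * 2 * head_dim * bytes_per_elem
    return flops / bytes_read


# One KV head per query head (MHA) versus 8 query heads per KV head.
mha_intensity = decode_attention_intensity(64, 64, 128, 2048)  # 1.0
shared_intensity = decode_attention_intensity(64, 8, 128, 2048)  # 8.0
print(mha_intensity, shared_intensity)
```

Under these assumptions MHA does about one FLOP per byte loaded, far below what an accelerator needs to keep its arithmetic units busy. Sharing KV heads raises that ratio in proportion to the number of query heads per KV head.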

Noam Shazeer identified this in 2019. His fix: Multi-Query Attention (MQA). All query heads share a single key head and a single value head. The KV cache shrinks by a factor of H — the number of heads. Decode gets dramatically faster. Quality drops, sometimes enough to matter.

The interpolation

Grouped-Query Attention (GQA), introduced by Ainslie et al. at EMNLP 2023, sits between the extremes. Instead of one KV head for all queries (MQA) or one KV head per query (MHA), GQA assigns one KV head per group of query heads. With G groups, the KV cache is reduced by a factor of H/G.

The design space becomes a dial: G = 1 is MQA, G = H is MHA, and every value in between trades cache size against quality.

Ainslie et al. found G = 8 to be a favorable middle ground — "increasing the number of groups from MQA only results in modest slowdowns initially, with increasing cost as we move closer to MHA."
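The grouping itself is only an index mapping from query heads to KV heads. The sketch below assumes contiguous groups (query heads 0 to H/G - 1 share KV head 0, and so on), which is one common layout but is not specified in the text above; the function name is hypothetical.

```python
def kv_head_for_query(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    if not 0 <= q_head < n_q_heads:
        raise IndexError("q_head out of range")
    group_size = n_q_heads // n_kv_heads  # query heads per KV head, H / G
    return q_head // group_size


H = 8  # small head count so the mapping is easy to read
mqa_map = [kv_head_for_query(q, H, 1) for q in range(H)]  # [0, 0, 0, 0, 0, 0, 0, 0]
gqa_map = [kv_head_for_query(q, H, 2) for q in range(H)]  # [0, 0, 0, 0, 1, 1, 1, 1]
mha_map = [kv_head_for_query(q, H, H) for q in range(H)]  # [0, 1, 2, 3, 4, 5, 6, 7]
print(mqa_map, gqa_map, mha_map)
```

With G = 1 every query head lands on the same KV head, with G = H the mapping is the identity, and intermediate values of G give the grouped case.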

The numbers in practice

Llama 2's 7B and 13B models used full MHA. Its 34B and 70B models switched to GQA. Llama 3 applied GQA across all sizes.

The concrete configuration for Llama 3 70B: 64 query heads, 8 KV heads. The KV cache reduction is exactly 8x. Assuming 80 layers, a head dimension of 128, and 16-bit cache entries, a batch of 32 sequences at 2,048 tokens needs roughly 172 GB of KV cache with 64 KV heads and roughly 21.5 GB with 8. The same factor applies to memory bandwidth during the attention step, which is what governs decode latency.
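The cache arithmetic can be reproduced in a few lines. This is a sketch under assumptions the text does not state: 80 layers, a head dimension of 128, and 2 bytes per cached element (fp16 or bf16). The absolute totals move with those assumptions; the 8x ratio does not.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: one K and one V vector per KV head, layer, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed model shape (not given in the text): 80 layers, head_dim 128.
N_LAYERS, HEAD_DIM = 80, 128
SEQ_LEN, BATCH = 2048, 32

mha_bytes = kv_cache_bytes(N_LAYERS, 64, HEAD_DIM, SEQ_LEN, BATCH)  # 64 KV heads
gqa_bytes = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN, BATCH)   # 8 KV heads

print(f"MHA: {mha_bytes / 1e9:.1f} GB")
print(f"GQA: {gqa_bytes / 1e9:.1f} GB")
print(f"reduction: {mha_bytes // gqa_bytes}x")
```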

Mistral 7B made the same choice: 32 query heads, 8 KV heads — a 4x reduction relative to an MHA baseline of the same size.

The quality cost is small. Empirically, GQA with G = 8 groups recovers near-MHA quality after a brief uptraining phase (about 5% of the original pre-training compute, per the Ainslie et al. recipe for converting existing checkpoints).
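Ainslie et al. describe initializing the grouped checkpoint by mean-pooling the key (and value) projection matrices of the heads in each group, before the uptraining step. The sketch below illustrates only that pooling step on toy data. The function name is hypothetical, plain nested lists stand in for weight tensors, and contiguous grouping is assumed.

```python
def mean_pool_kv_heads(head_weights: list, n_kv_heads: int) -> list:
    """Average per-head projection matrices within each group.

    head_weights: one matrix (list of rows) per original K or V head.
    Returns n_kv_heads matrices, each the element-wise mean of its group.
    """
    n_heads = len(head_weights)
    if n_heads % n_kv_heads != 0:
        raise ValueError("head count must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads
    pooled = []
    for g in range(n_kv_heads):
        group = head_weights[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(head[r][c] for head in group) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


# Four toy heads, each a 1x2 "projection matrix", pooled into two KV heads.
toy_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
pooled_heads = mean_pool_kv_heads(toy_heads, n_kv_heads=2)
print(pooled_heads)  # [[[2.0, 3.0]], [[6.0, 7.0]]]
```

The same function would be applied once to the key projections and once to the value projections; the query projections are left as they are.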

Why this matters when you are running inference

If you are serving a model on constrained hardware — a 24 GB GPU, a cloud ARM instance, a machine without a GPU at all — the KV cache is your binding constraint. Not the weights. The weights are static; you load them once. The KV cache grows with every token you generate and every request you serve concurrently.

A 64-head model that uses GQA with 8 groups fits 8x more cached tokens in that budget than its MHA equivalent. You can serve longer contexts. You can serve more concurrent requests. You can run on a smaller machine.
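To make the budget concrete, the sketch below counts how many cached tokens (summed across all concurrent requests) fit in a fixed amount of memory. Every number in it is a hypothetical illustration rather than a measurement: an 8 GiB cache budget, a 32-layer model with head dimension 128 and 64 query heads, and 2-byte cache entries.

```python
def max_cached_tokens(budget_bytes: int, n_layers: int, n_kv_heads: int,
                      head_dim: int, bytes_per_elem: int = 2) -> int:
    """Number of tokens whose K and V vectors fit in the given memory budget."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return budget_bytes // per_token


# Hypothetical deployment: 8 GiB left for the KV cache after loading weights.
BUDGET = 8 * 1024**3
mha_tokens = max_cached_tokens(BUDGET, n_layers=32, n_kv_heads=64, head_dim=128)
gqa_tokens = max_cached_tokens(BUDGET, n_layers=32, n_kv_heads=8, head_dim=128)

print(mha_tokens)  # 8192
print(gqa_tokens)  # 65536
```

Those tokens can be spent either way: as one long context or as many shorter concurrent requests.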

MHA is not wrong. For small, quality-sensitive models on unrestricted hardware, it remains the right choice. But for deployment — especially at the intersection of limited memory and long sequences — GQA is not a compromise. It is a better design.

The insight Shazeer had in 2019 and Ainslie et al. formalized in 2023 is simple: query heads are diverse; key and value heads are redundant. You do not need a separate KV projection for every query. You need enough. Eight is usually enough.

References