Sparse Is Not Small

devinfo.dev — June 19, 2026

devinfo.dev:2026.0038

#inference #mixture-of-experts #architecture #llm-serving

Parameter count is the wrong unit of analysis for MoE models.

Mixtral 8x7B has 47 billion parameters. Each token activates about 13 billion of them. DeepSeek-V3 has 671 billion parameters. Each token activates roughly 37 billion of them. In both cases, the per-token compute (the FLOPs that actually run during inference) is determined not by the total parameter count but by how many parameters the router activates at each MoE layer.
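As a rough illustration of that gap, the sketch below applies the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. The rule of thumb, the function names, and the decision to ignore attention-over-context cost are assumptions made here for illustration; the parameter counts are the ones quoted above.

```python
# Rough forward-pass compute per token, assuming ~2 FLOPs per active parameter
# (one multiply and one add per weight). Attention over the context is ignored.
MODELS = {
    # name: (total_params, active_params_per_token), figures quoted in the text
    "Mixtral 8x7B": (47e9, 13e9),
    "DeepSeek-V3": (671e9, 37e9),
}

def flops_per_token(active_params: float) -> float:
    """Approximate forward FLOPs for one token under the 2x rule of thumb."""
    return 2.0 * active_params

def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights that a single token actually touches."""
    return active_params / total_params

for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} of weights active, "
          f"~{flops_per_token(active) / 1e9:.0f} GFLOPs per token")
```

Under these assumptions a Mixtral token touches a bit over a quarter of the weights, and a DeepSeek-V3 token touches under 6% of them.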

This is the core idea of Mixture of Experts: sparse conditional activation. Instead of running every parameter on every token, a small learned router selects a fixed number of expert subnetworks per token, sends the token through only those, and skips the rest.

How the Router Works

Each MoE layer replaces a dense feed-forward network with a pool of N expert FFNs plus a lightweight gating network. For each token:

1. The token's hidden state is multiplied by a small router weight matrix, producing N scalar logits — one per expert.

2. The top-k experts by logit score are selected (k = 2 in Mixtral; other models use values from k = 1 to k = 8).

3. The token is processed by only those k experts.

4. Outputs are weighted by normalized gate scores and summed.

The remaining N - k experts execute nothing for that token. Their weights still occupy memory, but they contribute zero compute. Model capacity grows with N. Per-token compute does not.
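The four routing steps can be sketched for a single token in plain Python. This is a minimal sketch, not any particular model's implementation: the function name `route_token`, the list-based tensors, and the choice to normalize with a softmax over only the selected logits (one common formulation) are assumptions made for illustration.

```python
import math

def route_token(hidden, router_weights, experts, k=2):
    """Route one token through its top-k experts and mix their outputs.

    hidden:         list[float], the token's hidden state (length d)
    router_weights: list of N rows, each of length d (one row per expert)
    experts:        list of N callables mapping a hidden state to a hidden state
    """
    # Step 1: one scalar logit per expert (router_weights @ hidden).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]

    # Step 2: indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Gate weights: softmax over the selected logits only (assumed here).
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    gates = [e / sum(exps) for e in exps]

    # Steps 3 and 4: run only the chosen experts, then take the gated sum.
    output = [0.0] * len(hidden)
    for gate, idx in zip(gates, top):
        expert_out = experts[idx](hidden)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, top, gates

# Toy usage: four stand-in "experts" that just scale the input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
out, chosen, gates = route_token([0.5, 1.5], router_weights, experts, k=2)
print(chosen, gates)
```

Experts 0 and 3 are never called for this token, which is the whole point: their weights exist, but they cost nothing here.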

The Load Balancing Problem

The router is learned. Left unconstrained, it collapses: a small number of experts capture most tokens, the rest go undertrained, and you get a larger model that performs like a smaller one.

Two strategies address this:

Auxiliary loss (Mixtral, Switch Transformer): add a differentiable penalty to the training loss that grows as expert utilization becomes imbalanced. This works but requires careful tuning: too large a penalty and the model learns to route uniformly rather than correctly.
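One well-known form of this penalty is the Switch Transformer loss for top-1 routing, alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below computes that value in plain Python. The function names and the alpha value are illustrative assumptions, and a real training setup would compute this inside an autodiff framework so that gradients flow through P_i.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(batch_logits, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    batch_logits: list of per-token router logits, each of length N.
    Returns alpha * N * sum_i f_i * P_i.
    """
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    probs = [softmax(row) for row in batch_logits]

    # f_i: fraction of tokens whose top choice is expert i.
    f = [0.0] * num_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / num_tokens

    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = load_balancing_loss([[1.0, 0.0], [0.0, 1.0]])
collapsed = load_balancing_loss([[5.0, 0.0], [5.0, 0.0]])
print(balanced, collapsed)
```

With perfectly even routing the loss equals alpha; as routing collapses onto one expert it approaches alpha * N, so the penalty grows with the imbalance.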

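Expert capacity limits (GShard, Switch Transformer): each expert has a hard cap on tokens per batch, usually set by a capacity factor. Overflow tokens are either dropped from the layer, so that only the residual connection carries them forward, or rerouted to the next-best expert with room. The constraint is structural, not learned, but in practice it is combined with an auxiliary loss rather than replacing one.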
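A toy version of a hard capacity cap is sketched below. It assumes a cap of ceil(capacity_factor * tokens / experts) and a greedy policy in which an overflowing token falls through to its next-ranked expert with room and is dropped only if none is available. Overflow handling varies between systems, so the policy, the function name, and the default capacity factor are assumptions made for illustration.

```python
import math

def assign_with_capacity(ranked_choices, num_experts, capacity_factor=1.25):
    """Greedy top-1 assignment with a hard per-expert token cap.

    ranked_choices: per token, expert indices in descending router preference.
    Returns (assignment, capacity); assignment[t] is an expert index, or None
    if every expert in that token's ranking was already full (token dropped).
    """
    num_tokens = len(ranked_choices)
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    load = [0] * num_experts
    assignment = []
    for ranking in ranked_choices:
        chosen = None
        for expert in ranking:  # fall through to the next-best expert on overflow
            if load[expert] < capacity:
                chosen = expert
                load[expert] += 1
                break
        assignment.append(chosen)
    return assignment, capacity

# Four tokens that all prefer expert 0, two experts, capacity factor 1.0.
rerouted, cap = assign_with_capacity([[0, 1]] * 4, num_experts=2, capacity_factor=1.0)
dropped, _ = assign_with_capacity([[0]] * 4, num_experts=2, capacity_factor=1.0)
print(cap, rerouted, dropped)
```

The second call gives each token only one ranked choice, which mimics a drop-on-overflow policy: the tokens beyond the cap come back as `None`.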

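DeepSeek-V3 goes further: it introduces a bias term per expert, added to the routing scores before top-k selection, that is adjusted dynamically during training to maintain balance. The bias influences only which experts are selected, not the gate weights applied to their outputs, so it balances load without contaminating the main loss. The technical report pairs this with only a very small sequence-wise auxiliary loss and drops no tokens.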
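The bias mechanism can be sketched in two small functions: one that selects experts using score plus bias, and one that nudges each bias after a step according to observed load. This is a simplified sketch under stated assumptions: scores are taken to be positive affinities normalized over the selected experts, and the function names, the fixed step size, and the use of mean load as the threshold are illustrative choices, not details confirmed by the text above.

```python
def select_experts(scores, bias, k):
    """Pick top-k by (score + bias); gate weights use the unbiased scores."""
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = order[:k]
    total = sum(scores[i] for i in chosen)
    gates = {i: scores[i] / total for i in chosen}
    return chosen, gates

def update_bias(bias, expert_load, step=0.001):
    """Nudge each bias down if its expert was overloaded, up if underloaded."""
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.9, 0.8, 0.1]
print(select_experts(scores, [0.0, 0.0, 0.0], k=1))   # expert 0 wins on score
print(select_experts(scores, [-1.0, 0.0, 0.0], k=1))  # bias steers to expert 1
print(update_bias([0.0, 0.0, 0.0], [10, 2, 0], step=0.1))
```

Because the update is applied outside the loss, no balancing gradient reaches the model weights; the bias only changes which experts win the top-k comparison.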

What This Means for Inference

Running a 671B MoE model does not mean running a 671B dense model. The memory footprint is large — all expert weights must reside in VRAM or be paged. But the compute per token is bounded by the active parameter count, not the total.

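For serving, this creates a specific tradeoff: MoE models are constrained by memory capacity, not by per-token compute. You need enough VRAM to hold the full weight set. Once you have it, the compute per token is comparable to that of a much smaller dense model. The caveat is batching: different tokens in a batch pick different experts, so a large batch still reads most of the expert weights on every step.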
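The memory side of the tradeoff is simple arithmetic: total parameters times bytes per parameter. The sketch below works through it for the two models discussed here. The precisions chosen, the 80 GB device size, and the function names are assumptions for illustration, and the figures cover weights only, with no allowance for KV cache or activations.

```python
import math

def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

def min_devices(total_params: float, bytes_per_param: float,
                device_gb: float) -> int:
    """Fewest devices whose combined memory can hold the weights alone."""
    return math.ceil(weight_memory_gb(total_params, bytes_per_param) / device_gb)

# Hypothetical serving setups: 16-bit Mixtral and 8-bit DeepSeek-V3 on 80 GB devices.
for name, params, nbytes in [("Mixtral 8x7B", 47e9, 2), ("DeepSeek-V3", 671e9, 1)]:
    print(f"{name}: {weight_memory_gb(params, nbytes):.0f} GB of weights, "
          f"at least {min_devices(params, nbytes, 80)} x 80 GB devices")
```

The device count is driven entirely by the total parameter count; the active count never enters the calculation.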

Expert parallelism, which shards experts across devices so each device holds a subset, is the standard serving strategy. The cost is communication: tokens must be routed across devices to reach their assigned experts and the results sent back. All-to-all collectives take the place of the all-reduce used in tensor parallelism. On high-bandwidth interconnects, the overhead is manageable. On commodity networking, it becomes the bottleneck.
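To make the communication cost concrete, the sketch below counts how many token copies each device would have to send to every other device for a given set of routing decisions. It is a bookkeeping model only, with no real networking. The contiguous sharding rule (expert e lives on device e // experts_per_device) and the function names are assumptions for illustration.

```python
def all_to_all_send_counts(token_device, token_experts,
                           experts_per_device, num_devices):
    """Count token copies each device sends to every device.

    token_device:  device index where each token currently lives
    token_experts: per token, the expert indices chosen by the router
    Returns counts[src][dst]; the diagonal is traffic that stays local.
    """
    counts = [[0] * num_devices for _ in range(num_devices)]
    for src, experts in zip(token_device, token_experts):
        for e in experts:
            dst = e // experts_per_device  # assumed contiguous expert sharding
            counts[src][dst] += 1
    return counts

def cross_device_fraction(counts):
    """Share of dispatched token copies that must leave their source device."""
    total = sum(sum(row) for row in counts)
    local = sum(counts[d][d] for d in range(len(counts)))
    return (total - local) / total if total else 0.0

# Four tokens, top-2 routing, four experts sharded two per device.
counts = all_to_all_send_counts(
    token_device=[0, 0, 1, 1],
    token_experts=[[0, 2], [1, 3], [0, 1], [2, 3]],
    experts_per_device=2,
    num_devices=2,
)
print(counts, cross_device_fraction(counts))
```

In this toy batch half of the dispatched copies cross the device boundary, and the same volume has to come back after the experts run.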

The Single Claim

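Total parameters set a memory floor. Active parameters set the compute per token. If you are reasoning about inference compute for a MoE model and using total parameter count as your estimate, your estimate is too high by roughly the ratio of total to active parameters: about 3.6x for Mixtral 8x7B and about 18x for DeepSeek-V3.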

References

Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017. https://arxiv.org/abs/1701.06538
Jiang, A. Q. et al. (2024). Mixtral of Experts. Mistral AI. https://arxiv.org/abs/2401.04088
DeepSeek-AI (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ACL 2024. https://arxiv.org/abs/2401.06066
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
Zoph, B. et al. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. Google Research. https://arxiv.org/abs/2202.08906
Zhou, Y. et al. (2022). Mixture-of-Experts with Expert Choice Routing. NeurIPS 2022. https://papers.neurips.cc/paper_files/paper/2022/file/2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf

Cite as

devinfo.dev. (2026). "Sparse Is Not Small." devinfo.dev:2026.0038. https://devinfo.dev/d/2026.0038

devinfo.dev | https://devinfo.dev/d/2026.0038
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
