Sparse Is Not Small
Parameter count is the wrong unit of analysis for MoE models.
Mixtral 8x7B has roughly 47 billion parameters, and each token activates about 13 billion of them. DeepSeek-V3 has 671 billion parameters, and each token activates roughly 37 billion. In both cases, the per-token compute (the FLOPs that actually run during inference) is determined not by the total parameter count but by how many parameters the router activates at each transformer layer.
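To make the ratio concrete, here is a small sketch that computes the share of parameters a single token touches, using the rounded figures quoted above. The dictionary layout and the function name are illustrative choices, not anything taken from the model releases.

```python
# Active-parameter fraction per token, from the rounded counts in the text.
# All counts are in billions of parameters.
models = {
    "Mixtral 8x7B": {"total_b": 47, "active_b": 13},
    "DeepSeek-V3": {"total_b": 671, "active_b": 37},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that a single token runs through."""
    return active_b / total_b


for name, p in models.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
# Mixtral 8x7B: 27.7% of parameters active per token
# DeepSeek-V3: 5.5% of parameters active per token
```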
This is the core idea of Mixture of Experts: sparse conditional activation. Instead of running every parameter on every token, a small learned router selects a fixed number of expert subnetworks per token, sends the token through only those, and skips the rest.
How the Router Works
Each MoE layer replaces a dense feed-forward network with a pool of N expert FFNs plus a lightweight gating network. For each token:
1. The token's hidden state is multiplied by a small router weight matrix, producing N scalar logits — one per expert.
2. The top-k experts by logit score are selected. k is small and fixed: 1 in Switch Transformer, 2 in Mixtral, 8 routed experts in DeepSeek-V3.
3. The token is processed by only those k experts.
4. Outputs are weighted by normalized gate scores and summed.
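The four steps above can be sketched in plain Python for a single token. This is a minimal illustration, not any model's actual implementation: the function names, the toy experts, and the choice to normalize gates with a softmax over only the selected logits are assumptions (models differ in how they normalize gate scores).

```python
import math


def route_token(hidden, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for one token.

    hidden: list of d floats (the token's hidden state).
    router_weights: N rows of d floats, one row per expert.
    """
    # Step 1: one logit per expert (a matrix-vector product).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    # Step 2: indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Gate weights: softmax over the selected logits only, so they sum to 1.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]


def moe_layer(hidden, router_weights, experts, k=2):
    """Steps 3 and 4: run only the selected experts, then mix their outputs."""
    out = [0.0] * len(hidden)
    for i, gate in route_token(hidden, router_weights, k):
        y = experts[i](hidden)  # the other N - k experts are never called
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy setup: 4 "experts" that just scale the input (stand-ins for real FFNs).
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
hidden = [1.0, 2.0]

picks = route_token(hidden, router_weights, k=2)
output = moe_layer(hidden, router_weights, experts, k=2)
print(picks)
print(output)
```

With this input the logits are 1, 2, 3 and -3, so experts 2 and 1 are selected and the other two are never evaluated.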
The remaining N - k experts execute nothing. Their weights still occupy memory, but they cost zero compute for that token. Capacity grows. Per-token compute does not.
The Load Balancing Problem
The router is learned. Left unconstrained, it collapses: a small number of experts capture most tokens, the rest go undertrained, and you get a larger model that performs like a smaller one.
Two strategies address this:
Auxiliary loss (Mixtral, Switch Transformer): add a differentiable term to the training loss that penalizes unbalanced expert utilization. This works but requires careful tuning: with too large a coefficient, the model learns to route uniformly rather than correctly.
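Expert capacity limits (GShard, Switch Transformer): each expert has a hard cap on tokens per batch, set by a capacity factor. Tokens that overflow an expert skip it and pass through the layer's residual connection unchanged. The constraint is structural, not learned, and in these models it is used alongside an auxiliary loss rather than in place of one.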
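Both strategies reduce to short formulas. The sketch below follows the Switch Transformer formulation: the auxiliary loss is alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i, and the capacity is tokens per batch divided by the number of experts, times a capacity factor. The function names, the default values, and the rounding up are illustrative assumptions.

```python
import math


def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: per-token lists of N router probabilities (each sums to 1).
    assignments: per-token index of the expert the token was dispatched to.
    """
    tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    """Hard cap on tokens per expert; rounding up is an assumption here."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


# Balanced routing versus full collapse onto expert 0 (alpha=1 for clarity).
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2, alpha=1.0)
collapsed = load_balancing_loss([[1.0, 0.0]] * 4, [0, 0, 0, 0], 2, alpha=1.0)
print(balanced, collapsed)           # 1.0 2.0
print(expert_capacity(1024, 8))      # 160
```

The loss is smallest when both the dispatch fractions and the router probabilities are uniform, which is why an oversized coefficient pushes the router toward uniform routing.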
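DeepSeek-V3 goes further with what its report calls auxiliary-loss-free balancing: it introduces a bias term per expert, added to the routing scores before top-k selection, and adjusts it during training, lowering it for overloaded experts and raising it for underloaded ones. The bias affects only which experts are selected; the gate weights still come from the unbiased scores, so balance is maintained without contaminating the main loss. The report keeps only a very small sequence-wise balance loss as a safeguard, and no tokens are dropped.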
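A minimal sketch of the bias mechanism, under stated assumptions: scores are positive per-expert affinities, gate weights are the selected scores normalized to sum to 1, and the bias moves by a fixed step `gamma` per update. The function names, the value of `gamma`, and leaving the bias unchanged when load exactly equals the mean are illustrative choices, not details confirmed here.

```python
def select_experts(scores, bias, k):
    """Pick top-k by biased score; gate weights use the unbiased scores."""
    biased = [s + b for s, b in zip(scores, bias)]
    top = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


def update_bias(bias, load, gamma=0.001):
    """Nudge each expert's bias down if overloaded, up if underloaded.

    load: number of tokens each expert received in the last batch.
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


scores = [0.9, 0.8, 0.1]
print(select_experts(scores, [0.0, 0.0, 0.0], k=1))   # expert 0 wins
print(select_experts(scores, [-0.2, 0.0, 0.0], k=1))  # bias shifts the pick to expert 1
print(update_bias([0.0, 0.0, 0.0], [10, 2, 0], gamma=0.1))
```

Because the bias is updated outside of backpropagation, it steers selection without adding a gradient term to the training objective.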
What This Means for Inference
Running a 671B MoE model does not mean running a 671B dense model. The memory footprint is large — all expert weights must reside in VRAM or be paged. But the compute per token is bounded by the active parameter count, not the total.
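A back-of-the-envelope sketch of that split, using the DeepSeek-V3 figures from this article. It assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and it counts weights only: KV cache, activations, and attention cost over long contexts are ignored. The function names are illustrative.

```python
def weight_memory_gb(total_params_b, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    total_params_b is in billions, so billions * bytes/param gives GB directly.
    """
    return total_params_b * bytes_per_param


def forward_gflops_per_token(active_params_b):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
    return 2 * active_params_b


total_b, active_b = 671, 37
print(weight_memory_gb(total_b, 2))        # 1342 GB at 2 bytes/param (e.g. BF16)
print(weight_memory_gb(total_b, 1))        # 671 GB at 1 byte/param (e.g. FP8)
print(forward_gflops_per_token(active_b))  # 74 GFLOPs per token (MoE)
print(forward_gflops_per_token(total_b))   # 1342 GFLOPs per token if it were dense
```

Memory scales with the total count and compute with the active count, so the two estimates diverge by more than an order of magnitude for this model.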
For serving, this creates a specific tradeoff: the binding constraint for an MoE model is memory capacity, not arithmetic. You need enough VRAM to hold the full weight set. But once you have it, the compute per token is comparable to that of a much smaller dense model.
Expert parallelism — sharding experts across devices so each device holds a subset — is the standard serving strategy. The cost is communication: tokens must be routed across devices to reach their assigned experts. An all-to-all collective replaces the all-reduce used in tensor parallelism. On high-bandwidth interconnects, this is fine. On commodity networking, it becomes the bottleneck.
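The communication cost can be estimated by counting where each routed token copy has to travel. The sketch below assumes experts are sharded contiguously across devices and that the expert count divides evenly by the device count; the function names and the toy routing table are hypothetical, and real systems perform the exchange with a collective operation rather than Python loops.

```python
def device_of(expert, num_experts, num_devices):
    """Contiguous sharding: the first N/D experts live on device 0, and so on."""
    return expert // (num_experts // num_devices)


def all_to_all_counts(token_devices, token_experts, num_experts, num_devices):
    """counts[src][dst] = token copies sent from device src to device dst."""
    counts = [[0] * num_devices for _ in range(num_devices)]
    for src, chosen in zip(token_devices, token_experts):
        for e in chosen:
            counts[src][device_of(e, num_experts, num_devices)] += 1
    return counts


def cross_device_fraction(counts):
    """Share of token copies that must leave the device they started on."""
    total = sum(sum(row) for row in counts)
    local = sum(counts[d][d] for d in range(len(counts)))
    return (total - local) / total


# 8 experts on 4 devices (2 per device), 4 tokens, top-2 routing.
token_devices = [0, 0, 1, 3]                      # device each token starts on
token_experts = [(0, 5), (1, 2), (2, 3), (6, 0)]  # experts chosen per token
counts = all_to_all_counts(token_devices, token_experts, 8, 4)
print(counts)
print(cross_device_fraction(counts))  # 0.375
```

Every off-diagonal entry in `counts` is traffic on the interconnect, which is why the same routing pattern can be cheap on one network and a bottleneck on another.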
The Single Claim
Total parameters set a memory floor. Active parameters set the compute ceiling. If you are reasoning about inference compute for an MoE model and using total parameter count as your estimate, your estimate is wrong.
References
- Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017. https://arxiv.org/abs/1701.06538
- Jiang, A. Q. et al. (2024). Mixtral of Experts. Mistral AI. https://arxiv.org/abs/2401.04088
- Dai, D. et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ACL 2024. https://arxiv.org/abs/2401.06066
- DeepSeek-AI (2024). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
- Zoph, B. et al. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. Google Research. https://arxiv.org/abs/2202.08906
- Zhou, Y. et al. (2022). Mixture-of-Experts with Expert Choice Routing. NeurIPS 2022. https://papers.neurips.cc/paper_files/paper/2022/file/2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf
Cite as
devinfo.dev. (2026). "Sparse Is Not Small." devinfo.dev:2026.0038. https://devinfo.dev/d/2026.0038
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.