Sparse Is Not Small

devinfo.dev — June 19, 2026

devinfo.dev:2026.0038

Parameter count is the wrong unit of analysis for MoE models.

Mixtral 8x7B has about 47 billion parameters. Each token activates about 13 billion of them. DeepSeek-V3 has 671 billion parameters. Each token activates roughly 37 billion of them. In both cases, the per-token compute (the FLOPs that actually run during inference) is determined not by the total parameter count but by the routing decision made at each transformer layer.
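The arithmetic behind those figures is easy to check. The sketch below computes the share of weights each token touches, using the parameter counts quoted above. The FLOPs estimate rests on an assumption that is not from this article: the common rule of thumb of about 2 FLOPs per active parameter per token, which ignores attention cost over the context. Function names are illustrative.

```python
# Parameter counts in billions, as quoted in the text.
MODELS = {
    "Mixtral 8x7B": {"total_b": 47, "active_b": 13},
    "DeepSeek-V3": {"total_b": 671, "active_b": 37},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights that a single token actually touches."""
    return active_b / total_b


def approx_flops_per_token(active_b: float) -> float:
    """Assumed rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2 * active_b * 1e9


for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    gflops = approx_flops_per_token(p["active_b"]) / 1e9
    print(f"{name}: {frac:.1%} of weights active, ~{gflops:.0f} GFLOPs per token")
```

Under these assumptions, a DeepSeek-V3 token touches about 5.5% of the weights, against roughly 27.7% for Mixtral 8x7B.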

This is the core idea of Mixture of Experts (MoE): sparse conditional activation. Instead of running every parameter on every token, a small learned router selects a fixed number of expert subnetworks per token, sends the token through only those, and skips the rest.

How the Router Works

Each MoE layer replaces a dense feed-forward network with a pool of N expert FFNs plus a lightweight gating network. For each token:

1. The token's hidden state is multiplied by a small router weight matrix, producing N scalar logits — one per expert.

2. The top-k experts by logit score are selected (k = 2 in Mixtral; DeepSeek-V3 selects 8 of its 256 routed experts).

3. The token is processed by only those k experts.

4. Outputs are weighted by normalized gate scores and summed.

The remaining N - k experts execute nothing. Their weights are loaded, so they consume memory, but they cost zero compute for that token. Capacity grows. Per-token compute does not.
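The four routing steps can be sketched in plain Python. This is a minimal illustration, not any particular model's implementation. It assumes the gate scores are normalized with a softmax over the selected logits only (the form described in the Mixtral paper); other models normalize differently. The names moe_layer and router_weights, and the toy experts, are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(
    hidden: Vector,
    router_weights: Sequence[Vector],  # one row of weights per expert
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Vector:
    # Step 1: one scalar logit per expert (hidden state times router matrix).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]

    # Step 2: indices of the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Gate weights: softmax over the selected logits only (an assumption).
    gates = softmax([logits[i] for i in top])

    # Steps 3 and 4: run only the chosen experts and sum their weighted outputs.
    out = [0.0] * len(hidden)
    for gate, idx in zip(gates, top):
        expert_out = experts[idx](hidden)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Toy usage: four "experts" that simply scale their input by a constant.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
result = moe_layer([2.0, 1.0], router_weights, experts, k=2)
print(result)
```

In the toy call, the logits are 2, 1, -2 and -1, so only the first two experts run; the other two functions are never called.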

The Load Balancing Problem

The router is learned. Left unconstrained, it collapses: a small number of experts capture most tokens, the rest go undertrained, and you get a larger model that performs like a smaller one.

Two strategies address this:

Auxiliary loss (Mixtral, Switch Transformer): add a differentiable penalty to the training loss that grows as expert utilization becomes unbalanced. This works but requires careful tuning: too large a penalty and the model learns to route uniformly rather than correctly.
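One form such a penalty can take is sketched below, assuming the Switch Transformer formulation: alpha * N * sum over experts of f_i * P_i, where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability assigned to it. The exact loss used by other models may differ, and the alpha value here is only illustrative.

```python
from typing import Sequence


def load_balance_loss(router_probs: Sequence[Sequence[float]], alpha: float = 0.01) -> float:
    """Switch-Transformer-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    router_probs[t][i] is the router's softmax probability of expert i for token t.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f_i: fraction of tokens whose top-1 choice is expert i (not differentiable).
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]

    # P_i: mean probability mass placed on expert i (the differentiable part).
    p = [sum(probs[i] for probs in router_probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = [[0.9, 0.1], [0.1, 0.9]]   # each expert wins one token
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # expert 0 wins every token
print(load_balance_loss(balanced), load_balance_loss(collapsed))
```

The collapsed batch scores a higher loss than the balanced one, which is the gradient signal that pushes the router away from collapse.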

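Expert capacity limits (Switch Transformer): each expert has a hard cap on tokens per batch, set by a capacity factor. Tokens that overflow a full expert skip the expert layer and pass through the residual connection. The constraint is structural, not learned, though Switch Transformer still pairs it with the auxiliary loss to keep overflow rare. DeepSeek-V3, by contrast, reports that it drops no tokens.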
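The capacity rule can be sketched as follows. This assumes the usual definition of capacity as an even share of the batch scaled by a capacity factor, with top-1 routing and first-come-first-served admission. The 1.25 factor is illustrative, and what happens to overflow tokens (rerouting or skipping the layer) is deliberately left to the caller.

```python
import math
from typing import Dict, List, Sequence, Tuple


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float = 1.25) -> int:
    """Hard cap on tokens per expert: an even share scaled by a slack factor."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def dispatch_with_capacity(
    top1_expert: Sequence[int], num_experts: int, capacity_factor: float = 1.25
) -> Tuple[Dict[int, List[int]], List[int]]:
    """Assign tokens to their preferred expert until that expert is full.

    Returns (accepted token ids per expert, overflow token ids).
    """
    cap = expert_capacity(len(top1_expert), num_experts, capacity_factor)
    accepted: Dict[int, List[int]] = {e: [] for e in range(num_experts)}
    overflow: List[int] = []
    for token_id, expert in enumerate(top1_expert):
        if len(accepted[expert]) < cap:
            accepted[expert].append(token_id)
        else:
            overflow.append(token_id)  # caller decides: reroute or skip
    return accepted, overflow


# Eight tokens, four experts, six of the tokens prefer expert 0.
accepted, overflow = dispatch_with_capacity([0, 0, 0, 0, 0, 0, 1, 2], num_experts=4)
print(accepted, overflow)
```

With eight tokens and four experts the cap is three tokens per expert, so three of the six tokens that prefer expert 0 overflow.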

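DeepSeek-V3 goes further: it adds a per-expert bias term to the gate scores before top-k selection and adjusts it during training, lowering the bias of overloaded experts and raising it for underloaded ones. The bias influences only which experts are selected; the output weights still come from the unbiased scores, so balance is maintained without contaminating the main loss.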
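A minimal sketch of the bias mechanism follows. Two things here are assumptions rather than details from the text: "overloaded" is taken to mean above the mean load for the batch, and the update is a fixed-size step. The function names and the exaggerated step size are hypothetical.

```python
from typing import List, Sequence


def select_experts(scores: Sequence[float], bias: Sequence[float], k: int) -> List[int]:
    """Top-k selection on biased scores; the bias affects selection only."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]


def update_bias(bias: Sequence[float], load: Sequence[int], step: float = 0.001) -> List[float]:
    """Nudge each bias down if its expert is overloaded, up if underloaded."""
    mean_load = sum(load) / len(load)  # assumed definition of a balanced load
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


scores = [0.5, 0.4, 0.3]  # a toy token that prefers expert 0
bias = [0.0, 0.0, 0.0]
first = select_experts(scores, bias, k=1)
# Suppose expert 0 took all six tokens in the batch; use a large step for clarity.
bias = update_bias(bias, load=[6, 0, 0], step=0.2)
second = select_experts(scores, bias, k=1)
print(first, second)
```

No gradient flows through update_bias: the adjustment happens outside the loss, which is why it does not compete with the language modeling objective.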

What This Means for Inference

Running a 671B MoE model does not mean running a 671B dense model. The memory footprint is large — all expert weights must reside in VRAM or be paged. But the compute per token is bounded by the active parameter count, not the total.
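The memory side of this can be estimated with one multiplication. The sketch below computes the weight footprint at a given precision and a lower bound on device count. The 80 GB device size is a hypothetical figure chosen for illustration, and the estimate ignores KV cache, activations, and framework overhead.

```python
import math


def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return total_params_b * bytes_per_param


def min_devices(total_params_b: float, bytes_per_param: float, device_gb: float) -> int:
    """Lower bound on device count; ignores KV cache and activations."""
    return math.ceil(weight_memory_gb(total_params_b, bytes_per_param) / device_gb)


# 671B total parameters at 1 byte (8-bit) and 2 bytes (16-bit) per parameter.
for bytes_per_param in (1, 2):
    gb = weight_memory_gb(671, bytes_per_param)
    n = min_devices(671, bytes_per_param, device_gb=80)  # hypothetical 80 GB device
    print(f"{bytes_per_param} byte(s)/param: {gb:.0f} GB of weights, at least {n} devices")
```

The total parameter count drives this number; the active parameter count does not appear in it at all.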

For serving, this creates a specific tradeoff: the binding constraint on an MoE deployment is memory capacity, not arithmetic. You need enough VRAM to hold the full weight set. But once you have it, the compute per token is comparable to that of a dense model the size of the active parameter count.

Expert parallelism — sharding experts across devices so each device holds a subset — is the standard serving strategy. The cost is communication: tokens must be routed across devices to reach their assigned experts. An all-to-all collective replaces the all-reduce used in tensor parallelism. On high-bandwidth interconnects, this is fine. On commodity networking, it becomes the bottleneck.
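The dispatch step that the all-to-all performs can be illustrated without any networking code. The sketch below assumes contiguous sharding (a fixed number of consecutive experts per device) and simply groups tokens into per-destination send buffers, counting how many must cross a device boundary. The function names and the toy assignment are hypothetical; a real system would hand these buffers to a collective communication library.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def device_of(expert_id: int, experts_per_device: int) -> int:
    """Contiguous sharding: the first E experts on device 0, the next E on device 1, ..."""
    return expert_id // experts_per_device


def plan_dispatch(
    assignments: Sequence[Tuple[int, int, int]], experts_per_device: int
) -> Tuple[Dict[Tuple[int, int], List[int]], int]:
    """Group tokens into (source device, destination device) send buffers.

    assignments holds (token_id, source_device, expert_id) triples.
    Returns the buffers and the number of token copies that cross devices.
    """
    buffers: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    cross_device = 0
    for token_id, src, expert_id in assignments:
        dst = device_of(expert_id, experts_per_device)
        buffers[(src, dst)].append(token_id)
        if dst != src:
            cross_device += 1
    return dict(buffers), cross_device


# Four tokens on two devices, eight experts sharded four per device.
buffers, crossing = plan_dispatch(
    [(0, 0, 1), (1, 0, 6), (2, 1, 2), (3, 1, 7)], experts_per_device=4
)
print(buffers, crossing)
```

Every entry whose source and destination differ is traffic on the interconnect, and the same volume flows back when expert outputs return to their originating devices.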

The Single Claim

Total parameters set a memory floor. Active parameters set the compute ceiling. If you are reasoning about inference cost for an MoE model and using total parameter count as your estimate, your estimate is wrong.

References

  • Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017. https://arxiv.org/abs/1701.06538
  • Jiang, A. Q. et al. (2024). Mixtral of Experts. Mistral AI. https://arxiv.org/abs/2401.04088
  • Dai, D. et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ACL 2024. https://arxiv.org/abs/2401.06066
  • DeepSeek-AI (2024). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
  • Zoph, B. et al. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. Google Research. https://arxiv.org/abs/2202.08906
  • Zhou, Y. et al. (2022). Mixture-of-Experts with Expert Choice Routing. NeurIPS 2022. https://papers.neurips.cc/paper_files/paper/2022/file/2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf

Cite as

devinfo.dev. (2026). "Sparse Is Not Small." devinfo.dev:2026.0038. https://devinfo.dev/d/2026.0038

Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.