Sparse Is Not Small
A model with 671 billion parameters can cost less compute per token to run than a 70-billion-parameter dense model. That is not a marketing claim; it is arithmetic. Mixture of Experts replaces a pass through every parameter with a routing decision that activates only a few experts for each token, and that routing decision is the cost model.
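The arithmetic can be sketched in a few lines. This is a rough estimate under stated assumptions, not a measurement: it uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and it assumes the 671-billion-parameter model activates 37 billion parameters per token. That active count is an illustrative figure that does not come from the text above; the function and constant names are likewise hypothetical.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs (multiply + add) per active parameter."""
    return 2.0 * active_params


TOTAL_MOE_PARAMS = 671e9           # total parameters, from the text
ASSUMED_ACTIVE_MOE_PARAMS = 37e9   # ASSUMPTION: parameters the router activates per token
DENSE_PARAMS = 70e9                # a dense model uses every parameter for every token

moe_flops = forward_flops_per_token(ASSUMED_ACTIVE_MOE_PARAMS)
dense_flops = forward_flops_per_token(DENSE_PARAMS)

active_fraction = ASSUMED_ACTIVE_MOE_PARAMS / TOTAL_MOE_PARAMS
cost_ratio = moe_flops / dense_flops

print(f"MoE active fraction:     {active_fraction:.1%}")
print(f"MoE FLOPs per token:     {moe_flops:.2e}")
print(f"Dense FLOPs per token:   {dense_flops:.2e}")
print(f"MoE / dense compute:     {cost_ratio:.2f}")
```

Under these assumptions the sparse model does roughly half the per-token compute of the dense one, despite having almost ten times as many parameters. The estimate covers compute only: every expert's weights still have to be stored somewhere, so memory scales with the total parameter count, not the active one.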