
A Mixture-of-Experts (MoE) is a neural network architecture that significantly increases a model's total parameter count without proportionally increasing its computational cost during inference.
Instead of running every input through all the parameters in the network (a dense architecture), an MoE model is composed of multiple specialized sub-networks called "experts." A routing mechanism, or gating network, evaluates each incoming token and dynamically sends it to only the most relevant expert(s) for processing.
Why It Matters
Training and serving massive AI models requires immense computational power. MoE allows labs to scale model capacity to hundreds of billions or even trillions of parameters while keeping per-token inference cost low. Because only a small fraction of the total parameters (the "active parameters") are used for any given token, an MoE model runs much faster and cheaper than a dense model of the same total size, although the full set of weights must still be held in memory.
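To see the gap concretely, here is a back-of-the-envelope calculation in Python. The parameter counts (expert size, shared size) are invented for illustration and do not describe any real model:

```python
# Hypothetical MoE configuration: 8 experts per layer, 2 active per token.
# Every number here is illustrative, not taken from a real model.
n_experts, top_k = 8, 2
expert_params = 15e9   # parameters per expert (assumed)
shared_params = 10e9   # attention, embeddings, etc., used by every token

total_params = shared_params + n_experts * expert_params   # what must be stored
active_params = shared_params + top_k * expert_params      # what each token pays for

print(f"total parameters:        {total_params / 1e9:.0f}B")           # 130B
print(f"active per token:        {active_params / 1e9:.0f}B")          # 40B
print(f"fraction used per token: {active_params / total_params:.0%}")  # 31%
```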
How It Works
In a standard Transformer, a single feed-forward network (FFN) processes every token. In an MoE architecture, that FFN is replaced by a set of experts (e.g., 8 independent FFNs) and a router. When a token arrives, the router computes a probability distribution over the experts and sends the token to the top-k of them (often just 2 out of 8). The outputs of the selected experts are then combined, weighted by the router's scores, to form the layer's output.
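To make the routing step concrete, here is a minimal sketch of such a layer in PyTorch. It is a simplified illustration, not the implementation of any particular model: the class and parameter names (MoELayer, n_experts, top_k) are invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """One MoE block: a router plus n_experts independent FFNs (a sketch)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each expert is an ordinary feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The gating network scores every expert for each token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (n_tokens, d_model) -- flatten batch and sequence dims first.
        scores = self.router(x)                            # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Which tokens picked expert e, and in which of their k slots.
            token_idx, slot_idx = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert received no tokens
            # Run only those tokens through the expert and mix by router weight.
            out[token_idx] += weights[token_idx, slot_idx, None] * expert(x[token_idx])
        return out

tokens = torch.randn(16, 512)    # a batch of 16 token vectors
print(MoELayer()(tokens).shape)  # torch.Size([16, 512])
```

The per-expert loop keeps the sketch readable; production implementations batch the tokens routed to each expert and typically add an auxiliary load-balancing loss so the router does not collapse onto a few experts.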
Example
Mixtral 8x7B, an open-weights model from Mistral AI, is a widely cited example of this architecture. Each of its Transformer layers contains 8 expert FFNs, and the router sends every token to 2 of them, so while the model has roughly 47 billion total parameters, only about 13 billion are active for any given token. This sparse routing gives it the capacity of a much larger model while keeping per-token inference cost close to that of a 13-billion-parameter dense model, making it practical to deploy on modest hardware.