AI engineering glossary

What is Mixture of Experts (MoE) in LLMs?

Mixture of Experts (MoE) is a neural network architecture where each forward pass routes tokens through a small subset of specialized 'expert' subnetworks rather than the entire model, keeping total parameter count high (giving model capacity) while activating only a fraction of parameters per token (saving inference compute). It is used in production by Mixtral, DeepSeek-V2/V3, GPT-4 (rumored), and others.

Last updated 2026-04-28 | BearPlex AI Engineering Team

Overview

Mixture of Experts (MoE) is one of the most consequential architectural shifts in production LLMs since the original Transformer. The idea is decades old, but it became practically important when Mixtral 8x7B (December 2023) demonstrated production-viable MoE with open weights, followed by DeepSeek-V2 and DeepSeek-V3 (both 2024), which pushed the scale further. The economic case is compelling: an MoE model with about 47 billion total parameters that activates only about 13 billion per token can match or exceed the quality of a dense 30-billion-parameter model at a much lower inference compute cost. This has made MoE the architecture of choice for several recent frontier and open-weight models.

How MoE works

In a standard dense Transformer, every token passes through every parameter at every layer. In an MoE Transformer, the dense feed-forward network in some or all layers is replaced by multiple parallel 'expert' networks plus a router. For each token at each MoE layer:

  • The router scores the experts and selects the most relevant ones for this token.
  • Only the selected experts process the token (for example, the top 2 of 8 experts in Mixtral 8x7B; some newer designs, such as DeepSeek-V3, route among many more, smaller experts).
  • The outputs of the selected experts are combined, typically as a sum weighted by the router's scores.

The total parameter count is high (the sum of all experts), but compute per token is much lower (only the selected experts run). The router is itself learned during training: it learns which experts to use for which kinds of tokens, which leads to implicit specialization.
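The routing steps can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `route_token` and `moe_layer` are hypothetical, the experts are toy functions, and normalizing the gate weights with a softmax over only the selected logits is one common choice among several.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts and return (expert_index, gate_weight) pairs."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: gate weights are a softmax over the selected logits only.
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))

def moe_layer(token, router_weights, experts, top_k=2):
    """Run one token through a toy MoE layer.

    router_weights holds one weight vector per expert; the router logit for
    an expert is the dot product of the token with that vector.
    """
    logits = [sum(t * w for t, w in zip(token, w_e)) for w_e in router_weights]
    out = [0.0] * len(token)
    for idx, gate in route_token(logits, top_k):
        expert_out = experts[idx](token)  # only selected experts are called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Toy setup: 4 experts that simply scale the token by different factors.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(moe_layer([2.0, 1.0], router_weights, experts, top_k=2))
```

With this toy router, the token `[2.0, 1.0]` produces logits `[2, 1, -2, -1]`, so only experts 0 and 1 are ever called; experts 2 and 3 cost nothing for this token, which is the source of the compute savings.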

MoE economics: parameters vs compute

MoE decouples two things that are coupled in dense models: total parameter count (model capacity) and compute per token (inference cost). Despite its name, Mixtral 8x7B is not eight independent 7B models: only the feed-forward blocks are replicated into 8 experts per layer, while attention and embedding parameters are shared, which is why the total is ~47B parameters rather than 56B. Each token is routed to 2 experts per layer, for ~13B active parameters per forward pass. Quality benchmarks place Mixtral 8x7B competitive with or above dense 30B models (Mistral reported it matching or outperforming Llama 2 70B on most benchmarks), while per-token compute approximates a dense 13B model. DeepSeek-V3 takes this further: 671B total parameters, only ~37B active per token. The trade-offs: all experts must be held in memory even though only a few run per token, and routing can become unbalanced (some experts run hot while others sit mostly idle). Production deployment requires expertise in MoE-aware serving; inference engines such as vLLM and TensorRT-LLM have added MoE support.
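The gap between total and active parameters can be checked with simple arithmetic. The sketch below is an approximation under stated assumptions: `moe_param_counts` is a hypothetical helper, it assumes a gated feed-forward block with three weight matrices per expert, grouped-query attention, and untied input/output embeddings, and it ignores small terms such as normalization weights. The configuration values are the ones commonly published for Mixtral 8x7B and should be treated as assumptions here.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k,
                     n_heads, n_kv_heads, head_dim, vocab_size):
    """Return (total_params, active_params_per_token), both approximate."""
    # One expert = gated FFN with gate, up, and down projections (assumption).
    expert = 3 * d_model * d_ff
    # Attention: query and output use n_heads, key and value use n_kv_heads.
    attn = d_model * head_dim * (2 * n_heads + 2 * n_kv_heads)
    # Router: one linear map from the hidden state to per-expert logits.
    router = d_model * n_experts
    # Input embedding plus output head, assumed untied.
    embed = 2 * vocab_size * d_model

    shared = n_layers * (attn + router) + embed      # used by every token
    total = shared + n_layers * n_experts * expert   # must be stored
    active = shared + n_layers * top_k * expert      # touched per token
    return total, active

# Assumed Mixtral-8x7B-like configuration.
total, active = moe_param_counts(
    d_model=4096, d_ff=14336, n_layers=32, n_experts=8, top_k=2,
    n_heads=32, n_kv_heads=8, head_dim=128, vocab_size=32000,
)
print(f"total  = {total / 1e9:.1f}B")   # about 46.7B
print(f"active = {active / 1e9:.1f}B")  # about 12.9B
```

Under these assumptions the shared (non-expert) parameters come to only about 1.6B, so nearly all of the model's capacity lives in the experts, and per-token cost is governed mainly by `top_k` rather than `n_experts`.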

MoE in frontier models

Several major models in 2024-2026 use MoE architectures. Mixtral 8x7B and 8x22B (Mistral, open weights) are widely used and benchmarked. DeepSeek-V2 and V3 (DeepSeek, open weights) demonstrated state-of-the-art quality at MoE scale. GPT-4 is widely rumored to use MoE, but OpenAI has not confirmed architecture details. Gemini 1.5 Pro is described in Google's technical report as a sparse MoE Transformer. Llama 4 (Meta, released April 2025) shipped as MoE models (Scout and Maverick). The trend is clear: at frontier scale, MoE is increasingly the default architecture because the inference economics are too good to ignore. For self-hosted or open-source-based production work, MoE models are now mainstream choices alongside dense models.

Use cases

  • Frontier-quality models with tractable inference cost
  • Self-hosted production deployments where dense model alternatives are too expensive to serve
  • Multi-tenant SaaS that needs high quality without per-query frontier-model cost
  • Domain-specialized applications where MoE's natural expert specialization aligns with task diversity
  • Cost-optimized batch workloads where MoE's active-parameter savings compound at scale

Examples in production

Mistral AI

Mixtral 8x7B (December 2023) and Mixtral 8x22B: production-quality open-weight MoE models that established MoE as a mainstream architecture choice.

DeepSeek

DeepSeek-V2 (2024) and DeepSeek-V3 (2024) pushed MoE scale further: DeepSeek-V3 is 671B total parameters with ~37B active per token, matching frontier model quality at lower inference cost.

Switch Transformer (Google, 2021)

Foundational MoE paper that demonstrated trillion-parameter MoE training; influenced the modern wave of production MoE models.

Mixture of Experts compared to alternatives

| Alternative | Choose Mixture of Experts when | Choose alternative when |
| --- | --- | --- |
| Dense Transformer: standard architecture where all parameters activate for every token | Use MoE at scale where inference cost matters more than memory cost | Use dense for smaller models or where memory is the binding constraint |
| Distillation: training a smaller dense model on outputs from a larger model | Use MoE for the high-capacity, low-active-compute trade-off | Use distillation when the total memory budget is tight |

Common pitfalls

  • Assuming MoE is always cheaper: memory requirements are based on total parameters, not active parameters
  • Naive serving without MoE-aware inference engines: load balancing and expert routing matter operationally
  • Underestimating training complexity: training MoE models requires balancing expert utilization, which is non-trivial
  • Comparing MoE active parameters to dense model parameters: they're not directly equivalent on all tasks
  • Choosing MoE for small-scale workloads where dense models would be simpler and adequate
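The expert-utilization pitfall is usually addressed during training with an auxiliary loss that penalizes uneven routing. The sketch below follows the formulation described in the Switch Transformer paper for top-1 routing: the number of experts times the sum, over experts, of the fraction of tokens sent to each expert multiplied by the mean router probability for that expert. The function name and the plain-list inputs are illustrative assumptions, and the small scaling coefficient normally applied to this term is left out.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-style auxiliary load-balancing loss for top-1 routing.

    router_probs: one probability list (length n_experts) per token.
    assignments:  index of the expert each token was dispatched to.
    Returns 1.0 for perfectly uniform routing and grows toward n_experts
    as routing collapses onto a single expert.
    """
    n_tokens = len(assignments)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [0.0] * n_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    # p[i]: mean router probability assigned to expert i.
    p = [sum(tok[i] for tok in router_probs) / n_tokens
         for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: every expert gets one token and equal probability.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)
# Collapsed routing: every token goes to expert 0 with probability 1.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], 4)
print(balanced, collapsed)  # 1.0 4.0
```

Adding a small multiple of this value to the training loss nudges the router toward spreading tokens across experts, at the cost of one more hyperparameter to tune.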

FAQ

Questions about Mixture of Experts.

Whether MoE is cheaper than a dense model depends on the bottleneck. On compute, yes: MoE models do less compute per token because only the active experts run. On memory, no: MoE models need memory for all experts even when only a few are active. So MoE wins on inference cost when compute is the bottleneck (most workloads), but it doesn't help, and can hurt, when memory capacity is the binding constraint.
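A back-of-the-envelope estimate makes the compute/memory split concrete. The sketch below rests on two rough assumptions: weight memory is total parameters times bytes per parameter (2 bytes for 16-bit weights), and forward-pass compute is about 2 FLOPs per active parameter per token. It ignores the KV cache, activations, and attention cost, and `serving_estimate` is a hypothetical helper rather than part of any serving library.

```python
def serving_estimate(total_params, active_params, bytes_per_param=2):
    """Return (weight memory in GB, forward GFLOPs per token), both rough."""
    weight_gb = total_params * bytes_per_param / 1e9  # all experts are stored
    gflops_per_token = 2 * active_params / 1e9        # only active ones run
    return weight_gb, gflops_per_token

# Parameter counts as quoted in this article.
models = {
    "Mixtral 8x7B (MoE)": (47e9, 13e9),
    "Dense 13B": (13e9, 13e9),
    "DeepSeek-V3 (MoE)": (671e9, 37e9),
}
for name, (total, active) in models.items():
    gb, gflops = serving_estimate(total, active)
    print(f"{name}: ~{gb:.0f} GB of weights, ~{gflops:.0f} GFLOPs per token")
```

Under these assumptions, Mixtral 8x7B costs about the same compute per token as a dense 13B model but needs roughly 94 GB for 16-bit weights instead of 26 GB, which is why GPU sizing for MoE is driven by total rather than active parameters.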

GPT-4's use of MoE is widely rumored but not officially confirmed by OpenAI. Leaks and hints from various sources suggest that GPT-4 uses MoE, but OpenAI hasn't published architecture details. Most informed observers treat GPT-4 as MoE in their analysis.

Self-hosting MoE is feasible with modern inference engines (vLLM and TensorRT-LLM both support MoE) but requires more expertise than dense model serving. Memory requirements are high (need to store all experts), GPU sizing is different, and load balancing across experts matters operationally. For most BearPlex clients, we recommend self-hosting MoE only when the cost advantage over managed inference services is substantial enough to justify the operational investment.

The outlook for MoE is continued growth at the frontier. The economic case is too strong to ignore: every additional active parameter at frontier scale adds meaningful inference cost, and MoE's separation of capacity from active compute is the main lever for getting more capacity per inference dollar. Expect frontier models from major labs to increasingly use MoE, and expect open-source models to follow.
