Modern Mixture-of-Experts (MoE) architectures impose a rigid, per-layer structure for expert allocation, so expert-parameter growth is coupled to model depth. This design assumes each layer needs its own distinct expert capacity, a notion challenged by recent analyses: replacing learned routing in deep layers with random expert assignments yields only minimal accuracy drops, pointing to substantial redundancy. This inefficiency is the core problem addressed by the UniPool MoE architecture, as detailed in research from Huang, Shi, Zheng, Wu, Chen, et al.
From Layered to Pooled Expertise
The UniPool MoE architecture fundamentally redefines expert capacity management. Instead of each transformer layer owning its dedicated set of experts, UniPool consolidates expert resources into a single, shared global pool. Independent per-layer routers then access this unified pool. This architectural shift decouples the growth of expert parameters from model depth, allowing for a more flexible and efficient distribution of computational resources. To ensure stable and balanced training within this shared framework, UniPool introduces a pool-level auxiliary loss designed to equalize expert utilization across the entire pool. Complementing this, NormRouter is employed to facilitate sparse and scale-stable routing to the shared experts.
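To make the structure concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. It shows a single shared expert pool, per-layer routers over that pool, and a pool-level load-balancing auxiliary loss. NormRouter is approximated here by L2-normalizing token features and router weights before scoring, and the auxiliary loss uses a Switch-Transformer-style formulation; both are assumptions, and names such as `SharedExpertPool`, `NormRouterLayer`, and `pool_level_aux_loss` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExpertPool(nn.Module):
    """A single global pool of FFN experts shared by every transformer layer."""

    def __init__(self, num_experts: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        return self.experts[expert_idx](x)


class NormRouterLayer(nn.Module):
    """Per-layer router into the shared pool.

    NormRouter is sketched here as cosine-style scoring: both token features and
    router weights are L2-normalized, keeping routing logits scale-stable
    (an assumption about the mechanism, not the paper's exact definition).
    """

    def __init__(self, pool: SharedExpertPool, d_model: int, top_k: int = 2):
        super().__init__()
        self.pool = pool
        self.top_k = top_k
        self.router_weight = nn.Parameter(torch.randn(len(pool.experts), d_model) * 0.02)

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model); logits: (tokens, num_experts)
        logits = F.normalize(x, dim=-1) @ F.normalize(self.router_weight, dim=-1).t()
        probs = logits.softmax(dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)  # sparse top-k routing
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # dispatch tokens to chosen experts
            for e in top_i[:, k].unique():
                mask = top_i[:, k] == e
                out[mask] += top_p[mask, k, None] * self.pool(x[mask], int(e))
        return out, probs, top_i


def pool_level_aux_loss(all_probs, all_top_i, num_experts: int) -> torch.Tensor:
    """Load-balancing loss over the whole pool, aggregating every layer's
    routing decisions (a Switch-style formulation, assumed for illustration)."""
    probs = torch.cat(all_probs, dim=0)                     # (total tokens, E)
    idx = torch.cat([i.reshape(-1) for i in all_top_i])     # flattened expert assignments
    frac_tokens = torch.bincount(idx, minlength=num_experts).float() / idx.numel()
    frac_probs = probs.mean(dim=0)
    return num_experts * (frac_tokens * frac_probs).sum()
```

Usage follows the decoupling described above: the routers are instantiated once per layer, while the expert pool is instantiated once for the whole model, e.g. `pool = SharedExpertPool(64, 512, 2048)` followed by `layers = [NormRouterLayer(pool, 512) for _ in range(12)]`, with `pool_level_aux_loss` added to the training objective to equalize utilization across the shared pool.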