Modern Mixture-of-Experts (MoE) architectures impose a rigid, per-layer structure for expert allocation, so expert-parameter growth is coupled to model depth. This design assumes each layer needs its own distinct expert capacity, a notion challenged by recent analyses: replacing learned routing in deep layers with random expert assignments yields only minimal accuracy drops, pointing to substantial redundancy. This inefficiency is the core problem addressed by the UniPool MoE architecture, as detailed in research from Huang, Shi, Zheng, Wu, Chen, et al.
From Layered to Pooled Expertise
The UniPool MoE architecture fundamentally redefines expert capacity management. Instead of each transformer layer owning its dedicated set of experts, UniPool consolidates expert resources into a single, shared global pool. Independent per-layer routers then access this unified pool. This architectural shift decouples the growth of expert parameters from model depth, allowing for a more flexible and efficient distribution of computational resources. To ensure stable and balanced training within this shared framework, UniPool introduces a pool-level auxiliary loss designed to equalize expert utilization across the entire pool. Complementing this, NormRouter is employed to facilitate sparse and scale-stable routing to the shared experts.
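To make the structure concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. It shows a single shared expert pool, per-layer routers over that pool, and a pool-level load-balancing auxiliary loss. NormRouter is approximated here by L2-normalizing token features and router weights before scoring, and the auxiliary loss uses a Switch-Transformer-style formulation; both are assumptions, and names such as `SharedExpertPool`, `NormRouterLayer`, and `pool_level_aux_loss` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExpertPool(nn.Module):
    """A single global pool of FFN experts shared by every transformer layer."""

    def __init__(self, num_experts: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        return self.experts[expert_idx](x)


class NormRouterLayer(nn.Module):
    """Per-layer router into the shared pool.

    NormRouter is sketched here as cosine-style scoring: both token features and
    router weights are L2-normalized, keeping routing logits scale-stable
    (an assumption about the mechanism, not the paper's exact definition).
    """

    def __init__(self, pool: SharedExpertPool, d_model: int, top_k: int = 2):
        super().__init__()
        self.pool = pool
        self.top_k = top_k
        self.router_weight = nn.Parameter(torch.randn(len(pool.experts), d_model) * 0.02)

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model); logits: (tokens, num_experts)
        logits = F.normalize(x, dim=-1) @ F.normalize(self.router_weight, dim=-1).t()
        probs = logits.softmax(dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)  # sparse top-k routing
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # dispatch tokens to chosen experts
            for e in top_i[:, k].unique():
                mask = top_i[:, k] == e
                out[mask] += top_p[mask, k, None] * self.pool(x[mask], int(e))
        return out, probs, top_i


def pool_level_aux_loss(all_probs, all_top_i, num_experts: int) -> torch.Tensor:
    """Load-balancing loss over the whole pool, aggregating every layer's
    routing decisions (a Switch-style formulation, assumed for illustration)."""
    probs = torch.cat(all_probs, dim=0)                     # (total tokens, E)
    idx = torch.cat([i.reshape(-1) for i in all_top_i])     # flattened expert assignments
    frac_tokens = torch.bincount(idx, minlength=num_experts).float() / idx.numel()
    frac_probs = probs.mean(dim=0)
    return num_experts * (frac_tokens * frac_probs).sum()
```

Usage follows the decoupling described above: the routers are instantiated once per layer, while the expert pool is instantiated once for the whole model, e.g. `pool = SharedExpertPool(64, 512, 2048)` followed by `layers = [NormRouterLayer(pool, 512) for _ in range(12)]`, with `pool_level_aux_loss` added to the training objective to equalize utilization across the shared pool.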