The dominance of Mixture-of-Experts (MoE) in massive language models has left its potential for sub-billion parameter, on-device deployments largely untapped. This gap is now being addressed by MobileMoE, a new family of on-device LLMs that push the boundaries of efficiency and performance on mobile hardware.
On-Device MoE Scaling Laws Unlock Efficiency
The researchers formulated a novel on-device MoE scaling law to jointly optimize MoE architectures under strict mobile memory and compute constraints. This analysis identified a 'sweet spot' characterized by moderate sparsity, fine-grained experts, and shared experts. This configuration proves to be simultaneously memory-optimal and compute-optimal, which is what makes it practical for mobile deployment. The resulting architectures, trained through a four-stage recipe on open-source data, showcase the power of this tailored approach.
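To make the terms concrete, here is a minimal sketch of how a layer with shared and routed experts might combine their outputs for a single token. It is an illustration under stated assumptions, not the MobileMoE implementation: the function names `route_token` and `moe_output`, the top-k softmax gating, and the use of scalar toy experts are all hypothetical choices made for readability. In this picture, sparsity is the fraction of routed experts that fire per token (`top_k` out of the total), and "fine-grained" means using many small routed experts rather than a few large ones.

```python
import math

def route_token(router_logits, top_k):
    """Pick the top_k routed experts for one token and normalize their gate weights.

    Assumption: softmax is taken over the selected experts only; real systems differ.
    """
    # Indices of the top_k largest logits (ties broken by lower index).
    order = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    chosen = order[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

def moe_output(x, router_logits, routed_experts, shared_experts, top_k):
    """Combine always-on shared experts with a sparse, gated subset of routed experts."""
    gates = route_token(router_logits, top_k)
    # Shared experts process every token, regardless of the router's decision.
    out = sum(expert(x) for expert in shared_experts)
    # Only the selected routed experts run; the rest cost no compute for this token.
    out += sum(weight * routed_experts[i](x) for i, weight in gates.items())
    return out

# Toy example: four routed experts (expert i multiplies by i) and one shared expert.
routed = [lambda x, k=k: k * x for k in range(4)]
shared = [lambda x: x + 1]
result = moe_output(2.0, [0.1, 2.0, -1.0, 2.0], routed, shared, top_k=2)
```

With these toy values, experts 1 and 3 are selected with equal gate weight, so only half of the routed experts contribute compute for this token while all four still occupy memory. That split between what must be stored and what must be computed is the trade-off a joint memory and compute analysis has to balance.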
Surpassing Dense and Sparse Baselines in Performance
Across 14 benchmarks, MobileMoE models demonstrate remarkable capabilities. They not only match or exceed leading on-device dense LLMs but do so with 2-4$\times$ fewer inference FLOPs. Furthermore, they rival or surpass the state-of-the-art MoE OLMoE-1B-7B, achieving this with up to 60% fewer parameters. This performance leap validates the MobileMoE LLM architecture as a superior choice for resource-constrained environments. The team's work, detailed on arXiv, also provides the first efficient MoE inference framework for commodity smartphones, including comprehensive on-device profiling.
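As a rough way to read the FLOPs claim, the sketch below uses the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token, ignoring attention and other overheads. The parameter counts are hypothetical placeholders chosen for illustration, not figures from the MobileMoE work, and the helper names are made up for this example.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token (approximation)."""
    return 2 * active_params

def flops_reduction(dense_params, moe_active_params):
    """How many times fewer FLOPs the MoE needs per token than the dense model."""
    return forward_flops_per_token(dense_params) / forward_flops_per_token(moe_active_params)

# Hypothetical sizes: a 1B-parameter dense model vs. an MoE activating 250M parameters per token.
ratio = flops_reduction(1_000_000_000, 250_000_000)
print(f"Estimated reduction: {ratio:.1f}x fewer FLOPs per token")
```

Under this approximation, the reduction depends only on the ratio of active parameters, which is why an MoE can hold more total parameters than a dense model while spending less compute per token.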