Arcee Trinity Large Breaks Cover

Arcee.ai unveils Trinity Large, a 400B-parameter Mixture-of-Experts model engineered for inference efficiency and enterprise long-context use, alongside smaller variants.

Feb 22 at 9:20 PM · 2 min read

Arcee.ai has unveiled its Trinity family of open-weight Mixture-of-Experts (MoE) language models, highlighted by the flagship Arcee Trinity Large. This new generation of LLMs emphasizes inference-time efficiency and long-context capabilities, targeting enterprise deployments with a focus on auditability and data provenance.

Trinity Models: Scale and Efficiency

The Trinity lineup includes Trinity Nano (6B total parameters, 1B activated per token), Trinity Mini (26B total, 3B activated), and the formidable Trinity Large (400B total, 13B activated). These models feature a modern architecture that combines interleaved local and global attention, gated attention, and a depth-scaled sandwich norm. All models were trained using the Muon optimizer, achieving zero loss spikes throughout their extensive pre-training.
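The sparsity that drives the family's inference efficiency is easiest to see by computing the fraction of weights active per token from the figures above (a quick worked example, using only numbers stated in this article):

```python
# Active-parameter fraction per token for each Trinity variant,
# using the total/activated counts (in billions) given in the article.
models = {
    "Trinity Nano":  {"total_b": 6,   "active_b": 1},
    "Trinity Mini":  {"total_b": 26,  "active_b": 3},
    "Trinity Large": {"total_b": 400, "active_b": 13},
}

for name, p in models.items():
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B weights active per token ({frac:.1%})")
```

Trinity Large activates only about 3% of its weights per token, which is why a 400B model can be positioned as inference-efficient.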

Trinity Nano and Mini were each trained on 10 trillion tokens, while Trinity Large was pre-trained on 17 trillion. Arcee.ai has published the model checkpoints on Hugging Face, underscoring its commitment to open-weight foundations.

Architectural Innovations

Key to the Trinity family's design is a highly sparse Mixture-of-Experts layer. Trinity Large introduces Soft-clamped Momentum Expert Bias Updates (SMEBU), a novel load balancing strategy designed to mitigate router instability during training. This approach replaces traditional sign-based updates with a tanh soft-clamped, momentum-smoothed mechanism, allowing for more precise convergence and enhanced stability.
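Arcee.ai has not published the exact update rule, but the contrast between sign-based balancing and a tanh soft-clamped, momentum-smoothed variant can be sketched as follows. All function names, the step size, and the momentum coefficient here are illustrative assumptions, not the actual SMEBU formula:

```python
import math

def sign_step(err, step=1e-3):
    # Sign-based balancing: a fixed-size nudge toward the load target,
    # regardless of how small the error is -- prone to oscillation near balance.
    return step * (1 if err > 0 else -1 if err < 0 else 0)

def smebu_step(err, m, step=1e-3, beta=0.9):
    # Assumed SMEBU-style step: tanh soft-clamps the load error and momentum
    # smooths it, so updates shrink smoothly as an expert approaches its target.
    m = beta * m + (1 - beta) * math.tanh(err)
    return step * m, m

# Toy run: 4 experts, expert 0 overloaded (load 0.40 vs. uniform target 0.25).
loads, target = [0.40, 0.20, 0.20, 0.20], 0.25
bias, mom = [0.0] * 4, [0.0] * 4
for i, load in enumerate(loads):
    delta, mom[i] = smebu_step(target - load, mom[i])
    bias[i] += delta
# The overloaded expert's router bias moves down; underloaded experts' move up.
```

The qualitative point is the one the article makes: a smooth, history-aware update converges toward balanced routing more gently than a fixed sign flip.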

The models also employ a custom 200,000-token BPE tokenizer optimized for numerical and multilingual text. Its pretokenization pipeline isolates digits for place-aligned chunking and applies script-aware isolation for scripts such as CJK and Thai, aiming for better compression and arithmetic performance.
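Place-aligned digit chunking can be illustrated with a small sketch. The three-digit chunk size and right alignment are assumptions about what "place-aligned" means here, not confirmed details of the Trinity tokenizer:

```python
import re

def chunk_digits(text, size=3):
    # Split each digit run into right-aligned chunks (chunk size assumed),
    # so the same decimal place values always land in the same chunk position:
    # "1234567" -> "1 234 567".
    def split_run(m):
        s = m.group()
        head = len(s) % size
        chunks = ([s[:head]] if head else []) + [s[i:i + size] for i in range(head, len(s), size)]
        return " ".join(chunks)
    return re.sub(r"\d+", split_run, text)

print(chunk_digits("pi to 7 digits: 3141592"))  # -> "pi to 7 digits: 3 141 592"
```

Right alignment matters for arithmetic: with left-aligned chunks, "1234567" and "234567" would share no chunk boundaries, whereas here the ones, tens, and hundreds places always co-occur.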

Pre-training and Data Strategy

DatologyAI curated the extensive pre-training data, which included 8 trillion tokens of synthetic web, code, and STEM data. Training followed a multi-phase strategy that progressively shifted toward higher-quality, domain-specific content, emphasizing programming, STEM, and reasoning alongside broad multilingual coverage.

To address potential intra-batch correlation during training, Arcee.ai implemented the Random Sequential Document Buffer (RSDB). This method aims to stabilize training by reducing domain biases in minibatches, a critical factor as models scale and become more data-efficient.
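Arcee.ai has not detailed the RSDB algorithm, but the described effect resembles a classic shuffle buffer: documents enter in corpus order and leave in random order, so consecutive emitted documents rarely come from the same contiguous (often same-domain) region. The function name, buffer size, and mechanics below are assumptions for illustration:

```python
import random

def rsdb(doc_stream, buffer_size=8, seed=0):
    # Illustrative shuffle-buffer sketch (details assumed, not Arcee's code):
    # fill a buffer from the sequential stream, then emit random picks while
    # refilling, decorrelating neighbors within each minibatch.
    rng = random.Random(seed)
    buf = []
    for doc in doc_stream:
        buf.append(doc)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain the buffer at end of stream
        yield buf.pop(rng.randrange(len(buf)))

# Toy corpus: two contiguous single-domain blocks that would otherwise
# dominate consecutive minibatches.
docs = [f"web_{i}" for i in range(6)] + [f"code_{i}" for i in range(6)]
mixed = list(rsdb(docs))
```

A larger buffer gives stronger decorrelation at the cost of memory, the usual trade-off for streaming shufflers.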