Arcee Trinity Large Breaks Cover

Arcee.ai unveils Trinity Large, a 400B-parameter Mixture-of-Experts model engineered for inference efficiency and enterprise long-context use, alongside smaller variants.

Arcee.ai has unveiled its Trinity family of open-weight Mixture-of-Experts (MoE) language models, highlighted by the flagship Arcee Trinity Large. This new generation of LLMs emphasizes inference-time efficiency and long-context capabilities, targeting enterprise deployments with a focus on auditability and data provenance.

Trinity Models: Scale and Efficiency

The Trinity lineup includes Trinity Nano (6B total parameters, 1B activated per token), Trinity Mini (26B total, 3B activated), and Trinity Large (400B total, 13B activated). The models share an architecture that combines interleaved local and global attention, gated attention, and a depth-scaled sandwich norm. All three were trained with the Muon optimizer, and Arcee.ai reports zero loss spikes throughout pre-training.
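The gap between total and activated parameters is what drives the inference-efficiency claim. As a quick illustration, the snippet below computes the share of parameters used per token from the counts quoted above. The dictionary and function names are made up for this example, and the counts are the rounded figures from the article, not exact model sizes.

```python
# Total and per-token activated parameter counts in billions,
# taken from the rounded figures quoted in the article.
TRINITY = {
    "Trinity Nano": (6, 1),
    "Trinity Mini": (26, 3),
    "Trinity Large": (400, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the fraction of parameters used for any single token."""
    return active_b / total_b


for name, (total_b, active_b) in TRINITY.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

The share shrinks as the models grow: Trinity Large is the sparsest of the three, activating roughly one parameter in thirty per token.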

Trinity Nano and Mini were each pre-trained on 10 trillion tokens, while Trinity Large was pre-trained on 17 trillion tokens. Arcee.ai has made the model checkpoints publicly available on Hugging Face, underscoring its commitment to open-weight foundations.

Architectural Innovations

Key to the Trinity family's design is a highly sparse Mixture-of-Experts layer. Trinity Large introduces Soft-clamped Momentum Expert Bias Updates (SMEBU), a load-balancing strategy designed to mitigate router instability during training. Where traditional approaches nudge each expert's routing bias by a fixed-size, sign-based step, SMEBU applies a tanh soft-clamped, momentum-smoothed update, so corrections shrink as expert load approaches balance. The stated goal is more precise convergence and greater stability.
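To make the contrast concrete, here is a minimal sketch of a sign-based bias update next to a soft-clamped, momentum-smoothed one. The article gives only the ingredients (tanh, momentum, bias updates), so the exact formula, the relative-error normalization, and the hyperparameters `step`, `beta`, and `kappa` are all assumptions for illustration and should not be read as Arcee.ai's actual SMEBU definition.

```python
import math


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def sign_update(bias, load, target, step=0.01):
    """Baseline: move each expert bias by a fixed step toward balance."""
    return [b + step * sign(target - l) for b, l in zip(bias, load)]


def smebu_update(bias, momentum, load, target,
                 step=0.01, beta=0.9, kappa=1.0):
    """Hypothetical soft-clamped, momentum-smoothed bias update.

    Assumed form (not taken from the source):
      err = (target - load) / target      relative load error
      m   = beta * m + (1 - beta) * tanh(kappa * err)
      b   = b + step * m
    """
    new_bias, new_momentum = [], []
    for b, m, l in zip(bias, momentum, load):
        err = (target - l) / target
        clamped = math.tanh(kappa * err)       # bounded in (-1, 1)
        m = beta * m + (1.0 - beta) * clamped  # exponential moving average
        new_momentum.append(m)
        new_bias.append(b + step * m)
    return new_bias, new_momentum


# Four experts; a balanced router would send 25% of tokens to each.
load = [0.40, 0.30, 0.20, 0.10]
bias, momentum = smebu_update([0.0] * 4, [0.0] * 4, load, target=0.25)
print("sign-based:", sign_update([0.0] * 4, load, target=0.25))
print("soft-clamped:", bias)
```

In this sketch the sign-based rule moves every unbalanced expert by the full step regardless of how far off it is, while the soft-clamped rule scales the correction with the error and smooths it over time, which is the behavior the article attributes to SMEBU.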

The models also employ a custom 200,000-token BPE tokenizer, optimized for numerical and multilingual text. Its pretokenization pipeline isolates digits for place-aligned chunking and applies script-aware isolation to scripts such as CJK and Thai, with the aim of improving compression and arithmetic performance.
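One way to picture place-aligned digit chunking is to split each run of digits into fixed-width groups counted from the right, so that ones, thousands, and millions always land in the same position within a chunk. The sketch below does that with a group width of three. The width, the right alignment, and the function names are assumptions for illustration, since the article does not give the tokenizer's actual rules, and script-aware isolation is not covered here.

```python
import re

# ASCII digit runs only; other numeral systems are out of scope for this sketch.
DIGIT_RUN = re.compile(r"[0-9]+")


def chunk_digits(run: str, width: int = 3) -> list[str]:
    """Split a digit run into groups of `width`, aligned from the right."""
    head = len(run) % width
    chunks = [run[:head]] if head else []
    chunks += [run[i:i + width] for i in range(head, len(run), width)]
    return chunks


def pretokenize(text: str, width: int = 3) -> list[str]:
    """Isolate digit runs from surrounding text and chunk them by place."""
    pieces, pos = [], 0
    for match in DIGIT_RUN.finditer(text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])  # non-digit text before the run
        pieces.extend(chunk_digits(match.group(), width))
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])  # trailing non-digit text
    return pieces


print(pretokenize("cost 1234567 USD"))
```

With right alignment, "1234567" and "234567" share the chunks "234" and "567", which is the kind of consistency that could plausibly help a model with arithmetic.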

Pre-training and Data Strategy

DatologyAI curated the pre-training data, which included 8 trillion tokens of synthetic web, code, and STEM data. Training proceeded in multiple phases that progressively shifted the mix towards higher-quality, domain-specific content, emphasizing programming, STEM, and reasoning skills, alongside broad multilingual coverage.

To address potential intra-batch correlation during training, Arcee.ai implemented the Random Sequential Document Buffer (RSDB). This method aims to stabilize training by reducing domain biases in minibatches, a critical factor as models scale and become more data-efficient.
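The article names RSDB without describing its mechanics, so the following is only one plausible reading of the name: keep a small buffer of open documents, pick one at random for each chunk, and read that document sequentially from where it left off. The function name `rsdb_chunks` and the parameters `buffer_size`, `chunk_len`, and `seed` are hypothetical, and Arcee.ai's actual implementation may differ substantially.

```python
import random


def rsdb_chunks(docs, buffer_size, chunk_len, seed=0):
    """Yield token chunks by randomly interleaving sequential document reads.

    Assumed behavior (not taken from the source): up to `buffer_size`
    documents are open at once; each step picks one at random and emits
    its next `chunk_len` tokens, so neighboring chunks in a batch tend
    to come from different documents.
    """
    rng = random.Random(seed)
    stream = iter(docs)
    buffer = []  # each entry is [tokens, read_offset]

    def refill():
        while len(buffer) < buffer_size:
            doc = next(stream, None)
            if doc is None:   # stream exhausted
                return
            if doc:           # skip empty documents
                buffer.append([doc, 0])

    refill()
    while buffer:
        slot = rng.randrange(len(buffer))
        tokens, offset = buffer[slot]
        yield tokens[offset:offset + chunk_len]
        offset += chunk_len
        if offset >= len(tokens):
            buffer.pop(slot)  # document finished; open a new one
            refill()
        else:
            buffer[slot][1] = offset


# Toy token IDs standing in for three documents.
docs = [[1, 2, 3, 4, 5], [10, 11], [20, 21, 22]]
for chunk in rsdb_chunks(docs, buffer_size=2, chunk_len=2):
    print(chunk)
```

Each document's tokens still appear in their original order, but consecutive chunks are drawn from a random mix of open documents rather than from one long document at a time.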

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.