The latest release of OLMo 2 positions it as the leading fully open language model to date, delivering performance that rivals, and in some cases exceeds, that of open-weight models of comparable size.
With the introduction of its 7B- and 13B-parameter models, trained on up to 5 trillion tokens, Ai2 says OLMo 2 achieves state-of-the-art efficiency and performance across a range of benchmarks.
In evaluations, OLMo-2-7B outperforms Llama-3.1-8B, while OLMo-2-13B beats Qwen-2.5-7B, despite requiring fewer computational resources for training. These results place OLMo 2 on the Pareto frontier for open models, balancing computational efficiency with high benchmark scores.
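As a rough way to ground the efficiency comparison, training compute for dense transformers is often approximated as 6 · N · D FLOPs (N parameters, D training tokens). The sketch below applies that approximation with approximate public token counts (around 4T for OLMo-2-7B and 15T for Llama-3.1-8B); both the figures and the formula are illustrative assumptions rather than numbers reported in the OLMo 2 release.

```python
# Back-of-the-envelope training-compute comparison using the common
# FLOPs ~= 6 * N * D approximation (N = parameters, D = training tokens).
# Parameter/token counts below are approximate public figures, used only
# for illustration.

def train_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6 * params * tokens

budgets = {
    "OLMo-2-7B":    train_flops(7e9, 4e12),    # ~4T training tokens (assumed)
    "Llama-3.1-8B": train_flops(8e9, 15e12),   # ~15T training tokens (assumed)
}

for name, flops in budgets.items():
    print(f"{name}: ~{flops:.2e} training FLOPs")
```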
Key Results and Comparisons of OLMo 2
The OLMo 2 models excel across both familiar development benchmarks, such as ARC Challenge and HellaSwag, and unseen evaluation benchmarks, including AGIEval and GSM8K. Notably:
- OLMo-2-7B matches or surpasses larger models, proving highly efficient relative to training FLOPs.
- OLMo-2-13B, in its instruction-tuned form, outperforms competitors such as Qwen-2.5-14B on instruction-following and reasoning benchmarks.
By combining high performance with complete transparency—releasing weights, datasets, training code, and recipes—OLMo 2 continues the trend of narrowing the gap between open and proprietary models.
A Focus on Stability and Efficiency
The improved performance of OLMo 2 stems from iterative refinements to both the training and post-training processes:
- Stability in Long Training Runs: To address challenges like loss spikes that can degrade model quality, OLMo 2 introduces techniques for stabilizing gradients and maintaining consistent training progress (a minimal sketch of one such technique follows this list).
- Staged Curriculum Training: A two-stage pretraining approach begins with diverse, large-scale datasets like OLMo-Mix-1124 and shifts to curated high-quality domain-specific datasets in the second stage. This ensures strong generalization and domain expertise.
- Advanced Fine-Tuning Techniques: Using methods from the recently released Tülu 3 family, OLMo 2 incorporates supervised fine-tuning, preference modeling, and reinforcement learning to enhance instruction-following capabilities.
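One widely used stabilizer of this kind is a z-loss auxiliary term, which penalizes the softmax normalizer of the output logits so their magnitude cannot drift upward and trigger loss spikes, typically combined with global gradient-norm clipping. The snippet below is a minimal PyTorch sketch of that general idea, not OLMo 2's actual training code; the coefficient value is an assumed, typical setting.

```python
import torch
import torch.nn.functional as F

def loss_with_z_regularizer(logits, targets, z_coeff=1e-4):
    """Cross-entropy plus a z-loss term that keeps the softmax normalizer
    log(sum(exp(logits))) near zero, discouraging runaway logit magnitudes.
    z_coeff=1e-4 is a commonly cited value, assumed here for illustration."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    log_z = torch.logsumexp(logits, dim=-1)      # per-token softmax normalizer
    return ce + z_coeff * (log_z ** 2).mean()

# Typical use inside a training step, paired with gradient clipping:
#   loss = loss_with_z_regularizer(logits, targets)
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
```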
The Tülu 3 model family from Ai2 offers a fully open-source approach to instruction-following language models, sharing comprehensive data, code, and recipes. Designed to address limitations in transparency within post-training, Tülu 3 provides a scalable codebase for techniques like supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR), alongside a standardized evaluation suite.
Key innovations include careful data curation, a four-stage post-training process, and a focus on core capabilities like reasoning, mathematics, coding, and safety. Built on Llama 3.1 base models, Tülu 3 achieves state-of-the-art performance across various skill evaluations, rivaling both open and closed models of similar size.
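The RLVR idea, at a high level, is to replace a learned reward model with a programmatic check: a response earns reward only when its final answer can be verified against a known reference, as in math problems with exact answers. The sketch below illustrates that concept with a hypothetical numeric-answer checker; it is not Tülu 3's actual verifier code.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary reward in the RLVR style: 1.0 if the last number in the
    model's response matches the reference answer exactly, else 0.0.
    (Illustrative only; real verifiers are task-specific.)"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer.strip() else 0.0

print(verifiable_reward("Adding the two gives 42, so the answer is 42.", "42"))  # 1.0
print(verifiable_reward("I believe the answer is 41.", "42"))                    # 0.0
```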
Benchmark Performance and Fully Open Weights and Datasets
The OLMo 2 family demonstrates consistent improvements across a suite of 20 evaluation benchmarks, addressing core tasks like knowledge recall, reasoning, and mathematical problem-solving. Compared to earlier models, such as OLMo-0424, the gains are particularly noticeable in general reasoning and domain-specific benchmarks.
Beyond its performance, OLMo 2 stands out for its commitment to transparency. Unlike many open-weight models, which release only final checkpoints, OLMo 2 provides complete access to weights, datasets, intermediate checkpoints, and training recipes. This level of openness allows researchers and developers to fully inspect, replicate, and build on the work.
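In practice, that openness means the released checkpoints can be loaded directly from the Hugging Face Hub with the standard transformers API. The repository id below follows the naming used for the release (allenai/OLMo-2-1124-7B), but treat it as an assumption and confirm the exact id on Ai2's Hugging Face page; a recent transformers version is needed for the OLMo 2 architecture.

```python
# Minimal sketch: loading the openly released OLMo 2 weights and generating text.
# The repo id is assumed from the release naming; confirm it before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Fully open language models let researchers"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```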
With the release of OLMo 2, the open AI ecosystem takes another step forward, particularly in transparency.

