The drive for more capable Large Language Models (LLMs) often hits a wall: the data required to train advanced reasoning skills is difficult to acquire. Existing methods rely on extensive supervised fine-tuning (SFT) and reinforcement learning (RL) with high-quality reasoning data. However, progress is hindered by challenges such as the 'cold-start' problem (a lack of initial reasoning examples), limited domain coverage (mostly math), and the sheer cost and difficulty of human annotation for complex tasks. To address these issues, researchers have introduced CHIMERA, a novel synthetic dataset aimed at enabling generalizable, cross-domain reasoning in LLMs.
The CHIMERA Approach
CHIMERA tackles the data scarcity problem head-on with a compact yet comprehensive synthetic dataset of 9,000 samples, generated by leveraging powerful existing LLMs to synthesize the complex reasoning trajectories needed for training. The dataset is built around three key properties.

First, it provides rich, long Chain-of-Thought (CoT) reasoning trajectories, crucial for learning and replicating multi-step problem-solving processes. Monitorable CoT traces also figure prominently in discussions of AI safety, since they expose how a model arrives at its answers.

Second, CHIMERA offers broad, structured domain coverage, spanning 8 major scientific disciplines and over 1,000 fine-grained topics organized hierarchically. This structure pushes the model to reason across a wider spectrum of knowledge than typical datasets cover.

Third, the dataset is filtered through a fully automated, scalable evaluation pipeline: strong reasoning models cross-validate both the validity of each problem and the correctness of its generated answer, ensuring data quality without human bottlenecks and making the approach scalable.
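The hierarchical topic organization described above can be sketched as a nested mapping from disciplines to subfields to fine-grained topics. This is a minimal illustration, not the dataset's actual taxonomy; all discipline and topic names below are invented placeholders.

```python
# Hypothetical sketch of a hierarchical topic taxonomy: disciplines map to
# subfields, which map to lists of fine-grained leaf topics. All names here
# are illustrative placeholders, not CHIMERA's real taxonomy.
taxonomy = {
    "Physics": {
        "Mechanics": ["projectile motion", "rigid-body dynamics"],
        "Thermodynamics": ["entropy", "heat engines"],
    },
    "Biology": {
        "Genetics": ["Mendelian inheritance", "gene regulation"],
    },
}

def iter_topics(tree, path=()):
    """Yield a (discipline, ..., topic) path for every leaf topic."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from iter_topics(value, path + (key,))
        else:
            for topic in value:
                yield path + (key, topic)

all_topics = list(iter_topics(taxonomy))
print(len(all_topics))  # → 6 leaf topics in this toy taxonomy
```

In a structure like this, sampling problems uniformly over leaf paths rather than over top-level disciplines is one simple way to keep fine-grained topics evenly represented.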
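The automated cross-validation step can be sketched as a majority-vote filter over several judge models. This is a hedged sketch: `make_judge`, `cross_validate`, the toy heuristics, and the majority-agreement rule are all assumptions standing in for real calls to strong reasoning models, not CHIMERA's actual pipeline.

```python
# Hypothetical sketch of the automated quality-filtering pipeline: several
# "judge" models check each sample, and a sample is kept only if a majority
# of judges approve both problem validity and answer correctness. The judge
# functions below are stubs standing in for real LLM calls.

def make_judge(lenient):
    """Return a stub judge; a real judge would query a reasoning model."""
    def judge(sample):
        valid = len(sample["problem"]) > 10      # toy validity heuristic
        correct = sample["answer"] is not None   # toy correctness heuristic
        return valid and (correct or lenient)
    return judge

def cross_validate(samples, judges, threshold=0.5):
    """Keep samples approved by more than `threshold` of the judges."""
    kept = []
    for sample in samples:
        votes = sum(judge(sample) for judge in judges)
        if votes / len(judges) > threshold:
            kept.append(sample)
    return kept

samples = [
    {"problem": "Derive the escape velocity from Earth's surface.",
     "answer": "11.2 km/s"},
    {"problem": "??", "answer": None},  # malformed sample, should be dropped
]
judges = [make_judge(lenient=False), make_judge(lenient=False),
          make_judge(lenient=True)]
print(len(cross_validate(samples, judges)))  # → 1 sample survives the filter
```

The design choice worth noting is that validity and correctness are judged jointly per sample, so a well-posed problem with a wrong answer is discarded just like a malformed problem.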