Thinking Machines Lab has unveiled a significant advancement in large language model post-training, introducing on-policy distillation as a highly efficient method for developing specialized, high-performing smaller models. According to the announcement, this approach combines the direct learning benefits of on-policy training with the rich feedback of distillation, addressing critical limitations of existing methods. The innovation promises to make expert LLM capabilities more accessible and cost-effective for a wider range of applications, marking a pivotal shift in how developers can refine LLMs for specific tasks and domains.
Historically, post-training of LLMs has involved a trade-off between two primary approaches: reinforcement learning (RL) and off-policy distillation. RL, an on-policy method, lets models learn from their own generated trajectories, directly correcting their own mistakes. However, its sparse feedback, often just a single win/loss signal per episode, makes it notoriously inefficient for complex tasks such as multi-step reasoning. Conversely, off-policy distillation, typically supervised fine-tuning (SFT) on teacher-generated data, offers dense, token-level feedback. Its major flaw is that the student learns only from the contexts the teacher frequents, not from the divergent states the student itself encounters, leading to compounding errors and a tendency to imitate the teacher's style rather than its substance.
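To make that contrast concrete, the sketch below (an illustrative example with made-up tensors, not code from the announcement) compares the two kinds of learning signal: an RL-style loss where one scalar reward is spread over an entire episode, and a distillation-style loss where every token position is compared against the teacher's full next-token distribution. All names here (student_logits, teacher_logits, episode_reward) are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: student and teacher produce per-token logits over a
# shared vocabulary for the same short sequence.
vocab_size, seq_len = 32, 10
student_logits = torch.randn(seq_len, vocab_size, requires_grad=True)
teacher_logits = torch.randn(seq_len, vocab_size)
sampled_tokens = torch.randint(0, vocab_size, (seq_len,))

log_probs = F.log_softmax(student_logits, dim=-1)

# --- RL-style feedback: one sparse scalar per episode ---------------------
# The whole trajectory is judged by a single win/loss reward, so every token
# receives the same, undifferentiated learning signal (REINFORCE-style).
episode_reward = 1.0  # e.g. 1.0 for a correct final answer, 0.0 otherwise
token_log_probs = log_probs[torch.arange(seq_len), sampled_tokens]
rl_loss = -episode_reward * token_log_probs.sum()

# --- Distillation-style feedback: dense, per-token ------------------------
# Each position is compared against the teacher's full next-token
# distribution, so every token carries its own correction signal.
teacher_probs = F.softmax(teacher_logits, dim=-1)
distill_loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")

print(f"RL loss (one signal for {seq_len} tokens): {rl_loss.item():.3f}")
print(f"Distillation loss (per-token signal):      {distill_loss.item():.3f}")
```

The practical difference lies in where those losses are evaluated: off-policy distillation computes the dense loss on sequences the teacher generated, whereas an on-policy scheme would evaluate the teacher's feedback on tokens the student itself sampled.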
