Stefano Fiorucci, an AI/Software Engineer and Explorer known for his work on open-source AI orchestration at deepset, recently made the case for using reinforcement learning (RL) to train large language models (LLMs). He highlighted the limitations of traditional pre-training and supervised fine-tuning, arguing that LLMs need to interact with environments to develop more robust reasoning and problem-solving capabilities.
The core idea is "letting LLMs wander" in well-defined environments. Models learn through trial and error, receiving rewards or penalties based on their actions and the resulting states. This differs fundamentally from supervised fine-tuning, which relies on curated datasets of prompt-response pairs.
Reinforcement Learning Fundamentals
Fiorucci began by outlining the basic RL loop: an agent (the LLM) interacts with an environment. The agent observes the current state, takes an action, receives a reward from the environment, and transitions to a new state. The goal of the agent is to maximize its cumulative reward over time, a process that balances exploration of new strategies with exploitation of known successful ones.
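The loop described above can be illustrated with a deliberately small sketch. It is not code from the talk and involves no LLM: the `CorridorEnv` environment, the reward values, and the hyperparameters are all hypothetical, and tabular Q-learning with epsilon-greedy action selection is used only as a stand-in learning rule that makes the observe, act, reward, transition cycle and the exploration/exploitation trade-off concrete.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left; position is clamped to the corridor.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Assumed reward scheme: +1 at the goal, small penalty for every other step.
        reward = 1.0 if done else -0.01
        return self.state, reward, done


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so the untrained agent does not favor one action.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    env = CorridorEnv()
    # One row of action-value estimates per state: [left, right].
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()                                   # observe the initial state
        for _ in range(50):                                   # cap on episode length
            action = choose_action(q[state], epsilon, rng)    # take an action
            next_state, reward, done = env.step(action)       # receive reward, transition
            # Q-learning update toward reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    q_table = train()
    for s, (left, right) in enumerate(q_table[:-1]):
        print(f"state {s}: left={left:.3f} right={right:.3f}")
```

In this sketch the agent starts with no knowledge of the corridor and, through repeated episodes, comes to value moving right more than moving left in every non-goal cell. An LLM setting replaces the table with model parameters and the corridor with a task environment, but the interaction cycle has the same shape.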
He contrasted this with classic LLM training, which typically involves three phases: pre-training on a large corpus of text to build a foundational model, supervised fine-tuning (SFT) on instruction-completion pairs to align the model with user intent, and preference optimization to further refine the model's behavior based on human feedback.
