LLMs Learn to Play Tic-Tac-Toe with Reinforcement Learning

Stefano Fiorucci discusses the power of reinforcement learning for training LLMs, showcasing Tic-Tac-Toe as a case study for building interactive environments and improving model capabilities.

Stefano Fiorucci, an AI/Software Engineer and Explorer known for his work on open-source AI orchestration at deepset, recently presented a compelling case for leveraging reinforcement learning (RL) in training large language models (LLMs). Fiorucci highlighted the limitations of traditional pre-training and supervised fine-tuning methods, emphasizing the need for LLMs to interact with environments to develop more robust reasoning and problem-solving capabilities.

LLMs Learn to Play Tic-Tac-Toe with Reinforcement Learning — from AI Engineer

The core idea is "letting LLMs wander" in well-defined environments. Models learn through trial and error, receiving rewards or penalties based on their actions and the resulting states. This is fundamentally different from supervised fine-tuning, which relies on curated datasets of prompt-response pairs.

Reinforcement Learning Fundamentals

Fiorucci began by outlining the basic RL loop: an agent (the LLM) interacts with an environment. The agent observes the current state and takes an action; the environment returns a reward and transitions to a new state. The agent's goal is to maximize its cumulative reward over time, which requires balancing exploration of new strategies with exploitation of known successful ones.
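The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the `TwoArmedBandit` environment, its payout probabilities, and the epsilon-greedy agent are all assumptions chosen to keep the example small. The agent here is a simple value table rather than an LLM, but the observe, act, reward, update cycle is the same.

```python
import random


class TwoArmedBandit:
    """Hypothetical toy environment: action 1 pays off more often than action 0."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.payout_prob = [0.2, 0.8]  # assumed values, for illustration only

    def reset(self):
        return 0  # a single dummy state

    def step(self, action):
        reward = 1.0 if self.rng.random() < self.payout_prob[action] else 0.0
        return 0, reward  # (next state, reward)


def run_episode(env, steps=500, epsilon=0.1, seed=1):
    """Run the observe -> act -> reward -> update loop with an epsilon-greedy agent."""
    rng = random.Random(seed)
    values = [0.0, 0.0]  # running estimate of each action's average reward
    counts = [0, 0]
    state = env.reset()
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)  # explore: try a random action
        else:
            action = max(range(2), key=lambda a: values[a])  # exploit the best estimate
        state, reward = env.step(action)
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # incremental mean
        total += reward
    return total, values, counts


if __name__ == "__main__":
    total, values, counts = run_episode(TwoArmedBandit())
    print("cumulative reward:", total)
    print("value estimates:", values, "action counts:", counts)
```

The `epsilon` parameter controls the trade-off: a higher value means more exploration, a lower value means the agent sticks with whatever currently looks best.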

He contrasted this with classic LLM training, which typically involves three phases: pre-training on a large corpus of text to build a foundational model, supervised fine-tuning (SFT) on instruction-completion pairs to align the model with user intent, and preference optimization to further refine the model's behavior based on human feedback.

The Rise of RL Environments for LLMs

Fiorucci noted that the era of pure pre-training is showing its limits, particularly as data availability becomes a bottleneck. While SFT has been successful in improving instruction-following capabilities, RL offers a more dynamic learning mechanism. He cited Andrej Karpathy's view that building diverse RL environments is the highest-leverage activity for eliciting sophisticated cognitive strategies from LLMs.

The talk cited DeepSeek-V3.2 as an example of pushing the frontier with RL: its training involved generating over 1,800 distinct environments and 85,000 complex prompts to strengthen reasoning and tool-use capabilities. In this setting, the model learns from interaction, explores possibilities, and refines its strategies based on feedback.

DeepSeek-R1: Incentivizing Reasoning via RL

The DeepSeek-R1 paper, which Fiorucci referenced, focuses on incentivizing reasoning capabilities in LLMs through RL. It utilizes techniques like Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO). RLVR, in particular, aims to provide verifiable, objective rewards by comparing the model's output against ground truth, a crucial step in ensuring reliable learning.
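The "group relative" part of GRPO can be illustrated with a short sketch. For each prompt, several responses are sampled and scored, and each response's advantage is its reward normalized against the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for this example; the policy-gradient update that consumes these advantages is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    `rewards` holds the scores of several responses sampled for the SAME prompt.
    `eps` is an assumed small constant that avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt: two verified correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # If all responses score the same, no response is preferred over another.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Responses that beat their group average get a positive advantage and are reinforced; those below average are discouraged. When every response in a group earns the same reward, the advantages are all zero and that group contributes no learning signal.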

Fiorucci explained RLVR using a Tic-Tac-Toe example. In this setup, the LLM is prompted to play the game. Its response, including the chosen move, is evaluated by a deterministic verifier that compares it against the ground truth. This comparison yields a reward, which the RL optimizer then uses to update the model's policy. Compared with human preference data, this method provides more objective and scalable reward signals.
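A deterministic verifier of this kind might look like the sketch below. It is an illustration under stated assumptions, not the verifier shown in the talk: the `<move>N</move>` response format, the 0 to 8 cell numbering, the `best_moves` ground-truth set, and the 0.2 partial credit for a legal but non-optimal move are all hypothetical choices.

```python
import re

# Assumed response format: the model ends its answer with <move>N</move>, N in 0..8.
MOVE_PATTERN = re.compile(r"<move>\s*([0-8])\s*</move>")


def parse_move(response):
    """Return the last well-formed move in the response, or None if there is none."""
    matches = MOVE_PATTERN.findall(response)
    return int(matches[-1]) if matches else None


def verify(response, board, best_moves):
    """Score a response against the ground truth with no human in the loop.

    board: list of 9 cells, each "X", "O", or " ".
    best_moves: set of cell indices considered correct for this position.
    """
    move = parse_move(response)
    if move is None:
        return 0.0  # unparseable output earns nothing
    if board[move] != " ":
        return 0.0  # illegal move: the cell is already taken
    return 1.0 if move in best_moves else 0.2  # 0.2 is an arbitrary partial credit


if __name__ == "__main__":
    # X holds cells 0 and 1, O holds 3 and 4; X to move, and cell 2 wins.
    board = ["X", "X", " ", "O", "O", " ", " ", " ", " "]
    print(verify("Completing the top row. <move>2</move>", board, {2}))
    print(verify("<move>8</move>", board, {2}))
    print(verify("<move>0</move>", board, {2}))
```

Because the same response and position always yield the same score, the reward can be computed automatically for as many rollouts as training needs.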

Verifiers: Environments as Software Artifacts

To facilitate the creation of these RL environments, Fiorucci highlighted the "Verifiers" toolkit, an open-source project designed to build RL environments for LLMs. Verifiers treats environments as software artifacts, allowing for modularity and reusability. Key features include:

  • Evaluation and training capabilities
  • Environments packaged as Python libraries
  • Pre-built environment types
  • Parsing and reward handling
  • Compatibility with OpenAI-compatible APIs
  • Asynchronous and parallel execution
  • Support for various training frameworks like PRIME-RL, Tinker, SkyRL, and others

This toolkit abstracts away much of the complexity, enabling developers to focus on the core task and reward logic. Fiorucci demonstrated a single-turn environment for reversing text, where the model's output is evaluated based on its similarity to the reversed target text.
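The reward logic for a text-reversal task can be sketched without any framework. The version below is a stand-in that uses Python's standard `difflib.SequenceMatcher` as the similarity measure; the function name and the choice of metric are assumptions, and the environment shown in the talk may score similarity differently.

```python
from difflib import SequenceMatcher


def reverse_text_reward(original, completion):
    """Return a similarity score in [0, 1] between the completion and the reversed text."""
    target = original[::-1]  # ground truth: the input string reversed
    return SequenceMatcher(None, target, completion).ratio()


if __name__ == "__main__":
    print(reverse_text_reward("hello world", "dlrow olleh"))  # exact reversal
    print(reverse_text_reward("hello world", "dlrow olle"))   # near miss, partial credit
```

A graded score like this gives the optimizer a smoother signal than a strict pass/fail check, since a nearly correct reversal earns more than an unrelated string.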

Tic-Tac-Toe Environment and Training

The presentation then dove into a practical example of training an LLM to play Tic-Tac-Toe. Fiorucci detailed the implementation of a multi-turn environment for the game, emphasizing its suitability for RL training due to its simple rules, deterministic solution, and the inherent challenge it poses to LLMs in managing multi-turn interactions and varying opponent responses.

He showed the initial implementation of the Tic-Tac-Toe environment, including the `setup_state` and `env_response` functions that manage the game's state and generate the environment's replies. Training started with synthetic data generation for SFT, followed by group-based RL training that used seeds and stratified sampling to manage noise and improve stability.
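To make the roles of these two functions concrete, here is a framework-free sketch of a multi-turn Tic-Tac-Toe environment. It borrows only the names `setup_state` and `env_response` from the talk; the signatures, the state dictionary, the random opponent, and the way the seed is used are assumptions for illustration and do not reproduce the Verifiers API or Fiorucci's code.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def render(board):
    """Render the board as three text rows for the next prompt turn."""
    return "\n".join("|".join(board[i:i + 3]) for i in (0, 3, 6))


class TicTacToeEnv:
    """Hypothetical multi-turn environment: the model plays X, the environment plays O."""

    def setup_state(self, seed):
        # A per-game seed makes the opponent's random moves reproducible.
        return {"board": [" "] * 9, "rng": random.Random(seed),
                "done": False, "result": None}

    def env_response(self, state, move):
        """Apply the model's move, let the opponent reply, and return the new board text."""
        board = state["board"]
        if not isinstance(move, int) or not 0 <= move < 9 or board[move] != " ":
            state["done"], state["result"] = True, "invalid"
            return "Invalid move. Game over."
        board[move] = "X"
        if not self._finish(state):
            empty = [i for i, cell in enumerate(board) if cell == " "]
            board[state["rng"].choice(empty)] = "O"  # random opponent
            self._finish(state)
        return render(board)

    def _finish(self, state):
        """Mark the game as finished if someone has won or the board is full."""
        w = winner(state["board"])
        if w is not None:
            state["done"], state["result"] = True, "win" if w == "X" else "loss"
        elif " " not in state["board"]:
            state["done"], state["result"] = True, "draw"
        return state["done"]


if __name__ == "__main__":
    env = TicTacToeEnv()
    state = env.setup_state(seed=0)
    print(env.env_response(state, 4))  # model takes the center, opponent replies
```

In a real training run, the text returned by `env_response` would be appended to the conversation as the next turn, and `state["result"]` would feed the reward function once the game ends.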

The evaluation results showed a clear improvement over the base model. The fine-tuned model achieved much better results against both random and optimal opponents (against optimal play, the best possible outcome in Tic-Tac-Toe is a draw) and adhered almost perfectly to the required output format, indicating the effectiveness of the RL training methodology.
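An evaluation of this kind can be sketched as a small harness that plays many games and tallies outcomes and format adherence. Everything here is an assumption for illustration: the `<move>N</move>` format, the `policy` callable standing in for the model, and the minimax opponent are hypothetical, and the numbers it prints are not the results from the talk.

```python
import random
import re
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
MOVE_PATTERN = re.compile(r"<move>\s*([0-8])\s*</move>")  # assumed output format


def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, move) for `player` to move; score is from X's perspective."""
    w = winner(board)
    if w:
        return (1 if w == "X" else -1), None
    if " " not in board:
        return 0, None
    other = "O" if player == "X" else "X"
    results = [(minimax(board[:i] + player + board[i + 1:], other)[0], i)
               for i, cell in enumerate(board) if cell == " "]
    return max(results) if player == "X" else min(results)


def play_game(policy, opponent, rng):
    """Play one game with the policy as X. Returns (outcome, format_ok)."""
    board = " " * 9
    while True:
        match = MOVE_PATTERN.search(policy(board))
        if match is None:
            return "loss", False  # output did not follow the required format
        i = int(match.group(1))
        if board[i] != " ":
            return "loss", True  # well-formed but illegal move
        board = board[:i] + "X" + board[i + 1:]
        if winner(board) or " " not in board:
            break
        if opponent == "optimal":
            j = minimax(board, "O")[1]
        else:
            j = rng.choice([k for k, cell in enumerate(board) if cell == " "])
        board = board[:j] + "O" + board[j + 1:]
        if winner(board) or " " not in board:
            break
    w = winner(board)
    return ("win" if w == "X" else "loss" if w == "O" else "draw"), True


def evaluate(policy, opponent, games=100, seed=0):
    """Return win/draw/loss rates and the share of games with valid formatting."""
    rng = random.Random(seed)
    tally = {"win": 0, "draw": 0, "loss": 0}
    formatted = 0
    for _ in range(games):
        outcome, format_ok = play_game(policy, opponent, rng)
        tally[outcome] += 1
        formatted += format_ok
    rates = {k: v / games for k, v in tally.items()}
    rates["format_rate"] = formatted / games
    return rates


if __name__ == "__main__":
    # Stand-in for a trained model: a policy that always plays the minimax move.
    def perfect(board):
        return f"<move>{minimax(board, 'X')[1]}</move>"

    print("vs random: ", evaluate(perfect, "random"))
    print("vs optimal:", evaluate(perfect, "optimal"))
```

Reporting draws separately matters for the optimal opponent, where wins are impossible and the useful signal is how often the model avoids losing.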

Looking Ahead

Fiorucci concluded by emphasizing the potential of RL environments for developing LLMs with more sophisticated reasoning and strategic capabilities. Tools like Verifiers and platforms like the Environments Hub are crucial for democratizing this research and accelerating progress in the field.

© 2026 StartupHub.ai. All rights reserved.