Yoko Li's TetrisBench project, unveiled February 23, 2026, began as a simple curiosity: pitting LLMs against humans in Tetris. The objective wasn't merely to win, but to observe how systems that reason fundamentally differently from humans would approach a familiar optimization challenge. With its structured state and clear, turn-based mechanics, Tetris presented an ideal environment for evaluating reasoning rather than perception.
The Initial Flop
Early attempts with frontier models like Opus 4.5, GPT 5.2, and Gemini 3 Flash proved disastrous. Feeding board states directly to the models as JSON yielded inconsistent, nonsensical, and often self-destructive moves, and high latency further hampered gameplay. Language models, untrained for step-by-step spatial planning over an evolving state, struggled with this direct-reasoning approach.
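To make the direct-input approach concrete, here is a minimal sketch of what a JSON board-state prompt might look like. TetrisBench's actual schema isn't published, so the field names (`board`, `current_piece`, `next_pieces`) and the grid convention (20 rows by 10 columns, 0 for empty) are illustrative assumptions.

```python
import json

# Hypothetical board-state payload; field names and layout are assumptions,
# not TetrisBench's real schema.
board = [[0] * 10 for _ in range(20)]        # 20x10 grid, row 0 = top, 0 = empty
board[19] = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # partially filled bottom row

state = {
    "board": board,
    "current_piece": "T",
    "next_pieces": ["L", "I", "S"],
}

prompt = (
    "You are playing Tetris. Given this JSON state, reply with a move.\n"
    + json.dumps(state)
)
```

Even with a clean encoding like this, the model must mentally simulate gravity, collisions, and line clears on every turn, which is exactly the kind of step-by-step spatial bookkeeping these models handle poorly.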
The Code-First Breakthrough
The pivot came from reframing Tetris as a coding problem. Instead of selecting moves directly, the LLM's task became generating a deterministic scoring function. That function evaluates every legal placement and selects the best one, while the model decides when to rewrite its logic based on the board state. This shift turned Tetris into a stable, functional optimization loop.
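The loop above can be sketched as follows. To stay short, pieces are simplified to flat horizontal bars dropped straight down; the real benchmark uses full tetrominoes with rotations, and the evaluator weights here are illustrative stand-ins for whatever function the model would actually generate.

```python
# Minimal sketch of the code-first loop: a deterministic evaluator scores
# every legal placement, and the best-scoring one is played.
WIDTH, HEIGHT = 10, 20  # row 0 = top of the board

def drop(board, col, piece_width):
    """Drop a flat bar into columns col..col+piece_width-1; None if no room."""
    landing = HEIGHT - 1
    for y in range(HEIGHT):
        if any(board[y][col + i] for i in range(piece_width)):
            landing = y - 1
            break
    if landing < 0:
        return None
    new = [row[:] for row in board]
    for i in range(piece_width):
        new[landing][col + i] = 1
    return new

def score(board):
    """Deterministic evaluator: prefer low, flat stacks with no holes.
    The weights are illustrative, not the model-generated ones."""
    heights = [
        next((HEIGHT - y for y in range(HEIGHT) if board[y][x]), 0)
        for x in range(WIDTH)
    ]
    holes = sum(
        1
        for x in range(WIDTH)
        for y in range(HEIGHT)
        if not board[y][x] and any(board[yy][x] for yy in range(y))
    )
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return -0.51 * sum(heights) - 1.0 * holes - 0.18 * bumpiness

def best_move(board, piece_width):
    """Evaluate all legal placements and return the highest-scoring board."""
    candidates = [
        drop(board, col, piece_width)
        for col in range(WIDTH - piece_width + 1)
    ]
    return max((b for b in candidates if b is not None), key=score)
```

The key design property is that only `score` ever needs to change: the LLM writes and occasionally rewrites the evaluator, while the surrounding placement search stays fixed, deterministic, and fast.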
From Play to Benchmark
With a reliable system, TetrisBench scaled to extensive model-vs-model matches. Hundreds of games, run with identical piece sequences and constraints, revealed more than just win rates. Analysis uncovered profound differences in long-horizon optimization, with models exhibiting distinct playstyles.
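Identical piece sequences across paired games can be produced with a seeded randomizer. The sketch below uses the 7-bag scheme (each permutation of the seven tetrominoes is dealt before repeating), which is the standard guideline randomizer; the source doesn't state which randomizer TetrisBench actually uses, so this is an assumption.

```python
import random

def piece_sequence(seed, n):
    """Deterministic 7-bag piece sequence: shuffle all seven tetrominoes,
    deal them out, then reshuffle. Same seed -> identical sequence."""
    rng = random.Random(seed)
    pieces = []
    while len(pieces) < n:
        bag = list("IJLOSTZ")
        rng.shuffle(bag)
        pieces.extend(bag)
    return pieces[:n]
```

Running both models in a match against the same seed removes piece-sequence luck as a confound, so any difference in outcomes reflects strategy rather than the draw.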
Some models played aggressively, prioritizing early line clears at the cost of taller stacks. Others were notably conservative, maintaining flatter boards for survivability. These implicit optimization horizons emerged without explicit prompting. Gemini 3 Pro, for instance, achieved a 62.0% win rate with minimal interventions (1.22 updates/game), while Gemini 3 Flash, with a 60.3% win rate, intervened far more frequently (2.68/game). This highlighted how intervention frequency and average points per move could distinguish local optimizers from those implicitly reasoning over longer horizons. The project suggests these behavioral patterns can inform LLM strategy evaluation.
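The metrics quoted above (win rate, updates per game, points per move) are straightforward to aggregate from per-game records. The log format below is invented for illustration; TetrisBench's real schema isn't shown in the source, and the sample values are placeholders.

```python
from statistics import mean

# Hypothetical per-game records; the field names and values are illustrative.
games = [
    {"winner": "A", "updates": {"A": 1, "B": 3},
     "points_per_move": {"A": 4.2, "B": 3.9}},
    {"winner": "B", "updates": {"A": 2, "B": 2},
     "points_per_move": {"A": 3.1, "B": 4.0}},
]

def summarize(games, model):
    """Aggregate one model's win rate, intervention frequency, and scoring."""
    wins = sum(g["winner"] == model for g in games)
    return {
        "win_rate": wins / len(games),
        "updates_per_game": mean(g["updates"][model] for g in games),
        "avg_points_per_move": mean(g["points_per_move"][model] for g in games),
    }
```

Comparing `updates_per_game` against `avg_points_per_move` across models is what lets this kind of analysis separate frequent local re-optimizers from models that commit to a longer-horizon evaluator and leave it alone.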
Humans vs. LLMs: The Edge Cases
Head-to-head human-LLM contests underscored fundamental strategic divergences. Models produced meticulously clean early boards but struggled when conditions deviated from their optimized heuristics. Humans, conversely, embraced "controlled irregularity," making locally suboptimal moves to maintain long-term flexibility and recover from messy states.
While LLMs consistently outperform the average human player, top Tetris players still defeat frontier models. This gap isn't about reaction time; it emerges in "off-distribution" scenarios, such as unusual board states or awkward piece sequences, where human intuition excels at forcing the game into regimes beyond the AI's programmed logic. Competitive player TAFOKINTS demonstrated this, beating Claude Opus 4.5 by building boards with significant bumpiness but few holes, a "controlled chaos" that effectively broke the AI's evaluation functions.
Key Insights from TetrisBench
The TetrisBench project clarified how data representation fundamentally shapes what is measured, shifting the focus from perception to strategic planning. It demonstrated that an LLM's optimization horizon is an observable behavior, not merely a prompt response. Furthermore, an LLM's decision to intervene and reconsider its strategy is a distinct and meaningful form of reasoning. This simple, classic game proved sufficient to reveal profound insights into how AI systems reason and adapt over time.



