At Anthropic's Code w/ Claude event, David Hershey, a Member of the Technical Staff, rebooted a live stream of an AI playing a video game. With a countdown from the audience, "Claude Plays Pokémon" went live again, but this was more than just a playful gimmick. It was a carefully chosen demonstration of fundamental improvements in how AI models use tools to interact with the world.
Hershey presented the project to showcase advancements in Anthropic's latest models. At its core, the demonstration illustrated how new capabilities in planning and action-taking are turning large language models into more effective autonomous agents. The simple, objective-driven world of Pokémon provided a surprisingly clear benchmark for these complex capabilities.
A key advancement is the model's ability to perform "extended thinking between tool calls." Previously, models would generate a complete, often rigid plan before executing the first action. Now, Claude can plan a step, act, observe the result, and then reflect on that outcome to inform its next move. This agentic loop is crucial for long-horizon tasks. Hershey showed an example where older models would get confused by the simple name-entry screen in Pokémon, failing to understand how the cursor wrapped around the grid. The new model, however, can pause after a failed move, analyze the pattern of its previous actions, and deduce the correct logic to proceed.
This is a subtle but critical step toward more robust agents.
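To make that loop concrete, here is a minimal sketch of the act-observe-reflect cycle written against the Anthropic Messages API in Python. The press_button tool, the emulator stub, the model id, the thinking budget, and the interleaved-thinking beta flag are illustrative assumptions, not details taken from the demo.

```python
import anthropic

# Hypothetical emulator hook: in the real demo this would press a Game Boy
# button and return what the screen shows afterwards.
def press_button_in_emulator(button: str) -> str:
    return f"Pressed '{button}'. The cursor is now on the letter grid."

client = anthropic.Anthropic()

tools = [{
    "name": "press_button",
    "description": "Press a Game Boy button and report the resulting screen state.",
    "input_schema": {
        "type": "object",
        "properties": {
            "button": {"type": "string",
                       "enum": ["a", "b", "up", "down", "left", "right", "start"]},
        },
        "required": ["button"],
    },
}]

messages = [{"role": "user",
             "content": "Enter the name CLAUDE on the name-entry screen."}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",                    # assumed model id
        max_tokens=2048,
        thinking={"type": "enabled", "budget_tokens": 1024},  # extended thinking
        extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},  # beta flag name assumed
        tools=tools,
        messages=messages,
    )
    # Keep the assistant turn (including its thinking blocks) in the history
    # so the next turn can reflect on what it just tried.
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # the model considers the task finished

    # Act, observe, and hand the observation back for the next round of thinking.
    results = []
    for block in response.content:
        if block.type == "tool_use":
            observation = press_button_in_emulator(block.input["button"])
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": observation})
    messages.append({"role": "user", "content": results})
```

Each pass through the loop hands the model a fresh observation to think over before its next action, which is what lets it notice, for instance, that the cursor wraps around the name-entry grid.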
The second major improvement is efficiency through parallel tool calling. In the past, a model could only call one tool at a time, a limitation Hershey described as "frustratingly bad." If an agent needed to perform an action and then update its internal memory, it would require two separate, sequential calls to the model, increasing latency and cost. With the new models, Claude can call multiple tools at once. In the context of the game, this means it can press a button to advance dialogue while simultaneously calling another tool to update its knowledge base, making the agent faster and more efficient.
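A rough sketch of how parallel tool calling changes that loop: a single assistant turn can contain several tool_use blocks, and all of their results go back in one follow-up message. The tool names (press_button, update_notes), their stub implementations, and the model id below are hypothetical.

```python
import anthropic

# Hypothetical stand-ins for the game and the agent's memory; only the
# handling of multiple tool_use blocks reflects the actual Messages API shape.
def press_button(button: str) -> str:
    return f"Pressed '{button}'; the dialogue advanced."

def update_notes(entry: str) -> str:
    return f"Saved note: {entry}"

TOOL_IMPLS = {"press_button": press_button, "update_notes": update_notes}

TOOLS = [
    {"name": "press_button",
     "description": "Press a Game Boy button.",
     "input_schema": {"type": "object",
                      "properties": {"button": {"type": "string"}},
                      "required": ["button"]}},
    {"name": "update_notes",
     "description": "Append an entry to the agent's long-term notes.",
     "input_schema": {"type": "object",
                      "properties": {"entry": {"type": "string"}},
                      "required": ["entry"]}},
]

def handle_turn(client: anthropic.Anthropic, messages: list) -> None:
    """Run one model turn and answer every tool call it made in a single reply."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    # One assistant turn may request several tools at once, e.g. pressing "a"
    # to advance dialogue while also writing a note to memory.
    results = [
        {"type": "tool_result",
         "tool_use_id": block.id,
         "content": TOOL_IMPLS[block.name](**block.input)}
        for block in response.content
        if block.type == "tool_use"
    ]
    if results:
        # Every observation goes back in one user message, so the button press
        # and the memory update share a single round trip to the model.
        messages.append({"role": "user", "content": results})
```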
These technical upgrades directly address the core challenge of building useful agents. Hershey noted that "tool use is the driver of agents," and these improvements in planning and efficiency are what allow models to move beyond simple Q&A. By demonstrating these skills in a game, Anthropic provided a tangible look at how its models are becoming better equipped to tackle complex, multi-step problems in the real world.

