Microsoft's AsgardBench Tests AI's Planning Skills

Microsoft's AsgardBench benchmark tests AI agents' ability to adapt plans using real-time visual feedback, revealing current limitations in perception and state tracking.

Microsoft Research

Microsoft Research has unveiled AsgardBench, a new benchmark designed to rigorously test the ability of AI agents to plan and adapt tasks based on visual input. This development addresses a critical gap in evaluating embodied AI, which requires agents to interact with and understand their environment.

Unlike previous benchmarks that often bundle perception, navigation, and control, AsgardBench isolates the crucial aspect of visually grounded interactive planning. It challenges AI agents to adjust their actions in simulated household tasks when visual observations contradict their initial assumptions. This is vital for creating robots and AI systems capable of navigating the unpredictable real world.

The benchmark, detailed in the paper "AsgardBench — Evaluating Visually Grounded Interactive Planning Under Minimal Feedback," presents AI agents with 108 controlled task instances across 12 categories. The core idea is simple: give an agent a goal, let it see the environment, and observe if it can revise its plan when reality doesn't match expectations. For example, an agent tasked with cleaning a mug might find it already clean or the sink already full, requiring a plan modification.
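A task instance of this kind might be represented as a goal paired with a perturbed initial state that the agent must discover visually. The sketch below is purely illustrative; the field names and schema are assumptions, not AsgardBench's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    """Hypothetical task-instance schema for illustration only."""
    category: str             # one of the 12 task categories
    goal: str                 # natural-language instruction given to the agent
    # Perturbation the agent must notice visually, e.g. the mug is
    # already clean or the sink is already full.
    scene_overrides: dict = field(default_factory=dict)

task = TaskInstance(
    category="clean_object",
    goal="Clean the mug and put it away.",
    scene_overrides={"mug_dirty": False},  # reality contradicts the default assumption
)
```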

Focus on Adaptation, Not Just Execution

Built upon the AI2-THOR simulation environment, AsgardBench provides agents with a limited action set (find, pickup, put, clean, toggle_on/off) and visual input. At each step, the agent proposes a full plan, but only the first action is executed. This forces continuous re-evaluation and adaptation, moving beyond static, pre-scripted behaviors. The focus is squarely on AI agent plan adaptation, not on basic navigation or object manipulation.
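The evaluation loop described above is a receding-horizon pattern: propose a full plan, execute only its first action, observe, and replan. A minimal sketch, using hypothetical `Agent` and `Env` interfaces rather than the real AsgardBench API:

```python
# Action vocabulary as described for the benchmark.
ACTIONS = {"find", "pickup", "put", "clean", "toggle_on", "toggle_off"}

def run_episode(agent, env, max_steps=20):
    """Receding-horizon loop: the agent replans from each fresh observation."""
    obs = env.reset()
    for _ in range(max_steps):
        plan = agent.propose_plan(obs)    # a full plan is proposed every step
        if not plan:
            return False                  # agent has given up
        action, target = plan[0]          # only the first action is executed
        if action not in ACTIONS:
            continue                      # illegal action: step wasted, state unchanged
        obs, done = env.step(action, target)
        if done:
            return True
    return False

# Tiny stub world for illustration: the mug turns out to be already clean,
# so a well-adapted agent drops the 'clean' step and goes straight to 'put'.
class StubEnv:
    def reset(self):
        return {"mug_dirty": False}
    def step(self, action, target):
        return {"mug_dirty": False}, action == "put"

class StubAgent:
    def propose_plan(self, obs):
        if obs["mug_dirty"]:
            return [("clean", "mug"), ("put", "cabinet")]
        return [("put", "cabinet")]
```

The key design point is that the agent's remaining plan is thrown away each step; nothing carries over except what the agent itself re-derives from the new observation.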

This approach means that identical instructions can necessitate different action sequences depending on the observed state of objects and the environment. The benchmark emphasizes the agent's capacity to notice subtle visual cues, like whether a mug is dirty or a faucet is running, and to maintain task context across multiple steps.
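One way to picture how a single instruction branches on observed state: a planner for "clean the mug" that skips or inserts steps based on what the agent sees. The observation keys here are hypothetical, chosen only to mirror the mug and faucet examples above.

```python
def plan_clean_mug(obs):
    """Sketch: the same instruction yields different plans per observed state."""
    plan = [("find", "mug"), ("pickup", "mug")]
    if obs.get("mug_dirty", True):              # skip cleaning if already clean
        if not obs.get("faucet_on", False):
            plan.append(("toggle_on", "faucet"))
        plan.append(("clean", "mug"))
        plan.append(("toggle_off", "faucet"))
    plan.append(("put", "sink"))
    return plan
```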

Visual Input Proves Crucial

Initial testing on AsgardBench revealed that leading vision-capable models significantly outperform text-only agents. Success rates more than doubled when visual input was provided, underscoring the necessity of perception for effective planning. While detailed textual feedback improved performance, it could mask underlying issues that visual grounding directly addresses.

The tests also pinpointed recurring failure patterns: agents attempted impossible actions, entered repetitive loops, misidentified subtle visual states (clean/dirty, on/off), and lost track of their progress. These shortcomings highlight key areas for improvement: finer visual discrimination in cluttered scenes, more robust state tracking across steps, and better integration of visual perception into real-time plan revision.
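The repetitive-loop failure mode in particular admits a simple safeguard on the harness or agent side: detect when recent steps replay an earlier cycle and force a replan. A minimal sketch of such a check (not part of AsgardBench itself):

```python
def detect_loop(history, window=4):
    """Return True if the last `window` steps exactly repeat the
    `window` steps before them, i.e. the agent is cycling."""
    if len(history) < 2 * window:
        return False
    return history[-window:] == history[-2 * window:-window]
```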

AsgardBench serves as both a diagnostic tool and a development accelerator for embodied AI benchmark research. By varying feedback levels, researchers can isolate performance bottlenecks. Future advancements will likely focus on systems with stronger visual understanding, improved state management, and training methods that prioritize mid-task plan repair. The open-source nature of AsgardBench on GitHub promises to accelerate progress in this vital area of AI research.
