Microsoft Research has unveiled AsgardBench, a new benchmark designed to rigorously test how well AI agents plan tasks and adapt those plans based on visual input. This development addresses a critical gap in evaluating embodied AI, which requires agents to interact with and understand their environment.
Unlike previous benchmarks that often bundle perception, navigation, and control, AsgardBench isolates the crucial aspect of visually grounded interactive planning. It challenges AI agents to adjust their actions in simulated household tasks when visual observations contradict their initial assumptions. This is vital for creating robots and AI systems capable of navigating the unpredictable real world.
The benchmark, detailed in the paper "AsgardBench — Evaluating Visually Grounded Interactive Planning Under Minimal Feedback," presents AI agents with 108 controlled task instances across 12 categories. The core idea is simple: give an agent a goal, let it see the environment, and observe if it can revise its plan when reality doesn't match expectations. For example, an agent tasked with cleaning a mug might find it already clean or the sink already full, requiring a plan modification.
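To make the structure of such a task concrete, the sketch below shows one plausible way a single instance could be represented. The class and field names here are hypothetical illustrations, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """Illustrative container for one AsgardBench-style task instance.

    Field names are assumptions for this sketch; the real benchmark's format may differ.
    """
    category: str          # one of the 12 task categories, e.g. "clean_object"
    instruction: str       # natural-language goal given to the agent
    scene: str             # AI2-THOR scene identifier
    initial_state: dict    # object states the agent will actually observe
    surprise: str          # the mismatch between expectation and reality

# Example: the mug is already clean, so a literal "clean the mug" plan must be revised.
example = TaskInstance(
    category="clean_object",
    instruction="Clean the mug and put it in the cabinet.",
    scene="FloorPlan1",
    initial_state={"Mug": {"isDirty": False}, "Sink": {"isFilled": True}},
    surprise="mug already clean",
)
```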
Focus on Adaptation, Not Just Execution
Built upon the AI2-THOR simulation environment, AsgardBench provides agents with a limited action set (find, pickup, put, clean, toggle_on/off) and visual input. At each step, the agent proposes a full plan, but only the first action is executed. This forces continuous re-evaluation and adaptation, moving beyond static, pre-scripted behaviors. The focus is squarely on AI agent plan adaptation, not on basic navigation or object manipulation.
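The interaction protocol described above amounts to a receding-horizon loop: propose a complete plan, execute only its first step, observe the result, and replan. The sketch below illustrates that loop against a generic interface; `propose_plan`, `env.observe`, `env.step`, and `env.goal_satisfied` are placeholder names for this sketch, not the benchmark's actual API.

```python
# Minimal sketch of the "propose a full plan, execute only the first action" loop,
# assuming hypothetical agent/environment interfaces rather than the real AsgardBench API.
# The action vocabulary mirrors the one described in the article:
# find, pickup, put, clean, toggle_on, toggle_off.

def run_episode(agent, env, goal, max_steps=30):
    """Receding-horizon control: replan from scratch after every executed step."""
    observation = env.observe()                                  # egocentric visual observation
    executed = []                                                # actions carried out so far
    for _ in range(max_steps):
        plan = agent.propose_plan(goal, observation, executed)   # full plan proposed each step
        if not plan:                                             # agent declares the task finished
            break
        action = plan[0]                                         # only the first action is executed
        observation = env.step(action)                           # new observation after acting
        executed.append(action)
        if env.goal_satisfied(goal):
            return True, executed
    return False, executed
```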
This approach means that identical instructions can necessitate different action sequences depending on the observed state of objects and the environment. The benchmark emphasizes the agent's capacity to notice subtle visual cues, like whether a mug is dirty or a faucet is running, and to maintain task context across multiple steps.
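As a toy illustration of that state dependence, a single "clean the mug" instruction can expand into different action sequences depending on what the agent actually observes. The rule-based expansion below is a hypothetical example of the idea, not how benchmark agents are implemented.

```python
def expand_clean_mug(observed_state):
    """Toy illustration: one instruction, different plans depending on observed state."""
    plan = ["find mug"]
    if observed_state.get("mug_dirty", True):
        if observed_state.get("sink_full", False):
            plan.append("toggle_off faucet")   # free the sink before cleaning
        plan += ["pickup mug", "clean mug"]
    # If the mug is already clean, skip the cleaning steps entirely.
    plan += ["put mug cabinet"]
    return plan

print(expand_clean_mug({"mug_dirty": True, "sink_full": True}))
print(expand_clean_mug({"mug_dirty": False}))
```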
Visual Input Proves Crucial
Initial testing on AsgardBench revealed that leading vision-capable models significantly outperform text-only agents. Success rates more than doubled when visual input was provided, underscoring the necessity of perception for effective planning. While detailed textual feedback improved performance, it could mask underlying issues that visual grounding directly addresses.
The tests also pinpointed recurring failure patterns: agents attempted impossible actions, entered repetitive loops, misidentified subtle visual states (clean/dirty, on/off), and lost track of their progress. These shortcomings highlight key areas for improvement: finer visual discrimination in cluttered scenes, more robust state tracking across steps, and better integration of visual perception into real-time plan revision.
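Failure patterns like these are straightforward to flag on the evaluation side. The snippet below sketches two simple checks, for unsupported actions and repetitive loops, purely as an illustration of how such failure modes could be surfaced from an action trace; it is not part of the benchmark itself, and the thresholds are arbitrary.

```python
from collections import Counter

VALID_ACTIONS = {"find", "pickup", "put", "clean", "toggle_on", "toggle_off"}

def diagnose(history):
    """Flag two of the reported failure modes in an action trace (illustrative only)."""
    issues = []
    # Impossible or unsupported actions.
    for step in history:
        if step.split()[0] not in VALID_ACTIONS:
            issues.append(f"unsupported action: {step}")
    # Repetitive loops: the same action issued again and again.
    for action, count in Counter(history).items():
        if count >= 3:
            issues.append(f"possible loop: '{action}' repeated {count} times")
    return issues

print(diagnose(["find mug", "pickup mug", "pickup mug", "pickup mug", "fly mug"]))
```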
AsgardBench serves as both a diagnostic tool and a development accelerator for embodied AI research. By varying feedback levels, researchers can isolate performance bottlenecks. Future advancements will likely focus on systems with stronger visual understanding, improved state management, and training methods that prioritize mid-task plan repair. The open-source release of AsgardBench on GitHub should further speed progress in this vital area of AI research.
