Vision-language models (VLMs) are increasingly used to guide robots, but they often falter on complex, multi-step tasks. A core challenge, detailed by Microsoft Research in their work on GroundedPlanBench, is deciding *what* action to perform and *where* it should occur at the same time. Existing methods tend to decouple these two decisions, so an error in one propagates to the other, especially when natural language instructions are ambiguous.
To measure this problem, GroundedPlanBench evaluates how well VLMs plan actions and ground them spatially across diverse real-world robot scenarios. The team also developed Video-to-Spatially Grounded Planning (V2GP), a framework that transforms robot demonstration videos into spatially grounded training data, so that models can learn planning and spatial grounding jointly rather than as separate stages.
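
To make the idea of a "spatially grounded plan" concrete, here is a minimal Python sketch of one way such a plan could be represented and scored. It is purely illustrative: the `GroundedStep` type, the pixel bounding-box convention, the IoU threshold of 0.5, and the `plan_accuracy` function are all assumptions made for this example, not the data format or metric used by GroundedPlanBench or V2GP.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical convention: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedStep:
    """One plan step that couples *what* to do with *where* to do it."""
    action: str  # e.g. "pick" or "place" (illustrative action names)
    box: Box     # image region the action applies to


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def step_matches(pred: GroundedStep, ref: GroundedStep, thr: float = 0.5) -> bool:
    """A step counts only if the action AND the location are both right."""
    return pred.action == ref.action and iou(pred.box, ref.box) >= thr


def plan_accuracy(pred: List[GroundedStep], ref: List[GroundedStep],
                  thr: float = 0.5) -> float:
    """Fraction of reference steps matched in order (assumed toy metric)."""
    if not ref:
        return 0.0
    hits = sum(step_matches(p, r, thr) for p, r in zip(pred, ref))
    return hits / len(ref)


reference = [
    GroundedStep("pick", (0.0, 0.0, 10.0, 10.0)),
    GroundedStep("place", (20.0, 20.0, 40.0, 40.0)),
]
predicted = [
    GroundedStep("pick", (0.0, 0.0, 10.0, 10.0)),         # right action, right place
    GroundedStep("place", (100.0, 100.0, 120.0, 120.0)),  # right action, wrong place
]
score = plan_accuracy(predicted, reference)
print(score)
```

In this toy example the second step names the correct action but points at the wrong region, so it does not count. Scoring the action and the location together, instead of separately, is the kind of coupling the benchmark is described as testing.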
