Robots Get Better at Long-Term Planning

Microsoft's GroundedPlanBench and V2GP framework improve robot planning by jointly considering actions and locations, overcoming limitations of decoupled approaches.

Microsoft Research

Vision-language models (VLMs) are increasingly tasked with guiding robots, but they often falter on complex, multi-step tasks. A core challenge, detailed by Microsoft Research in their work on GroundedPlanBench, lies in simultaneously determining both *what* action to perform and *where* it should occur. Existing methods often decouple these decisions, leading to error propagation, especially with ambiguous natural language instructions.

To address this, GroundedPlanBench evaluates VLMs on their ability to plan actions and ground them spatially across diverse real-world robot scenarios. The team also developed Video-to-Spatially Grounded Planning (V2GP), a framework that transforms robot demonstration videos into spatially grounded training data. This enables models to learn both planning and spatial grounding concurrently.

Planning with Spatial Grounding

The GroundedPlanBench benchmark utilizes 308 robot manipulation scenes from the DROID dataset. Each scene is annotated with tasks described both explicitly and implicitly. For instance, an explicit instruction might be "put a spoon on the white plate," while an implicit one could be "tidy up the table." Each task is broken down into basic actions like grasp, place, open, and close, each tied to a specific visual location.
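
To make the annotation format concrete, here is a minimal sketch of what one grounded step could look like in Python. The class name, field names, and coordinate values are illustrative assumptions for exposition, not the benchmark's published schema.

```python
from dataclasses import dataclass

@dataclass
class GroundedStep:
    """One basic action tied to a specific visual location."""
    action: str   # one of: "grasp", "place", "open", "close"
    target: str   # natural-language object reference
    bbox: tuple   # (x1, y1, x2, y2) region in the scene image, in pixels

# "Put a spoon on the white plate" decomposed into grounded steps
# (coordinates are made up for illustration).
plan = [
    GroundedStep("grasp", "spoon",       (412, 288, 455, 340)),
    GroundedStep("place", "white plate", (230, 310, 360, 420)),
]
```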

The V2GP framework leverages gripper signals from robot videos to identify object interactions. It then uses advanced segmentation models to track objects and construct grounded plans, pinpointing where an object was grasped and where it was placed. This approach yielded approximately 43,000 grounded plans of varying lengths.
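
The mechanics lend themselves to a short sketch. The Python below is a minimal illustration of the idea, not Microsoft's implementation: it assumes `gripper_signal` is a per-frame openness value in [0, 1], `gripper_xy` is a per-frame end-effector pixel position, and `segmenter` stands in for the segmentation/tracking model; the 0.5 threshold and the `nearest_to_gripper` helper are hypothetical.

```python
def nearest_to_gripper(segments, gripper_pos):
    """Hypothetical helper: pick the segment whose bbox center is
    closest to the end-effector position (in pixels)."""
    def center_dist(seg):
        x1, y1, x2, y2 = seg["bbox"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        return (cx - gripper_pos[0]) ** 2 + (cy - gripper_pos[1]) ** 2
    return min(segments, key=center_dist)

def extract_grounded_plan(frames, gripper_signal, gripper_xy, segmenter):
    """Sketch: gripper open/close transitions mark grasp/place moments;
    a segmentation model localizes the manipulated object at those frames."""
    steps = []
    for t in range(1, len(gripper_signal)):
        was_closed = gripper_signal[t - 1] < 0.5  # illustrative threshold
        is_closed = gripper_signal[t] < 0.5
        if is_closed and not was_closed:          # gripper just closed -> grasp
            seg = nearest_to_gripper(segmenter(frames[t]), gripper_xy[t])
            steps.append(("grasp", seg["bbox"]))
        elif was_closed and not is_closed:        # gripper just opened -> place
            seg = nearest_to_gripper(segmenter(frames[t]), gripper_xy[t])
            steps.append(("place", seg["bbox"]))
    return steps
```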

Decoupled vs. Grounded Planning

Evaluations with models like Qwen3-VL, a capable VLM for robotics, highlighted the limitations of decoupled planning. When one VLM generates a plan and a separate model grounds it, ambiguity in the language propagates into grounding errors: an instruction referring to the "napkin on the table" can lead the grounding model to pick the wrong napkin in a cluttered scene.
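
The contrast between the two setups can be sketched schematically. In the Python below, the `plan`, `locate`, and `plan_grounded` methods are hypothetical interfaces invented for exposition, not actual Qwen3-VL or Microsoft APIs.

```python
def decoupled_pipeline(planner, grounder, image, instruction):
    # Stage 1: plan in pure language ("grasp the napkin", ...).
    steps = planner.plan(image, instruction)
    # Stage 2: a separate model resolves each phrase to pixels. Any
    # ambiguity in the text ("napkin on the table") is resolved here,
    # without access to the planner's intent, so errors propagate.
    return [(step, grounder.locate(image, step)) for step in steps]

def grounded_pipeline(model, image, instruction):
    # A single model emits each action together with its location,
    # e.g. [("grasp", (x1, y1, x2, y2)), ("place", ...)], so the
    # language never has to survive a handoff between models.
    return model.plan_grounded(image, instruction)
```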

This issue is amplified in real-world robot manipulation tasks. By contrast, GroundedPlanBench and the V2GP framework demonstrate that integrating planning and grounding within a single model improves both task success and action accuracy. Training with V2GP data showed significant gains for models like Qwen3-VL, outperforming decoupled methods in both benchmark and real-world robot evaluations.

The research suggests that tightly coupling action and location decisions is crucial for reliable robot manipulation. While challenges remain with longer, multi-step tasks and implicit instructions, future work could combine grounded planning with world models for even more sophisticated robot reasoning.
