Deploying AI agents in real homes is hard because the agents must interpret complex, uncurated household scenes and situated requests instead of relying on clean, predefined task specifications. They have to identify the relevant objects, understand implicit conditions, and work out action sequences from rich, often noisy context. The researchers formalize this capability as 'full-scene household reasoning,' in which an agent must infer an executable task structure before generating a grounded action sequence. Prompting a model directly on the complete scene proves inefficient and error-prone, especially since privacy and local-compute constraints favor compact, open-weight models with limited long-context abilities. To address this, they propose TaskGround, a training-free, model-agnostic framework that grounds a complete scene into a compact, task-relevant slice, infers an executable task structure, and compiles that structure into an action sequence.
From Raw Scenes to Executable Task Structures
TaskGround is built around a 'Ground-Infer-Execute' paradigm. In the grounding step, it distills a complete household scene into a much smaller 'task-relevant scene slice.' This matters because compact models struggle when most of the full scene is irrelevant to the request. With the scene grounded first and the executable task structure inferred second, the model reasons over a focused input before any action sequence is produced.
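The three stages can be illustrated with a toy sketch. This is not the authors' implementation: TaskGround works with language models, whereas the keyword filter, the `SceneObject` type, the `ground`, `infer`, and `execute` functions, and the action names below are all hypothetical stand-ins chosen only to show how a full scene shrinks to a slice, becomes a task structure, and is compiled into actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical scene record; the real scene representation is not specified here.
    name: str
    location: str


def _tokens(request):
    """Lowercase the request and split it into words, dropping the final period."""
    return request.lower().replace(".", "").split()


def ground(scene, request):
    """Ground: keep only objects named in the request (toy relevance filter)."""
    words = set(_tokens(request))
    return [obj for obj in scene if obj.name in words]


def infer(scene_slice, request):
    """Infer: build a toy task structure from the grounded slice.

    Assumed rule for this sketch only: the object mentioned first is moved
    to the object mentioned last.
    """
    tokens = _tokens(request)
    ordered = sorted(scene_slice, key=lambda obj: tokens.index(obj.name))
    return {"move": ordered[0], "to": ordered[-1]}


def execute(task):
    """Execute: compile the task structure into a grounded action sequence."""
    item, dest = task["move"], task["to"]
    return [
        ("walk_to", item.location),
        ("grab", item.name),
        ("walk_to", dest.location),
        ("put", item.name, dest.name),
    ]


# A small "full scene" in which most objects are irrelevant to the request.
SCENE = [
    SceneObject("sofa", "living_room"),
    SceneObject("mug", "kitchen_counter"),
    SceneObject("lamp", "bedroom"),
    SceneObject("sink", "kitchen"),
]
REQUEST = "Put the mug in the sink."

scene_slice = ground(SCENE, REQUEST)
plan = execute(infer(scene_slice, REQUEST))

if __name__ == "__main__":
    print([obj.name for obj in scene_slice])
    for step in plan:
        print(step)
```

In this sketch only two of the four objects reach the later stages, which is the point of the grounding step: the inference and execution stages never see the sofa or the lamp.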
A New Benchmark for Real-World Household AI
To evaluate full-scene household reasoning, the authors introduce FullHome, a human-validated evaluation suite. The benchmark comprises 400 household tasks across diverse home environments and covers both goal-oriented and process-constrained requirements. On FullHome, TaskGround substantially improves task success rates across a range of proprietary and open-weight models. Notably, it lets a compact model such as Qwen3.5-9B achieve performance competitive with larger models such as GPT-5 while cutting input token costs by up to 18x. The results point to executable task-structure inference as a critical bottleneck in household AI and suggest that structured grounding can make compact local models practical to deploy.