Egocentric video spatial question answering demands sophisticated reasoning over 3D object positions and scene affordances, a challenge amplified in the zero-shot setting. Current Vision-Language Models (VLMs) often falter without task-specific fine-tuning or access to 3D sensor data. This paper introduces SpatioRoute, a novel dynamic prompt generation approach that tailors prompts to incoming questions without any additional training or 3D inputs.
Question-Aware Routing for Zero-Shot Efficiency
SpatioRoute operates through two complementary routing mechanisms. SpatioRoute-R employs a rule-based system to deterministically map question types (e.g., 'What', 'Is', 'How') to specialized prompt templates. Complementing this, SpatioRoute-L uses an LLM to generate task-specific prompts based solely on the question and situational context, crucially without requiring video input at the routing stage. This flexibility allows SpatioRoute to adapt the prompt given to the VLM to diverse question types and contextual nuances, enhancing zero-shot capabilities.
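The two routing modes described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the template wording, the choice to detect the question type from its first word, and the names `route_rule_based`, `route_llm`, and `generate` are hypothetical and are not taken from the SpatioRoute paper.

```python
from typing import Callable

# Hypothetical templates keyed by question type; the actual SpatioRoute-R
# templates are not given in the source.
TEMPLATES = {
    "what": "Situation: {situation}\nQuestion: {question}\n"
            "Name the object or attribute being asked about.",
    "is": "Situation: {situation}\nQuestion: {question}\n"
          "Answer with yes or no.",
    "how": "Situation: {situation}\nQuestion: {question}\n"
           "Answer with a number or a short quantity.",
}
DEFAULT_TEMPLATE = "Situation: {situation}\nQuestion: {question}\nAnswer briefly."


def route_rule_based(question: str, situation: str) -> str:
    """Deterministically pick a template from the question's first word.

    Matching on the first word is an assumption; the source only says that
    question types are mapped to templates by rules.
    """
    words = question.strip().split()
    key = words[0].lower().strip("?,.!:;") if words else ""
    template = TEMPLATES.get(key, DEFAULT_TEMPLATE)
    return template.format(situation=situation, question=question)


def route_llm(question: str, situation: str,
              generate: Callable[[str], str]) -> str:
    """Ask a text-only LLM to write the prompt; no video is involved here.

    `generate` is a placeholder for any text-completion function.
    """
    meta_prompt = (
        "Write a prompt that helps a vision-language model answer the "
        "following spatial question about an egocentric video.\n"
        f"Situation: {situation}\nQuestion: {question}"
    )
    return generate(meta_prompt)


if __name__ == "__main__":
    print(route_rule_based("Is the door open?", "I am facing the door."))
```

In both modes the returned string would then be passed to the VLM together with the video frames. The rule-based path costs no extra model call, while the LLM path trades one text-only call for prompts that can reflect the situational context.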
Advancing Spatial Video Understanding Without 3D Data
Evaluated on the SQA3D benchmark across various VLM families, SpatioRoute consistently improves accuracy by up to 5% over fixed-prompt baselines. This establishes a new state of the art for zero-shot, video-only spatial VQA, notably without the need for 3D point-cloud inputs. The research also highlights a critical finding: Chain-of-Thought (CoT) prompting, specifically with the Think it Twice architecture, degrades performance on Qwen series models in this setting, underscoring the advantage of question-aware routing over uniform reasoning strategies for spatial video understanding.