HiVLA: Decoupling Reasoning for Robotic Control

Decomposing Intelligence: VLM Reasoning Meets Specialized Action Experts

HiVLA's core design choice is an explicit decoupling of semantic planning from motor control. A VLM planner handles task decomposition and visual grounding, and outputs structured plans consisting of subtask instructions and target bounding boxes. Keeping the planner separate preserves the VLM's zero-shot reasoning capabilities, which matters for adaptability in robotics.
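
To make the idea of a structured plan concrete, here is a minimal sketch of what such a planner output could look like. The class name `Subtask`, the helper `crop_region`, the pixel-coordinate box convention, and the example values are all hypothetical, chosen for illustration; the source does not specify HiVLA's actual plan schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Subtask:
    """One step of a plan: what to do and where the target object is."""
    instruction: str
    bbox: BBox


def crop_region(subtask: Subtask, image_width: int, image_height: int,
                margin: int = 0) -> BBox:
    """Expand the target box by `margin` pixels and clamp it to the image.

    The result is the region a downstream module could crop at full
    resolution to get an object-centric view.
    """
    x0, y0, x1, y1 = subtask.bbox
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(image_width, x1 + margin),
        min(image_height, y1 + margin),
    )


# Illustrative plan for a two-step task (made-up values).
plan: List[Subtask] = [
    Subtask("pick up the red block", (120, 80, 180, 140)),
    Subtask("place the red block in the bowl", (300, 200, 420, 310)),
]

for step in plan:
    print(step.instruction, crop_region(step, 640, 480, margin=16))
```

A typed, explicit interface like this is one way the planner and the action model could be developed and swapped independently, since each side only depends on the plan format.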

Cascaded Cross-Attention for Precision Motor Control

Translating these plans into physical actions is handled by a flow-matching Diffusion Transformer (DiT) action expert built around a cascaded cross-attention mechanism. The mechanism sequentially fuses global context, high-resolution object-centric crops, and skill semantics, letting the DiT concentrate on robust, fine-grained execution. Because reasoning and execution live in separate modules, each can be improved independently of the other.
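
The sequential fusion described above can be sketched as three cross-attention stages applied one after another. This is a simplified illustration, not HiVLA's implementation: it uses single-head dot-product attention with a residual connection and omits the learned query/key/value projections, normalization, and timestep conditioning that a real DiT block would have. The function names and the stage order (global, then crop, then skill, following the order in which the text lists them) are assumptions.

```python
import math
from typing import List

Vector = List[float]
Tokens = List[Vector]  # a sequence of equally sized feature vectors


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Tokens, context: Tokens) -> Tokens:
    """Scaled dot-product cross-attention with a residual connection.

    Each query token is updated with a weighted average of the context
    tokens. Learned projections are omitted to keep the sketch short.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[i] for w, k in zip(weights, context))
                    for i in range(dim)]
        fused.append([a + b for a, b in zip(q, attended)])
    return fused


def cascaded_fusion(action_tokens: Tokens, global_tokens: Tokens,
                    crop_tokens: Tokens, skill_tokens: Tokens) -> Tokens:
    """Fuse three conditioning sources in sequence (assumed order)."""
    x = cross_attend(action_tokens, global_tokens)  # scene-level context
    x = cross_attend(x, crop_tokens)                # object-centric detail
    x = cross_attend(x, skill_tokens)               # skill semantics
    return x


# Toy inputs with 2-dimensional features (made-up values).
action = [[0.1, 0.2], [0.3, -0.1]]
fused = cascaded_fusion(
    action,
    global_tokens=[[1.0, 0.0], [0.0, 1.0]],
    crop_tokens=[[0.5, 0.5]],
    skill_tokens=[[-0.2, 0.4]],
)
print(len(fused), len(fused[0]))
```

Each stage attends with queries that already carry the information fused in earlier stages, which is what distinguishes a cascade from concatenating all conditioning tokens into a single attention pass.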

Empirical Validation: Outperforming End-to-End Baselines

Extensive experiments in simulation and in real-world settings show HiVLA outperforming state-of-the-art end-to-end baselines. The framework does particularly well in complex scenarios that involve long-horizon skill composition and the precise manipulation of small objects in cluttered environments.