Fine-tuning end-to-end Vision-Language-Action (VLA) models for robotic manipulation often degrades their reasoning ability. This trade-off calls for an approach that bridges high-level understanding and low-level execution without sacrificing either.
Decomposing Intelligence: VLM Reasoning Meets Specialized Action Experts
The core innovation of HiVLA lies in its explicit decoupling of semantic planning from motor control. A vision-language model (VLM) planner handles task decomposition and visual grounding, outputting structured plans with subtask instructions and target bounding boxes. Keeping the planner separate from motor control preserves the VLM's zero-shot reasoning capabilities, which matters for adapting to new tasks and scenes.
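To make the idea of a structured plan concrete, here is a minimal Python sketch of how a downstream component might consume one. The JSON layout, the field names (`instruction`, `bbox`), the pixel-coordinate box format `(x_min, y_min, x_max, y_max)`, and the helper names `parse_plan` and `crop_region` are all assumptions made for illustration; the source does not specify HiVLA's actual plan schema.

```python
import json
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # assumed format: (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Subtask:
    instruction: str  # natural-language subtask instruction from the planner
    bbox: BBox        # bounding box of the object the subtask targets


def parse_plan(raw: str) -> List[Subtask]:
    """Parse a JSON list of subtasks (hypothetical schema) into typed objects."""
    plan = []
    for item in json.loads(raw):
        x0, y0, x1, y1 = item["bbox"]
        if not (x0 < x1 and y0 < y1):
            raise ValueError(f"degenerate bounding box: {item['bbox']}")
        plan.append(Subtask(item["instruction"], (x0, y0, x1, y1)))
    return plan


def crop_region(image: Sequence[Sequence[int]], bbox: BBox) -> List[List[int]]:
    """Cut an object-centric crop out of a row-major image (a list of pixel rows)."""
    x0, y0, x1, y1 = bbox
    return [list(row[x0:x1]) for row in image[y0:y1]]


# Example planner output (invented for illustration).
raw_plan = json.dumps([
    {"instruction": "pick up the red cup", "bbox": [1, 1, 3, 4]},
    {"instruction": "place the cup on the tray", "bbox": [0, 2, 2, 5]},
])
plan = parse_plan(raw_plan)

# A tiny 6x4 dummy "image" whose pixel value encodes its position.
image = [[10 * y + x for x in range(4)] for y in range(6)]
first_crop = crop_region(image, plan[0].bbox)
```

The point of a typed, validated plan is that the action expert only ever sees a short instruction and a box, so the planner can be swapped or upgraded without touching the control side.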
Cascaded Cross-Attention for Precision Motor Control
Translating these plans into physical actions is handled by a flow-matching Diffusion Transformer (DiT) action expert built around a cascaded cross-attention mechanism. The mechanism fuses, in sequence, global context, high-resolution object-centric crops, and skill semantics, so that the DiT can focus on robust, fine-grained execution. Because the planner and the action expert are decoupled, the reasoning and execution modules can each be improved independently.
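The cascading idea can be sketched in a few lines of plain Python. This is a toy illustration, not HiVLA's implementation: it uses single-head scaled dot-product attention with a residual connection and no learned projections, normalization, or flow-matching objective, and the function names (`cross_attend`, `cascaded_cross_attention`) are hypothetical. What it does show is the ordering described above: action tokens attend first to global context, then to object-centric crop features, then to skill semantics, with each stage refining the output of the previous one.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head cross-attention with a residual connection.

    Context tokens serve as both keys and values; a real model would apply
    learned query/key/value projections first (omitted in this sketch).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        out.append([qi + ai for qi, ai in zip(q, attended)])  # residual add
    return out


def cascaded_cross_attention(
    action_tokens: List[Vector],
    global_ctx: List[Vector],
    crop_ctx: List[Vector],
    skill_ctx: List[Vector],
) -> List[Vector]:
    """Fuse the three conditioning sources one after another."""
    x = cross_attend(action_tokens, global_ctx)  # stage 1: whole-scene context
    x = cross_attend(x, crop_ctx)                # stage 2: object-centric crop features
    x = cross_attend(x, skill_ctx)               # stage 3: skill / instruction semantics
    return x


# Toy inputs: two 2-dimensional action tokens and one token per context source.
tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = cascaded_cross_attention(tokens, [[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]])
```

Attending to the sources in separate stages, rather than concatenating them into one context, keeps each conditioning signal in its own attention step; that is the design choice the term "cascaded" refers to here.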