The frontier of autonomous driving is shifting from generalized scene understanding to personalized, instruction-driven navigation. Existing vision-language-action models use language mainly for high-level scene context and fall short of accommodating the diverse user commands needed for tailored driving experiences. This gap highlights a critical need for systems that can interpret and execute nuanced human instructions.
Bridging Language and Action for Personalized Control
To address this, researchers introduced the Vega vision-language-action model, a unified framework designed for instruction-based generation and planning in autonomous driving. A key innovation is the development of InstructScene, a large-scale dataset comprising approximately 100,000 driving scenes meticulously annotated with a wide array of driving instructions and their corresponding trajectories. This dataset is foundational for training models capable of understanding and acting upon personalized user directives.
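To make the data format concrete, the sketch below shows what a single instruction-annotated scene record might look like. The class and field names (InstructSceneRecord, camera_frames, trajectory) are illustrative assumptions for this article, not the published InstructScene schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical layout of one instruction-annotated driving scene.
# Field names are illustrative assumptions, not the actual InstructScene schema.
@dataclass
class InstructSceneRecord:
    scene_id: str                          # unique identifier for the driving scene
    camera_frames: List[str]               # paths to the sequential camera images
    instruction: str                       # natural-language driving command from the user
    trajectory: List[Tuple[float, float]]  # future (x, y) waypoints in the ego frame, in meters

# Purely illustrative example record:
example = InstructSceneRecord(
    scene_id="scene_000001",
    camera_frames=["frames/000001_front_0.jpg", "frames/000001_front_1.jpg"],
    instruction="Slow down and keep to the right lane.",
    trajectory=[(0.0, 0.0), (2.1, 0.1), (4.0, 0.3), (5.7, 0.6)],
)
```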
Hybrid Paradigm for World Modeling and Trajectory Generation
Vega employs a hybrid architecture. It uses an autoregressive paradigm to process sequential visual inputs and language instructions, building an understanding of the driving environment and user intent, and it integrates a diffusion paradigm to generate future predictions (world modeling) and action sequences (trajectory generation). This dual-paradigm strategy, combined with joint attention mechanisms for cross-modal interaction and individual projection layers that preserve modality-specific capabilities, allows the Vega vision-language-action model to achieve superior planning performance and robust instruction following. The research marks a significant step toward more intelligent and adaptable autonomous driving systems.
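As a rough illustration of how such a hybrid design can be wired together, the sketch below combines a joint-attention transformer over vision, language, and action tokens with per-modality projection layers, an autoregressive text head, and a diffusion-style noise-prediction head for waypoints. All module names, dimensions, and the training recipe here are assumptions for illustration; this is not the Vega implementation.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """One transformer block whose self-attention spans vision, language, and action tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]   # joint attention across all modalities
        return x + self.mlp(self.norm2(x))

class HybridVLASketch(nn.Module):
    """Illustrative hybrid model: autoregressive text branch + diffusion action branch."""
    def __init__(self, dim=256, vocab=32000, depth=4):
        super().__init__()
        # Individual projection layers: each modality keeps its own mapping into the shared width.
        self.vision_proj = nn.Linear(768, dim)       # visual patch features
        self.text_embed = nn.Embedding(vocab, dim)   # language instruction tokens
        self.action_proj = nn.Linear(2, dim)         # noisy (x, y) waypoints
        self.time_embed = nn.Embedding(1000, dim)    # diffusion timestep conditioning
        self.blocks = nn.ModuleList([JointAttentionBlock(dim) for _ in range(depth)])
        self.text_head = nn.Linear(dim, vocab)       # next-token prediction (autoregressive branch)
        self.action_head = nn.Linear(dim, 2)         # noise prediction on waypoints (diffusion branch)

    def forward(self, vision_feats, text_ids, noisy_traj, t):
        v = self.vision_proj(vision_feats)                                   # (B, Nv, dim)
        l = self.text_embed(text_ids)                                        # (B, Nl, dim)
        a = self.action_proj(noisy_traj) + self.time_embed(t).unsqueeze(1)   # (B, Nw, dim)
        x = torch.cat([v, l, a], dim=1)
        for blk in self.blocks:   # in practice a causal mask over language tokens would be applied
            x = blk(x)
        nv, nl, nw = v.shape[1], l.shape[1], a.shape[1]
        text_logits = self.text_head(x[:, nv:nv + nl])   # autoregressive language output
        noise_pred = self.action_head(x[:, -nw:])        # denoising target for the trajectory
        return text_logits, noise_pred

# Toy forward pass: 1 scene, 16 visual tokens, 8 language tokens, 6 waypoints.
model = HybridVLASketch()
logits, noise = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)),
                      torch.randn(1, 6, 2), torch.tensor([500]))
print(logits.shape, noise.shape)  # torch.Size([1, 8, 32000]) torch.Size([1, 6, 2])
```

In this kind of setup, the language branch would be trained with a standard next-token loss while the action branch is trained to predict the noise added to ground-truth waypoints, so the same joint-attention backbone serves both the autoregressive and the diffusion objectives.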