The frontier of autonomous driving is shifting from generalized scene understanding to personalized, instruction-driven navigation. Existing vision-language-action models use language primarily as high-level scene context and fall short of accommodating the diverse user commands needed for tailored driving experiences. This gap highlights a critical need for systems that can interpret and execute nuanced human instructions.
Bridging Language and Action for Personalized Control
To address this, researchers introduced the Vega vision-language-action model, a unified framework for instruction-based generation and planning in autonomous driving. A key contribution is InstructScene, a large-scale dataset of approximately 100,000 driving scenes annotated with diverse driving instructions and their corresponding trajectories. This dataset provides the foundation for training models that can understand and act on personalized user directives.
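To make the instruction-trajectory pairing concrete, below is a minimal sketch of how an InstructScene-style sample might be represented and loaded. The field names (scene_id, instruction, trajectory) and the JSONL layout are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical sketch of an InstructScene-style sample. Field names and the
# JSONL layout are assumptions for illustration, not the real dataset schema.
import json
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class InstructSceneSample:
    scene_id: str                          # unique scene identifier (assumed)
    instruction: str                       # natural-language driving command
    trajectory: List[Tuple[float, float]]  # future waypoints (x, y), e.g. in meters

def load_samples(path: str) -> List[InstructSceneSample]:
    """Load instruction-trajectory pairs from a JSONL file (assumed layout)."""
    samples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append(
                InstructSceneSample(
                    scene_id=record["scene_id"],
                    instruction=record["instruction"],
                    trajectory=[tuple(p) for p in record["trajectory"]],
                )
            )
    return samples

# Example: one hypothetical record pairing a command with its target trajectory.
example = InstructSceneSample(
    scene_id="scene_00042",
    instruction="Change to the left lane and overtake the truck ahead.",
    trajectory=[(0.0, 0.0), (2.1, 0.3), (4.3, 1.1), (6.8, 2.0)],
)
```

The design point such a pairing illustrates is that each scene is supervised not only by what the vehicle should do, but by the language that requested it, letting a model learn the mapping from user intent to executable trajectories.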