The pursuit of unified scene understanding in multimodal intelligence hinges on effective Audio-Visual Large Language Models (AV-LLMs). However, conventional multi-task learning approaches for these models suffer substantial negative transfer, with joint training degrading performance on nearly 55% of tasks relative to single-task training. This stems from the inherent heterogeneity of audio-visual tasks, which differ in granularity and in the capabilities they demand, causing interference during joint training.
Tackling Task Heterogeneity with Explicit Cooperation
To overcome this bottleneck, the researchers introduce Crab$^{+}$, a scalable model for unified audio-visual scene understanding that tackles task heterogeneity through explicit cooperation at both the data and model levels.

At the data level, they present AV-UIE v2, a comprehensive dataset of approximately 222K samples drawn from 17 datasets spanning 7 tasks, augmented with explicit reasoning processes. These reasoning traces help the model capture cross-task relationships at diverse granularities.

At the model level, the architecture exposes a unified interface that standardizes task formulations and introduces Interaction-aware LoRA (I-LoRA). I-LoRA employs dynamic routing to explicitly model inter-task relationships, coordinating distinct audio-visual interaction patterns and mitigating parameter interference.
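The summary does not spell out how I-LoRA is implemented, so the following is only a rough PyTorch sketch of the general idea: several low-rank adapter branches attached to a frozen linear layer, with a small learned router mixing their outputs per token. The class name, expert count, rank, and softmax router here are illustrative assumptions, not Crab$^{+}$'s actual design.

```python
# Illustrative sketch only: names and hyperparameters are assumptions,
# not the paper's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionAwareLoRA(nn.Module):
    """A frozen linear layer augmented with several LoRA branches ("experts")
    whose contributions are mixed by a lightweight, input-dependent router."""
    def __init__(self, base_linear: nn.Linear, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # keep the backbone frozen

        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.scale = alpha / rank
        # One low-rank (A, B) pair per expert
        self.lora_A = nn.Parameter(torch.randn(num_experts, rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, out_f, rank))
        # Router produces per-token mixture weights over the experts
        self.router = nn.Linear(in_f, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features)
        gates = F.softmax(self.router(x), dim=-1)               # (B, S, E)
        # Apply every expert's low-rank update, then mix by routing weights
        down = torch.einsum("bsi,eri->bser", x, self.lora_A)    # (B, S, E, r)
        up = torch.einsum("bser,eor->bseo", down, self.lora_B)  # (B, S, E, out)
        delta = (gates.unsqueeze(-1) * up).sum(dim=2)           # (B, S, out)
        return self.base(x) + self.scale * delta

# Usage: wrap a projection layer of a frozen backbone
layer = InteractionAwareLoRA(nn.Linear(4096, 4096), num_experts=4, rank=8)
out = layer(torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 16, 4096])
```

The intuition this sketch captures is that routing among separate low-rank branches lets heterogeneous tasks occupy partially distinct parameter subspaces instead of competing over one shared adapter, which is the mechanism the authors credit for reducing parameter interference.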