Crab+ Unifies AV-LLMs, Reverses Negative Transfer

Crab+ introduces a novel approach to Audio-Visual Large Language Models, overcoming negative transfer via explicit cooperation in data and model design.

Mar 5 at 8:01 PM · 2 min read
[Figure: Diagram illustrating the Crab+ model architecture and data flow for audio-visual scene understanding.]

The pursuit of unified scene understanding in multimodal intelligence hinges on effective Audio-Visual Large Language Models (AV-LLMs). However, conventional multi-task learning approaches for these models are plagued by substantial negative transfer, degrading performance on nearly 55% of tasks compared to single-task training. This stems from the inherent heterogeneity of audio-visual tasks, marked by varying granularity and distinct capability demands, leading to interference during joint training.

Tackling Task Heterogeneity with Explicit Cooperation

To overcome this bottleneck, the researchers introduce Crab+, a scalable model for unified audio-visual scene understanding. Crab+ tackles task heterogeneity through explicit cooperation at both the data and model levels. On the data side, they present AV-UIE v2, a dataset of roughly 222K samples spanning 17 datasets and 7 tasks, each augmented with explicit reasoning processes; this helps the model capture cross-task relationships at diverse granularities. On the model side, a unified interface standardizes task formulations, and Interaction-aware LoRA (I-LoRA) employs dynamic routing to explicitly model inter-task relationships, coordinating distinct audio-visual interaction patterns and mitigating parameter interference.
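To make the routing idea concrete, here is a minimal sketch of what a LoRA layer with input-conditioned dynamic routing over several low-rank adapters could look like. The expert count, rank, softmax router, and all shapes below are illustrative assumptions, not the authors' exact I-LoRA design:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class ILoRALinear:
    """Frozen base linear layer plus several LoRA experts mixed by a router.

    Hypothetical sketch: each expert is a low-rank (A, B) pair, and a
    learned router weights the experts per input so different tasks can
    activate different adapter mixtures instead of sharing one adapter.
    """
    def __init__(self, d_in, d_out, rank=4, n_experts=3, alpha=8.0):
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        # One (A, B) low-rank pair per expert; B starts at zero so the
        # adapters are a no-op before training, as in standard LoRA.
        self.A = rng.standard_normal((n_experts, rank, d_in)) * 0.02
        self.B = np.zeros((n_experts, d_out, rank))
        self.router = rng.standard_normal((n_experts, d_in)) * 0.02
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, d_in). Router scores each expert from the input itself.
        gates = softmax(x @ self.router.T)             # (batch, n_experts)
        base = x @ self.W.T                            # frozen pathway
        # Per-expert low-rank update: x -> A_e -> B_e.
        low = np.einsum('bi,eri->ber', x, self.A)      # (batch, e, rank)
        up = np.einsum('ber,eor->beo', low, self.B)    # (batch, e, d_out)
        delta = np.einsum('be,beo->bo', gates, up)     # gate-weighted mix
        return base + self.scale * delta
```

Because each `B` is zero-initialized, the layer reproduces the frozen base model exactly at the start of training; the router and adapters then learn task-conditioned corrections, which is one plausible way to reduce parameter interference between heterogeneous tasks.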

Demonstrating Positive Transfer and Broad Applicability

The experimental results are compelling. Crab+ not only covers a broader range of tasks than existing unified AV-LLMs but also surpasses specialized models on multiple benchmarks. Crucially, it reverses the negative transfer trend: multi-task learning now outperforms single-task baselines on nearly 88% of evaluated tasks. These gains hold across diverse AV-LLM paradigms and are supported by in-depth visualizations, positioning Crab+ as a robust advance toward holistic audio-visual scene understanding.