Mistral AI has launched Vox-Trainer, a multimodal audio chat model designed to comprehend and generate both spoken audio and text. The release marks a notable step in the company's efforts to provide advanced AI solutions for enterprises, particularly in areas that require sophisticated audio processing and generation.
Introducing Vox-Trainer: A Multimodal Approach
Vox-Trainer is built on a Transformer architecture with three core components: an audio encoder, an adapter layer that downsamples the audio embeddings, and a language decoder that handles reasoning and text output generation. With this design the model can process speech input, handle audio files up to 40 minutes long, and sustain long multi-turn conversations. Because it is multimodal, Vox-Trainer bridges audio and text data, which opens up new possibilities for AI-driven applications.
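To make the encoder, adapter, and decoder flow more concrete, here is a minimal toy sketch in plain Python. It is not Mistral's implementation: the function names, the two-number "embedding" per frame, the average-pooling adapter, and the downsampling factor are all assumptions chosen for illustration. The point is only to show why an adapter that downsamples audio embeddings shortens the sequence the language decoder has to attend over, which matters for long recordings.

```python
from typing import List

Embedding = List[float]


def toy_audio_encoder(samples: List[float], frame_size: int) -> List[Embedding]:
    """Hypothetical encoder: one tiny embedding [mean, peak] per frame of samples."""
    if frame_size < 1:
        raise ValueError("frame_size must be >= 1")
    embeddings = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        mean = sum(frame) / len(frame)
        peak = max(abs(s) for s in frame)
        embeddings.append([mean, peak])
    return embeddings


def toy_adapter(embeddings: List[Embedding], factor: int) -> List[Embedding]:
    """Hypothetical adapter: average every `factor` consecutive embeddings into one."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(e[i] for e in group) / len(group) for i in range(dim)])
    return pooled


def decoder_sequence_length(audio_embeddings: List[Embedding], text_tokens: List[str]) -> int:
    """Stand-in for the decoder: report how many positions it would attend over."""
    return len(audio_embeddings) + len(text_tokens)


# Usage: eight audio samples, frames of two samples, adapter factor of two.
samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
encoded = toy_audio_encoder(samples, frame_size=2)   # 4 embeddings
adapted = toy_adapter(encoded, factor=2)             # 2 embeddings
prompt = ["Summarize", "this", "clip"]
print(decoder_sequence_length(encoded, prompt))      # length without the adapter
print(decoder_sequence_length(adapted, prompt))      # length with the adapter
```

In this sketch the adapter halves the number of audio positions before they reach the decoder. A real model would use learned layers rather than plain averaging, and its actual downsampling factor is not stated here.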
