
Gemini's Audio Stack: From Transcription to Music Generation

Google DeepMind's Thor Schaeff explores Gemini's audio stack, from advanced transcription to music generation with Lyria 3.

Jun 9 at 5:02 PM · 7 min read

Thor Schaeff of Google DeepMind presenting on Gemini's audio stack. Image: AI Engineer

Visual TL;DR. The Gemini Audio Stack enables Advanced Transcription and integrates Lyria 3, and it is measured by Performance Benchmarking. Advanced Transcription supports Multimodal Interactions. Lyria 3 powers Music Generation. Music Generation and Multimodal Interactions both shape the Future of AI Audio.

Gemini Audio Stack: Google DeepMind's Thor Schaeff explores audio capabilities
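Advanced Transcription: high-accuracy speech transcription that captures emotion, multiple speakers, and diverse languages, dialects, and accents
Multimodal Interactions: real-time processing of text, audio, and video inputs over a WebSocket connection
Lyria 3: Google's music generation model, available as Lyria 3 Clip and Lyria 3 Pro
Music Generation: creating original music from text descriptions of genre, mood, and instrumentation
Performance Benchmarking: function-calling and speech-reasoning results on audio benchmarks
Future of AI Audio: continued development, with GitHub examples and the AI Studio live playground for developers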


Thor Schaeff, a Developer Relations Engineer at Google DeepMind, recently gave an in-depth look at Gemini's audio stack, covering how the models both understand and generate sound. The presentation, titled "From Transcription to Live Music: Gemini's Audio Stack," surveyed Google DeepMind's current work in AI audio processing, from speech transcription to real-time music generation.

Video: Gemini's Audio Stack: From Transcription to Music Generation (AI Engineer)

Understanding Gemini's Audio Capabilities

Schaeff began with Gemini's core strengths in audio processing: transcribing speech with high accuracy while also picking up nuances such as emotion, language variations, and multiple speakers in a conversation. The Gemini API is designed to accept a broad range of audio inputs and to handle diverse languages, dialects, and accents. It can also identify and label emotions within speech, which adds depth to the resulting transcript.
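To make the idea of a transcript enriched with speakers, languages, and emotions concrete, here is a minimal sketch of how an application might request and validate such output. The article does not show any code, so everything here is an assumption for illustration: the field names, the `SEGMENT_SCHEMA` shape, the prompt wording, and the sample reply are hypothetical, and the sketch stops short of calling the Gemini API itself.

```python
import json
from dataclasses import dataclass

# Hypothetical output shape we would ask the model to follow.
# This is an illustration, not an official Gemini schema.
SEGMENT_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "speaker": {"type": "string"},
            "language": {"type": "string"},
            "emotion": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["speaker", "language", "emotion", "text"],
    },
}

# Hypothetical prompt that would accompany the audio input.
PROMPT = (
    "Transcribe the attached audio. Return a JSON array of segments, "
    "each with speaker, language, emotion, and text."
)


@dataclass
class Segment:
    speaker: str
    language: str
    emotion: str
    text: str


def parse_segments(raw_json: str) -> list[Segment]:
    """Check that each segment has the required fields, then build Segment objects."""
    required = SEGMENT_SCHEMA["items"]["required"]
    segments = []
    for item in json.loads(raw_json):
        missing = [key for key in required if key not in item]
        if missing:
            raise ValueError(f"segment is missing fields: {missing}")
        segments.append(Segment(**{key: item[key] for key in required}))
    return segments


# Hand-written sample reply, used only to exercise the parser.
sample_reply = json.dumps([
    {"speaker": "Speaker 1", "language": "en", "emotion": "neutral",
     "text": "How can I help you today?"},
    {"speaker": "Speaker 2", "language": "en", "emotion": "frustrated",
     "text": "My order still has not arrived."},
])

segments = parse_segments(sample_reply)
speakers = sorted({segment.speaker for segment in segments})
```

Validating the reply before using it is a deliberate choice: model output that is asked to follow a structure can still omit a field, and failing early is easier to debug than a missing emotion label further downstream.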

Multimodal Realtime Interactions

Much of the talk focused on Gemini's multimodal capabilities, particularly its real-time interaction features. Schaeff demonstrated how the Gemini API can process text, audio, and video inputs simultaneously, enabling more dynamic and interactive AI agents. Applications talk to the model over a WebSocket connection, which keeps latency low. The presentation also touched on Google AI Studio, a platform where developers can experiment with these models, select different voices, and adjust parameters such as media resolution and thinking level to build custom AI experiences.
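The WebSocket flow described above can be sketched as a sequence of small JSON messages: one setup message that picks a model and voice, followed by short audio chunks. The sketch below only builds those messages and does not open a connection. The message field names, the 16 kHz 16-bit mono PCM format, the 100 ms chunk size, and the model and voice strings are all assumptions for illustration and should be checked against the official Live API reference.

```python
import base64
import json

SAMPLE_RATE_HZ = 16_000  # assumed input sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM


def setup_message(model: str, voice: str) -> str:
    """Build the first message sent on the socket (hypothetical field names)."""
    return json.dumps({"setup": {"model": model, "voice": voice}})


def audio_messages(pcm: bytes, chunk_ms: int = 100):
    """Yield one JSON message per chunk_ms of audio, with a base64 payload."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "audio": {
                "mime_type": f"audio/pcm;rate={SAMPLE_RATE_HZ}",
                "data": base64.b64encode(chunk).decode("ascii"),
            }
        })


# Placeholder identifiers, not real model or voice names.
first_message = setup_message("hypothetical-live-model", "hypothetical-voice")

# One second of silence stands in for microphone input.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
messages = list(audio_messages(one_second))
```

Sending many small chunks rather than one large upload is what makes the interaction feel live: the model can start responding while the user is still speaking, at the cost of a little per-message overhead.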

AI-Generated Music with Lyria 3

The presentation also introduced Lyria 3, Google's AI model for music generation. It creates music from textual descriptions, letting users specify genre, mood, and instrumentation. Lyria 3 comes in two variants: Lyria 3 Clip for short audio clips and loops, and Lyria 3 Pro for longer, more complex compositions with verses and choruses. Schaeff framed the model as a step forward for AI's creative role in the music industry, enabling rapid prototyping and new kinds of musical exploration.
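As a rough illustration of prompting by genre, mood, and instrumentation, the sketch below turns a structured brief into a text prompt and picks between the two variants. The `MusicBrief` class, the prompt wording, and the rule "use Pro when song sections are requested, otherwise Clip" are hypothetical simplifications of the description above. The returned strings are display names from the talk, not API model identifiers, and no request is sent.

```python
from dataclasses import dataclass, field


@dataclass
class MusicBrief:
    genre: str
    mood: str
    instruments: list[str]
    # Song sections such as "verse" or "chorus"; empty for a short clip or loop.
    sections: list[str] = field(default_factory=list)
    loop: bool = False


def choose_variant(brief: MusicBrief) -> str:
    """Pick a variant by display name (assumed rule, not an official one)."""
    return "Lyria 3 Pro" if brief.sections else "Lyria 3 Clip"


def build_prompt(brief: MusicBrief) -> str:
    """Flatten the brief into a single text description."""
    parts = [
        f"Genre: {brief.genre}.",
        f"Mood: {brief.mood}.",
        f"Instrumentation: {', '.join(brief.instruments)}.",
    ]
    if brief.sections:
        parts.append(f"Structure: {', '.join(brief.sections)}.")
    if brief.loop:
        parts.append("The clip should loop seamlessly.")
    return " ".join(parts)


loop_brief = MusicBrief(
    genre="lo-fi hip hop",
    mood="relaxed",
    instruments=["electric piano", "vinyl crackle"],
    loop=True,
)
song_brief = MusicBrief(
    genre="indie rock",
    mood="hopeful",
    instruments=["guitar", "drums", "bass"],
    sections=["verse", "chorus", "verse", "chorus"],
)

loop_prompt = build_prompt(loop_brief)
song_prompt = build_prompt(song_brief)
```

Keeping the brief structured until the last moment makes it easy to vary one attribute at a time, which suits the rapid prototyping use case mentioned in the talk.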

Benchmarking Gemini's Audio Performance

Schaeff presented benchmark data on Gemini's performance in audio tasks. On the "ComplexFuncBench audio" benchmark, Gemini 3.5 Flash Live achieved a leading 90.8% accuracy in function calling, ahead of other models such as Gemini 2.5 Flash Native Audio and Gemini 2.0 Flash Native Audio. On the "Big Bench Audio" benchmark for speech reasoning, Gemini 3.5 Flash Live also came out on top with 95.9% accuracy. Schaeff presented both results as evidence of progress in Google's audio AI stack.

The Future of AI Audio

The session concluded with a look at where AI audio is heading, emphasizing the ongoing development and integration of these technologies. Schaeff pointed to the Gemini API examples on GitHub and the live playground in AI Studio as starting points for developers who want to build on these capabilities. A demonstration of the "Live Jukebox" showed the models in practical use, generating music in real time from user prompts.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

#Thor Schaeff #Google DeepMind #Gemini API #Lyria 3 #AI Audio #Music Generation #Transcription #Multimodal AI #Google AI Studio