Gemini's Audio Stack: From Transcription to Music Generation

Google DeepMind's Thor Schaeff explores Gemini's audio stack, from advanced transcription to music generation with Lyria 3.

Thor Schaeff from Google DeepMind presenting on the Gemini audio stack. Source: AI Engineer

Thor Schaeff, a Developer Relations Engineer at Google DeepMind, recently gave an in-depth tour of Gemini's audio stack, covering how the models both understand and generate sound. His presentation, titled "From Transcription to Live Music: Gemini's Audio Stack," walked through Google DeepMind's current work in AI audio processing, from speech understanding to music generation.

Visual TL;DR: The Gemini audio stack enables advanced transcription, which in turn supports multimodal interactions. The stack also integrates Lyria 3, which powers music generation, and it is measured through performance benchmarking. Music generation and multimodal interactions both shape the future of AI audio.

  1. Gemini Audio Stack: Google DeepMind's Thor Schaeff explores audio capabilities
  2. Advanced Transcription: high accuracy speech, emotion, language variations, multiple speakers
  3. Multimodal Interactions: seamless handling of diverse languages, dialects, and accents
  4. Lyria 3: AI-generated music with advanced capabilities
  5. Music Generation: creating original music with AI
  6. Performance Benchmarking: evaluating Gemini's audio processing power
  7. Future of AI Audio: pushing boundaries in sound processing and creation

Understanding Gemini's Audio Capabilities

Schaeff began with Gemini's core strengths in audio processing: beyond transcribing speech with high accuracy, it can pick up on nuances such as emotion, language variations, and multiple speakers in a conversation. The Gemini API is designed to accept a broad range of audio inputs and to handle diverse languages, dialects, and accents. It can also identify and label emotions in speech, which gives a transcript more context than the words alone.
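
One way an application might consume this kind of output is to ask the model for a structured transcript and parse the reply into typed segments. The sketch below is a minimal illustration and not the Gemini API's actual response format: the JSON keys (`speaker`, `language`, `emotion`, `text`), the prompt wording, and the helper names are all assumptions made for this example, and the call to the model itself is left out.

```python
import json
from dataclasses import dataclass

# Hypothetical prompt; the wording and the requested schema are assumptions.
TRANSCRIPTION_PROMPT = (
    "Transcribe the audio. Return a JSON list of segments, each with "
    "the keys: speaker, language, emotion, text."
)


@dataclass
class Segment:
    speaker: str
    language: str
    emotion: str
    text: str


def parse_segments(raw_json: str) -> list[Segment]:
    """Parse a model reply (assumed to be a JSON list) into Segment objects."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of segments")
    return [
        Segment(
            speaker=str(item.get("speaker", "unknown")),
            language=str(item.get("language", "unknown")),
            emotion=str(item.get("emotion", "neutral")),
            text=str(item.get("text", "")),
        )
        for item in items
    ]


def speakers_in(segments: list[Segment]) -> list[str]:
    """Return distinct speaker labels in order of first appearance."""
    seen: dict[str, None] = {}
    for seg in segments:
        seen.setdefault(seg.speaker, None)
    return list(seen)


# Example reply, written by hand for illustration (not real model output).
sample = json.dumps([
    {"speaker": "A", "language": "en", "emotion": "happy", "text": "Hi there!"},
    {"speaker": "B", "language": "de", "emotion": "neutral", "text": "Guten Tag."},
    {"speaker": "A", "language": "en", "emotion": "curious", "text": "How are you?"},
])
segments = parse_segments(sample)
print(speakers_in(segments))
```

Keeping the parsing step separate from the model call makes it easy to validate replies and to fall back to defaults when a field such as the emotion label is missing.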

Multimodal Realtime Interactions

A significant part of the talk covered Gemini's multimodal capabilities, particularly real-time interaction. Schaeff demonstrated how the Gemini API can process text, audio, and video inputs simultaneously, enabling more dynamic and interactive AI agents. Applications communicate with the model over a WebSocket connection, which keeps latency low. He also showed AI Studio, a platform where developers can experiment with these models by selecting different voices and adjusting parameters such as media resolution and thinking level to create custom AI experiences.
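
To make the streaming idea concrete, the sketch below shows how a client could cut raw microphone audio into small fixed-duration chunks and wrap each one in a JSON text frame before sending it over a WebSocket. It is an illustration under stated assumptions: the message fields (`type`, `sample_rate`, `data`), the 16 kHz 16-bit mono format, and the 100 ms chunk length are choices made for this example and are not taken from the talk or from the real wire protocol. The WebSocket send itself is omitted.

```python
import base64
import json
from typing import Iterator


def chunk_pcm(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100,
              bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Split raw mono PCM audio into fixed-duration chunks."""
    chunk_size = sample_rate * chunk_ms // 1000 * bytes_per_sample
    if chunk_size <= 0:
        raise ValueError("chunk_ms too small for this sample rate")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]


def to_message(chunk: bytes, sample_rate: int = 16000) -> str:
    """Wrap one chunk in a JSON text frame (hypothetical field names)."""
    return json.dumps({
        "type": "audio_chunk",  # assumed name, not the real wire format
        "sample_rate": sample_rate,
        "data": base64.b64encode(chunk).decode("ascii"),
    })


# One second of 16 kHz, 16-bit silence splits into ten 100 ms chunks.
one_second = bytes(16000 * 2)
messages = [to_message(c) for c in chunk_pcm(one_second)]
print(len(messages))
```

Smaller chunks reduce the delay before the model hears the start of an utterance, at the cost of more frames per second; the final chunk may be shorter than the rest.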

AI-Generated Music with Lyria 3

The presentation also introduced Lyria 3, Google's AI model for music generation. It creates music from textual descriptions, letting users specify genre, mood, and instrumentation. Lyria 3 comes in two variants: Lyria 3 Clip for short audio clips and loops, and Lyria 3 Pro for longer, more complex compositions with verses and choruses. The model marks a notable step for AI as a creative tool in music, enabling rapid prototyping and new kinds of musical exploration.
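
As a rough illustration of prompting by genre, mood, and instrumentation, the sketch below assembles a text description and picks one of the two variant names mentioned above. The `MusicRequest` class, its fields, the prompt wording, and the rule for choosing a variant are hypothetical and written for this example; the article does not describe the actual request format, and no call to Lyria 3 is made here.

```python
from dataclasses import dataclass, field


@dataclass
class MusicRequest:
    genre: str
    mood: str
    instruments: list[str] = field(default_factory=list)
    song_structure: bool = False  # True when verses and choruses are wanted

    def variant(self) -> str:
        """Pick a variant name (selection rule is an assumption)."""
        return "Lyria 3 Pro" if self.song_structure else "Lyria 3 Clip"

    def prompt(self) -> str:
        """Compose a plain-text description (wording is illustrative)."""
        parts = [f"A {self.mood} {self.genre} track"]
        if self.instruments:
            parts.append("featuring " + ", ".join(self.instruments))
        if self.song_structure:
            parts.append("with verses and a chorus")
        return " ".join(parts) + "."


loop = MusicRequest("lo-fi hip hop", "relaxed", ["electric piano", "vinyl crackle"])
song = MusicRequest("indie rock", "uplifting", ["guitar", "drums"], song_structure=True)
print(loop.variant(), "|", loop.prompt())
print(song.variant(), "|", song.prompt())
```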

Benchmarking Gemini's Audio Performance

Schaeff presented benchmark results for audio tasks. On the "ComplexFuncBench audio" benchmark, which tests function calling, Gemini 3.5 Flash Live achieved a leading 90.8% accuracy, ahead of other models like Gemini 2.5 Flash Native Audio and Gemini 2.0 Flash Native Audio. On the "Big Bench Audio" benchmark for speech reasoning, Gemini 3.5 Flash Live also came out on top with 95.9% accuracy. Both results point to measurable progress across successive generations of Google's audio models.
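
For readers unfamiliar with how a function-calling accuracy figure can be derived, here is a simplified sketch that counts a prediction as correct only when both the function name and its arguments match the reference exactly. This is an assumption made for illustration and not the official scoring method of ComplexFuncBench or Big Bench Audio; the function names and sample data are invented.

```python
def call_matches(predicted: dict, expected: dict) -> bool:
    """True when the function name and the arguments are both identical."""
    return (predicted.get("name") == expected.get("name")
            and predicted.get("args") == expected.get("args"))


def accuracy(predictions: list[dict], references: list[dict]) -> float:
    """Exact-match accuracy in percent over paired predictions and references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(call_matches(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hand-written toy data: three of the four predicted calls match.
references = [
    {"name": "get_weather", "args": {"city": "Berlin"}},
    {"name": "set_timer", "args": {"minutes": 5}},
    {"name": "play_music", "args": {"genre": "jazz"}},
    {"name": "get_weather", "args": {"city": "Paris"}},
]
predictions = [
    {"name": "get_weather", "args": {"city": "Berlin"}},
    {"name": "set_timer", "args": {"minutes": 5}},
    {"name": "play_music", "args": {"genre": "rock"}},  # wrong argument
    {"name": "get_weather", "args": {"city": "Paris"}},
]
print(accuracy(predictions, references))
```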

The Future of AI Audio

The session concluded with a look at where AI audio is heading, with Schaeff emphasizing that these technologies are still being developed and integrated. He pointed to the Gemini API examples on GitHub and AI Studio's live playground as starting points for developers who want to build on these capabilities. A demonstration of the "Live Jukebox" showed a practical application of the models: real-time music generation driven by user prompts.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.