Neil Zeghidour on Voice AI's 'Her' Moment

Gradium AI's Neil Zeghidour discusses the 'Her' moment in voice AI, highlighting challenges like latency and scalability, and showcasing Phonon, their on-device TTS model.

Neil Zeghidour presenting on Voice AI's 'Her' moment. Image credit: AI Engineer

Neil Zeghidour, CEO and Co-founder of Gradium AI, recently discussed the evolution of voice AI and the long-anticipated "Her" moment, drawing parallels to the popular film where artificial intelligence achieves a deeply human-like conversational capability. Speaking at an AI Engineer event, Zeghidour explored the current state of voice AI, the challenges that remain, and the potential future advancements.


The "Her" Moment in Voice AI

Zeghidour opened by framing the discussion around the concept of a truly conversational AI, akin to the sentient operating system Samantha from the movie "Her." He highlighted that while significant progress has been made, the goal of achieving seamless, natural, and empathetic human-AI interaction is still a work in progress. The current state of voice AI, while functional, often falls short of the nuanced and fluid communication expected from human conversations.

Gradium AI's Mission and Technology

Zeghidour introduced Gradium AI's mission: to unlock the unrealized potential of voice AI by making fluid, natural voice the new interface for AI. The company focuses on training voice models for various applications, including speech-to-text (STT), text-to-speech (TTS), and speech-to-speech (S2S) translation. This involves building foundational blocks for voice agents and solutions that can be integrated into various products.


He elaborated on Gradium's approach, emphasizing the move from research to production, and highlighted the company's work on "Moshi": STT with semantic Voice Activity Detection (VAD), customizable LLMs for context, reasoning, and function calling, and streaming, multilingual TTS with voice cloning. This comprehensive approach aims to overcome the limitations of existing cascaded systems.
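The cascaded design he contrasts against can be pictured as three chained stages. A minimal sketch, with hypothetical stage interfaces (Gradium's actual APIs are not described in the talk; the stub functions here are placeholders):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadedVoiceAgent:
    """Classic cascaded pipeline: each stage is a separate model."""
    stt: Callable[[bytes], str]   # audio in  -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio out

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # stage 1: speech-to-text
        reply = self.llm(transcript)     # stage 2: reasoning / function calling
        return self.tts(reply)           # stage 3: text-to-speech

# Stub stages, purely for demonstration:
agent = CascadedVoiceAgent(
    stt=lambda audio: audio.decode(),   # pretend transcription
    llm=lambda text: f"Echo: {text}",   # pretend reply generation
    tts=lambda text: text.encode(),     # pretend synthesis
)
audio_out = agent.respond(b"book me a flight")
```

Because each stage must finish (or at least emit output) before the next begins, every stage's delay lands on the conversational round trip, which is the limitation the article returns to below.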

Challenges in Voice AI: Latency and Scalability

A significant portion of Zeghidour's talk focused on the persistent challenges in voice AI, primarily latency and scalability. He explained that current cascaded systems, which typically involve separate models for STT, LLM processing, and TTS, introduce inherent delays. The latency in these systems can hinder natural conversation flow, making interactions feel clunky and less human-like. He presented data showing that most current TTS models have latencies exceeding 200 milliseconds, which is a significant bottleneck for real-time conversations.
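The additive nature of this delay is easy to see in a back-of-the-envelope latency budget. Only the >200 ms TTS figure comes from the talk; the STT and LLM numbers below are illustrative assumptions:

```python
# Rough latency budget for a cascaded voice pipeline.
# Only the 200 ms TTS figure is from the talk; the rest are assumed for illustration.
STAGE_LATENCY_MS = {
    "stt_finalize": 150,     # assumed: STT finalizing the user's transcript
    "llm_first_token": 300,  # assumed: LLM time to first response token
    "tts_first_audio": 200,  # from the talk: most TTS models exceed 200 ms
}

total_ms = sum(STAGE_LATENCY_MS.values())
print(f"round-trip to first audio: {total_ms} ms")  # prints 650 ms
```

Even with generous assumptions for the upstream stages, the sum comfortably exceeds the sub-second pauses typical of human turn-taking, which is why shaving any single stage's latency matters.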

The presentation also touched upon the need for models that can handle complex reasoning and contextual understanding. Zeghidour pointed out that while current AI can perform specific tasks, achieving true conversational intelligence requires models that can maintain context across turns, understand user intent, and respond with a degree of empathy. He also raised the issue of scalability, noting that the computational resources required for advanced voice AI, particularly for inference, can be substantial, making cost and efficiency critical factors.

The Path Forward: End-to-End Models and On-Device Inference

Zeghidour proposed that the future of voice AI lies in developing end-to-end models that can process speech directly, bypassing the intermediate steps of cascaded systems. This approach, he explained, can significantly reduce latency and improve the overall naturalness of the interaction. He highlighted Gradium's "Phonon" model as an example of this approach, which runs real-time inference on CPU, offering faster processing and personalization without requiring extensive retraining.
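For streaming models like this, the metric that matters is time-to-first-audio rather than total synthesis time. A minimal sketch of measuring it, using a fake generator as a stand-in (no real TTS engine is invoked; the per-chunk delay is an arbitrary assumption):

```python
import time
from typing import Iterator

def fake_streaming_tts(text: str) -> Iterator[bytes]:
    """Stand-in for a streaming TTS engine: yields audio chunks as produced."""
    for word in text.split():
        time.sleep(0.01)     # assumed per-chunk synthesis cost
        yield word.encode()  # placeholder for a chunk of PCM audio

def time_to_first_audio_ms(stream: Iterator[bytes]) -> float:
    """Playback can begin at the first chunk, not at end of synthesis."""
    start = time.perf_counter()
    next(stream)  # block until the first chunk arrives
    return (time.perf_counter() - start) * 1000.0

ttfa_ms = time_to_first_audio_ms(fake_streaming_tts("hello there world"))
```

A streaming design lets the agent start speaking after the first chunk, so perceived latency is governed by this figure rather than by the length of the full utterance.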

He showcased benchmarks comparing Phonon to other leading TTS models, demonstrating its superior performance in terms of Word Error Rate (WER) and speaker similarity, all while operating with significantly lower latency and on less demanding hardware. The ability to run on-device means that these advanced voice capabilities can be deployed on a wider range of devices, including smartphones, without relying on cloud infrastructure, which also addresses privacy concerns.
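Word Error Rate, the intelligibility metric cited in those benchmarks, is the word-level edit distance between a reference transcript and the recognized hypothesis, normalized by reference length. A self-contained implementation of the standard definition (not Gradium's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sit"))  # 1 substitution / 3 words
```

For TTS benchmarking, the synthesized audio is typically transcribed by a strong ASR model and scored against the input text, so a lower WER indicates more intelligible speech.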

Applying the "Her" Moment in Practice

Zeghidour concluded by inviting the audience to experience the advancements in voice AI firsthand. He shared examples of how voice AI can be used to create more natural and engaging user experiences, such as the travel agent chatbot demo. This demo illustrated how a voice AI could understand complex requests, retrieve relevant information, and respond in a conversational manner, mimicking human interaction more closely than ever before.

The presentation underscored the ongoing journey towards achieving the conversational AI envisioned in "Her," emphasizing that while the challenges are significant, the progress made by companies like Gradium AI is bringing that future closer to reality. The focus on efficiency, scalability, and natural interaction is key to unlocking the true potential of voice AI.
