Cloudflare Adds Voice to AI Agents

Cloudflare integrates real-time voice capabilities into its Agents SDK, enabling conversational AI agents beyond traditional text interfaces.

[Illustration: audio waves connecting to an AI agent icon. Caption: Cloudflare's new voice pipeline adds conversational capabilities to AI agents. Credit: Cloudflare]

Cloudflare is adding voice to its AI Agents SDK, aiming to bridge the gap between text-based interaction and natural conversation. Until now, agents built on the SDK have largely been confined to chat interfaces that demand carefully typed prompts. The new experimental voice pipeline, available via the @cloudflare/voice package, lets agents converse in real time over their existing WebSocket connections.

The integration means voice becomes just another input modality, managed by the same Durable Object infrastructure that powers the SDK. This approach preserves the agent's state, persistence, and tooling, avoiding the need to migrate to separate voice frameworks.

Voice as a Native Agent Feature

The @cloudflare/voice package offers flexibility with options like withVoice(Agent) for full voice agents and withVoiceInput(Agent) for speech-to-text-only use cases like dictation or voice search. React developers can leverage useVoiceAgent and useVoiceInput hooks, while framework-agnostic clients can use VoiceClient.
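The exact API surface is still experimental, but a wrapper like withVoice(Agent) typically follows TypeScript's class-mixin pattern: it takes an agent class and returns a subclass with voice handling layered on. A minimal sketch of that pattern, where every name besides withVoice is illustrative rather than the real @cloudflare/voice API:

```typescript
// Sketch of a class-mixin wrapper in the style of withVoice(Agent).
// All names besides withVoice are illustrative, not the real API.
type Constructor<T = {}> = new (...args: any[]) => T;

class Agent {
  // The plain text handler every agent already has.
  onMessage(text: string): string {
    return `echo: ${text}`;
  }
}

function withVoice<TBase extends Constructor<Agent>>(Base: TBase) {
  return class VoiceAgent extends Base {
    // Hypothetical: route a finished transcript through the
    // agent's normal text logic, then hand the reply to TTS.
    onTranscript(transcript: string): string {
      const reply = this.onMessage(transcript);
      return reply; // the real pipeline would synthesize this to audio
    }
  };
}

const MyVoiceAgent = withVoice(Agent);
const agent = new MyVoiceAgent();
```

The appeal of the mixin shape is that the wrapped class keeps its original text handlers, which matches the article's point that voice is layered onto the existing agent rather than replacing it.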

The package ships with Workers AI providers so developers can start without external API keys: Deepgram's Flux and Nova-3 models for continuous speech-to-text, and its Aura models for text-to-speech.

This enables developers to build agents that users can talk to, with responses synthesized back, all while maintaining a single WebSocket connection and persisting conversation history in SQLite.

Designed for Extensibility

Cloudflare emphasizes an open approach. The provider interfaces within @cloudflare/voice are intentionally minimal, encouraging third-party developers to build speech, telephony, and transport components. This aims to prevent vendor lock-in and allow developers to customize their voice architecture.

Simplified Voice Agent Development

The server-side setup is remarkably concise. Developers instantiate a continuous transcriber and a text-to-speech provider, then implement the onTurn() method to handle transcribed input. The client-side integration is equally streamlined, particularly with React hooks.
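In spirit, the server side reduces to wiring two providers into a turn handler. The sketch below stubs out the provider interfaces with hypothetical shapes; the real @cloudflare/voice types, constructors, and options will differ:

```typescript
// Illustrative shapes only; the real @cloudflare/voice interfaces differ.
interface Transcriber { model: string }
interface Synthesizer { speak(text: string): string }

// Hypothetical stand-ins for the SDK's Deepgram-backed providers.
const transcriber: Transcriber = { model: "flux" };
const tts: Synthesizer = { speak: (text) => `<audio:${text}>` };

class VoiceAgent {
  // onTurn() receives the transcript of a completed utterance
  // produced by the continuous transcriber.
  onTurn(transcript: string): string {
    const reply = `You said: ${transcript}`;
    // The pipeline would stream the synthesized audio back over
    // the same WebSocket the transcript arrived on.
    return tts.speak(reply);
  }
}

const agent = new VoiceAgent();
```

The point of the shape is that application code only touches text: audio capture, endpointing, and synthesis stay inside the pipeline.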

The underlying mechanism leverages the Agents SDK's Durable Object foundation. Audio streams over the existing WebSocket into a continuous STT session. When the STT model detects the end of an utterance, the transcript is passed to the agent's logic; the agent's response is then synthesized into audio and streamed back to the client, chunked sentence by sentence for faster time-to-first-audio.
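The sentence-chunking step is simple to illustrate: rather than waiting for the full LLM response, the pipeline can cut the stream at sentence boundaries and hand each piece to TTS as it completes. A rough sketch of that splitting logic (not Cloudflare's actual implementation):

```typescript
// Split streamed text at sentence boundaries so TTS can start on
// the first sentence while later ones are still being generated.
function chunkSentences(buffer: string): { ready: string[]; rest: string } {
  const ready: string[] = [];
  // Cut after ., ! or ? followed by whitespace.
  const re = /[^.!?]*[.!?]+\s+/g;
  let match: RegExpExecArray | null;
  let consumed = 0;
  while ((match = re.exec(buffer)) !== null) {
    ready.push(match[0].trim());
    consumed = re.lastIndex;
  }
  // Whatever remains is an unfinished sentence; keep buffering it.
  return { ready, rest: buffer.slice(consumed) };
}
```

Each entry in `ready` can be synthesized immediately, which is what drives down time-to-first-audio on long responses.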

Conversation history is automatically persisted in SQLite, ensuring continuity across reconnections and deployments.

Unified Multimodal Experiences

A key advantage is the unified handling of voice and text. A user can type a query, switch to voice, and then back to text, all interacting with the same agent instance and conversation history. This simplifies application architecture and provides a more cohesive user experience.

Cloudflare highlights reduced latency as a significant benefit. By keeping the audio transport, STT, and TTS within Cloudflare's network and utilizing Workers AI bindings, the pipeline minimizes the overhead associated with bouncing data between disparate services.

The ability to stream responses and synthesize audio sentence-by-sentence enhances the conversational feel. This is crucial for building truly interactive AI agents.

Beyond Conversation: Voice as Input

For use cases where speech is purely an input method, the withVoiceInput option provides a focused interface. This is ideal for applications like dictation or voice search, where a spoken response is not required.

The SDK also supports advanced features like scheduling spoken reminders and exposing tools to LLMs, mirroring the capabilities of non-voice agents. This allows for complex workflows, such as setting reminders via voice commands.
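A spoken-reminder workflow needs only a little glue between the transcript and the agent's scheduler: extract a delay and a task, then schedule a spoken response. The toy parser below illustrates that extraction step; it is entirely hypothetical, and a real agent would more likely route the transcript through an LLM tool call than a regex.

```typescript
// Toy parser for commands like "remind me in 5 minutes to stretch".
// Entirely illustrative; not part of the Agents SDK.
function parseReminder(
  transcript: string
): { delaySeconds: number; task: string } | null {
  const m = transcript.match(/remind me in (\d+) (second|minute|hour)s? to (.+)/i);
  if (!m) return null;
  const units = { second: 1, minute: 60, hour: 3600 };
  const unit = units[m[2].toLowerCase() as keyof typeof units];
  // The result would be handed to the agent's scheduling facility,
  // with the reply synthesized to speech when the timer fires.
  return { delaySeconds: Number(m[1]) * unit, task: m[3] };
}
```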

Flexibility and Integration

Runtime model switching for transcription is supported, allowing developers to select different STT models based on connection parameters. Pipeline hooks offer further customization by enabling interception and modification of data between stages.
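Runtime selection can be as simple as branching on a connection parameter when the WebSocket is established. A hypothetical sketch, where the query parameter name and model identifiers are illustrative placeholders:

```typescript
// Pick an STT model per connection, e.g. from a query parameter
// on the WebSocket upgrade URL. Names here are placeholders.
function selectSttModel(url: string): string {
  const params = new URL(url).searchParams;
  switch (params.get("stt")) {
    case "nova-3":
      return "nova-3"; // higher-accuracy transcription
    case "flux":
    default:
      return "flux"; // continuous, lowest-latency transcription
  }
}
```

Pipeline hooks would then let the same per-connection logic intercept or rewrite transcripts and synthesized text between stages.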

The voice pipeline is designed to integrate with various transport layers, including WebSockets, Twilio for phone calls, and WebRTC. This allows a single agent to handle interactions across different channels.

Cloudflare's approach aims to make voice a natural extension of existing agent capabilities, rather than a separate, complex system. This move positions their platform for more natural, multimodal AI interactions.
