OpenAI's New Voice API Models

OpenAI introduces GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper to its API, enhancing voice intelligence for developers.

[Image: OpenAI logo with abstract sound wave graphics · OpenAI News]

OpenAI is rolling out a new generation of real-time voice models designed to give voice applications greater intelligence and responsiveness. The initiative introduces three distinct models to the API, aiming to bridge the gap between human conversation and machine action.

The core of the update lies in three new models: GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper. These are intended to move voice interfaces beyond simple command-and-response to systems that can actively listen, reason, translate, and act as a conversation unfolds.

The New Voice AI Arsenal

GPT‑Realtime‑2 is positioned as OpenAI's first voice model with GPT‑5-class reasoning capabilities. It's built to handle complex requests, maintain conversational flow, and integrate with tools seamlessly. Developers can enable features like short preambles to signal processing, parallel tool calls for efficiency, and improved recovery mechanisms for errors.
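To make the tool-calling workflow concrete, here is a minimal sketch of a session configuration carrying a tool definition. The `session.update` event shape mirrors OpenAI's existing Realtime API convention; the `gpt-realtime-2` model identifier and the `get_weather` tool are assumptions for illustration, not confirmed API details.

```python
import json

def build_session_update(tools):
    # Configure a Realtime session with callable tools; with parallel tool
    # calls enabled, the model may invoke several of these concurrently.
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",        # assumed model identifier
            "modalities": ["audio", "text"],
            "tools": tools,
            "tool_choice": "auto",            # let the model decide when to call
        },
    }

# Illustrative tool definition (standard function-tool JSON Schema shape).
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

event = build_session_update([weather_tool])
print(json.dumps(event, indent=2))
```

In practice this event would be sent over the Realtime API's WebSocket connection before streaming audio.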


Context handling sees a significant boost, with the context window expanding from 32K to 128K tokens. This allows for longer, more coherent interactions and complex task execution. The model also demonstrates stronger understanding of specialized terminology and domain-specific language, crucial for production environments.

Furthermore, GPT‑Realtime‑2 offers more controllable tone and delivery, allowing agents to respond with appropriate emotional nuance—calm, empathetic, or upbeat. Developers can also adjust the model's reasoning effort, balancing latency with the depth of analysis required for a given request.
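A sketch of how those two knobs might be set per session. The `reasoning_effort` field name is an assumption based on the announcement's description; only the overall `session.update` event shape follows the Realtime API's existing convention.

```python
def configure_agent(tone, effort):
    # Tone is steered through natural-language instructions; reasoning effort
    # trades response latency against depth of analysis (hypothetical field).
    assert tone in {"calm", "empathetic", "upbeat"}
    assert effort in {"low", "medium", "high"}
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",                  # assumed identifier
            "instructions": f"Respond in a {tone} tone.",
            "reasoning_effort": effort,                 # hypothetical knob
        },
    }

cfg = configure_agent("empathetic", "low")   # low effort -> lowest latency
```

A support agent might run at `low` effort for routine turns and escalate to `high` only when a request requires multi-step reasoning.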

GPT‑Realtime‑Translate aims to revolutionize multilingual communication. It supports live speech translation from over 70 input languages into 13 output languages, keeping pace with speakers in real time. This is a significant step for global customer support, sales, and educational platforms.
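A hypothetical configuration sketch for a translation session. The field names (`output_language`, `input_language_hint`) and the language subset below are illustrative assumptions; the article specifies only that 70+ input languages and 13 output languages are supported.

```python
def translation_session(target, source_hint=None):
    # Illustrative subset of the 13 supported output languages.
    supported_targets = {"en", "es", "fr", "de", "ja", "ko", "zh"}
    if target not in supported_targets:
        raise ValueError(f"unsupported target language: {target}")
    session = {
        "model": "gpt-realtime-translate",   # assumed identifier
        "output_language": target,
    }
    if source_hint:
        # Optional: input language is presumably auto-detected otherwise.
        session["input_language_hint"] = source_hint
    return {"type": "session.update", "session": session}

event = translation_session("en", source_hint="de")
```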

GPT‑Realtime‑Whisper is a new streaming speech-to-text model designed for ultra-low latency transcription. This ensures that live captions, meeting notes, and other speech-to-text applications feel instantaneous and natural.
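Low-latency streaming STT typically emits incremental "delta" events followed by a final "completed" event, which is what lets captions render word by word. The event names below mirror OpenAI's existing transcription-event style but should be treated as assumptions for GPT‑Realtime‑Whisper.

```python
def fold_transcript(events):
    # Accumulate partial text as it streams; prefer the final transcript
    # from the "completed" event when it arrives.
    caption = []
    for ev in events:
        if ev["type"].endswith("transcription.delta"):
            caption.append(ev["delta"])
        elif ev["type"].endswith("transcription.completed"):
            return ev.get("transcript", "".join(caption))
    return "".join(caption)

# Simulated event stream (shapes are illustrative).
stream = [
    {"type": "input_audio_transcription.delta", "delta": "Hello"},
    {"type": "input_audio_transcription.delta", "delta": ", world"},
    {"type": "input_audio_transcription.completed", "transcript": "Hello, world"},
]
print(fold_transcript(stream))  # Hello, world
```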

Voice as the Next Interface

OpenAI highlights three emerging patterns in voice AI: voice-to-action, systems-to-voice, and voice-to-voice. The new Realtime API voice models are engineered to power these patterns, enabling more sophisticated voice-driven experiences.

Companies like Zillow are already leveraging GPT‑Realtime‑2 for complex voice interactions, reporting significant improvements in call success rates and compliance robustness. Deutsche Telekom is exploring GPT‑Realtime‑Translate for more natural cross-language customer interactions.

Safety features are integrated, including active classifiers to halt harmful content and developer tools for additional safeguards. The API also supports EU Data Residency and adheres to enterprise privacy commitments.

Pricing follows the models' usage profiles: GPT‑Realtime‑2 is priced per audio token, while GPT‑Realtime‑Translate and GPT‑Realtime‑Whisper are billed per minute of audio. Developers can begin experimenting with all three models in the OpenAI Playground.
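A back-of-envelope cost model reflecting that pricing structure. All rates below are placeholders, not published prices; substitute the actual figures from OpenAI's pricing page.

```python
# PLACEHOLDER rates in USD -- illustrative only, not actual prices.
RATE_PER_AUDIO_TOKEN = 0.00004    # GPT-Realtime-2: per audio token
RATE_TRANSLATE_PER_MIN = 0.06     # GPT-Realtime-Translate: per minute
RATE_WHISPER_PER_MIN = 0.01       # GPT-Realtime-Whisper: per minute

def estimate_cost(audio_tokens=0, translate_minutes=0.0, whisper_minutes=0.0):
    # Sum the per-unit charges for a mixed workload.
    return (audio_tokens * RATE_PER_AUDIO_TOKEN
            + translate_minutes * RATE_TRANSLATE_PER_MIN
            + whisper_minutes * RATE_WHISPER_PER_MIN)

# e.g. 50k audio tokens of conversation plus 10 minutes of live translation:
print(round(estimate_cost(audio_tokens=50_000, translate_minutes=10), 2))
```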

© 2026 StartupHub.ai. All rights reserved.