OpenAI Slashes API Latency with WebSockets

OpenAI's Responses API now supports WebSockets to slash latency in AI agent workflows, delivering speed improvements of up to 40% and ensuring that faster model inference is no longer masked by per-request API overhead.

Diagram of OpenAI's Responses API architecture with WebSockets: the new WebSocket integration streamlines AI agent workflows. (Image: OpenAI News)

OpenAI is significantly accelerating its AI agent workflows by adopting WebSockets for its Responses API. This move addresses a critical bottleneck: the cumulative latency introduced by traditional, sequential API requests.

Agentic systems, like code completion tools, often involve dozens of back-and-forth API calls to validate actions, process tool outputs, and build context. As AI models become faster, the overhead from these numerous network interactions becomes increasingly apparent, leaving users waiting.
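To see why this adds up, here is a back-of-the-envelope sketch; the call count, round-trip time, and per-request costs are assumed figures for illustration, not OpenAI's measurements.

```python
# Illustrative arithmetic only: round-trip time, connection setup cost, and
# per-call counts are assumed numbers, not measurements from OpenAI.
calls_per_task = 30          # back-and-forth API calls in one agentic rollout
rtt_ms = 50                  # network round trip per request
conn_setup_ms = 80           # TCP + TLS handshake for each new HTTPS request
context_reprocess_ms = 120   # re-validating and re-tokenizing resent history

per_call_overhead_ms = rtt_ms + conn_setup_ms + context_reprocess_ms
total_overhead_s = calls_per_task * per_call_overhead_ms / 1000

print(f"Non-inference overhead per rollout: {total_overhead_s:.1f} s")
# -> 7.5 s of waiting that no amount of faster inference can remove
```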

The company detailed these efforts on OpenAI News, explaining how a persistent WebSocket connection replaces the need for repeated HTTP requests. This allows the API to maintain state, cache reusable information like tokenized text, and process model responses more efficiently.
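Conceptually, the change looks something like the sketch below. The `wss://` endpoint, auth header handling, and event names are illustrative assumptions rather than OpenAI's documented WebSocket contract; only the `response.create` structure is taken from the announcement.

```python
# Sketch under assumptions: the wss:// endpoint, auth header, and event names
# are illustrative guesses, not OpenAI's documented WebSocket contract.
import asyncio
import json
import os

import websockets  # pip install "websockets>=14" (for additional_headers)


async def main():
    url = "wss://api.openai.com/v1/responses"  # assumed endpoint for WebSocket mode
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    # One connection for the whole agent session instead of one HTTPS request per step.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # The familiar `response.create` shape, sent as a message on the open socket.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"model": "gpt-5.3-codex-spark",  # placeholder model name
                         "input": "Summarize the failing test output."},
        }))
        # Stream server events until the response is done (event and field names assumed).
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.completed":
                print(event["response"]["id"])
                break


asyncio.run(main())
```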

When the API Became the Bottleneck

With the introduction of specialized hardware and faster models like GPT-5.3-Codex-Spark, inference speed jumped dramatically. The traditional request-per-call API architecture, however, couldn't keep pace, and API service overhead came to dominate the workflow.


Previous optimizations focused on single-request latency, improving Time To First Token (TTFT) by nearly 45%. Yet, the underlying issue of processing full conversation context for every request persisted, creating a structural inefficiency.
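That structural cost is easy to see with a toy calculation; the turn and token counts below are assumed purely for illustration.

```python
# Toy calculation: if every request resends the whole conversation so far,
# total prompt tokens processed grow roughly quadratically with turn count.
tokens_per_turn = 500   # assumed average tokens added by each step of the rollout
turns = 30

resend_everything = sum(t * tokens_per_turn for t in range(1, turns + 1))
send_only_new = turns * tokens_per_turn

print(resend_everything)  # 232,500 prompt tokens processed across the rollout
print(send_only_new)      # 15,000 tokens if cached state lets you send only deltas
```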

Building a Persistent Connection

The solution involved rethinking the transport protocol. Instead of establishing new connections and resending full histories, OpenAI explored persistent connections. This would enable caching of conversational state and reduce redundant processing.

While options like gRPC were considered, WebSockets emerged as the preferred transport: the protocol is familiar to developers and required minimal disruption to the existing API's input and output shapes.

Early prototypes demonstrated significant potential by modeling agentic rollouts as single, long-running responses. This allowed the API to pause for tool execution and resume, effectively treating local tool calls like hosted services.
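A rough sketch of that pause-and-resume loop is below. The event names (`response.tool_call`, `tool_output`) and payload fields are hypothetical stand-ins for whatever the API actually emits, and `ws` is assumed to be any blocking WebSocket client connection (for example, one from the `websocket-client` package).

```python
import json

# Hypothetical client-side loop: event names and payload fields are assumptions
# used to illustrate the pause/resume pattern, not OpenAI's documented schema.
def run_rollout(ws, first_message, tools):
    """Drive one long-running response over an already-open WebSocket `ws`."""
    ws.send(json.dumps({"type": "response.create",
                        "response": {"input": first_message}}))
    while True:
        event = json.loads(ws.recv())
        if event["type"] == "response.tool_call":        # model paused for a local tool
            result = tools[event["name"]](**event["arguments"])
            ws.send(json.dumps({"type": "tool_output",   # resume with the tool's result
                                "call_id": event["call_id"],
                                "output": result}))
        elif event["type"] == "response.completed":      # rollout finished
            return event["response"]
```

The point of the pattern is that a local tool call behaves like a hosted one: the response simply waits on the open connection instead of being torn down and reconstructed around each tool invocation.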

Keeping the API Familiar

To ensure a smooth developer experience, the launched version retained the familiar `response.create` structure. The server now maintains a connection-scoped cache of previous response states, accessible via `previous_response_id`. This cache includes the prior response object, tool definitions, and rendered tokens, avoiding the need to rebuild context from scratch.
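A follow-up turn can then reference that cached state instead of replaying the transcript. A minimal sketch, reusing the assumed message envelope from the earlier example; `previous_response_id` is the real Responses API parameter the article describes, here carried over the open socket.

```python
import json

# Sketch only: the message envelope is assumed; `previous_response_id` points the
# server at its connection-scoped cache so the full history isn't resent.
def send_followup(ws, previous_response_id, new_input):
    ws.send(json.dumps({
        "type": "response.create",
        "response": {
            "input": new_input,                            # only the new turn's input
            "previous_response_id": previous_response_id,  # reference to cached state
        },
    }))
```

Because the server already holds the prior response object, tool definitions, and rendered tokens for that ID, the client no longer pays to re-upload and re-tokenize them on every call.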

This cached state allows for optimizations such as processing only new input for safety classifiers and validators, appending to cached tokens, and reusing model routing logic.

Setting a New Bar for Speed

An alpha program with coding agent startups validated the approach, reporting end-to-end speedups of up to 40% in agentic workflows. The production rollout saw immediate impact.

OpenAI's Codex platform rapidly migrated traffic to the WebSocket mode, achieving the target of over 1,000 tokens per second for GPT-5.3-Codex-Spark, with bursts reaching 4,000 TPS. Companies like Vercel, Cursor, and Cline have reported substantial latency reductions, with Vercel seeing up to a 40% decrease.

This development marks a significant enhancement to the Responses API, demonstrating the critical need for surrounding infrastructure to match the accelerating pace of AI model inference.
