OpenAI Slashes API Latency with WebSockets

OpenAI's Responses API now supports WebSockets to slash latency in AI agent workflows, delivering speed improvements of up to 40% and ensuring that faster model inference is no longer masked by per-request API overhead.

Diagram of OpenAI's Responses API architecture with WebSockets: the new WebSocket integration streamlines AI agent workflows. (Image: OpenAI News)

OpenAI is significantly accelerating its AI agent workflows by adopting WebSockets for its Responses API. This move addresses a critical bottleneck: the cumulative latency introduced by traditional, sequential API requests.

Agentic systems, like code completion tools, often involve dozens of back-and-forth API calls to validate actions, process tool outputs, and build context. As AI models become faster, the overhead from these numerous network interactions becomes increasingly apparent, leaving users waiting.
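To see why this adds up, here is a back-of-the-envelope sketch; the call count, round-trip time, and per-request costs are assumed figures for illustration, not OpenAI's measurements.

```python
# Illustrative arithmetic only: round-trip time, connection setup cost, and
# per-call counts are assumed numbers, not measurements from OpenAI.
calls_per_task = 30          # back-and-forth API calls in one agentic rollout
rtt_ms = 50                  # network round trip per request
conn_setup_ms = 80           # TCP + TLS handshake for each new HTTPS request
context_reprocess_ms = 120   # re-validating and re-tokenizing resent history

per_call_overhead_ms = rtt_ms + conn_setup_ms + context_reprocess_ms
total_overhead_s = calls_per_task * per_call_overhead_ms / 1000

print(f"Non-inference overhead per rollout: {total_overhead_s:.1f} s")
# -> 7.5 s of waiting that no amount of faster inference can remove
```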

The company detailed these efforts on OpenAI News, explaining how a persistent WebSocket connection replaces the need for repeated HTTP requests. This allows the API to maintain state, cache reusable information like tokenized text, and process model responses more efficiently.
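Conceptually, the change looks something like the sketch below. The `wss://` endpoint, auth header handling, and event names are illustrative assumptions rather than OpenAI's documented WebSocket contract; only the `response.create` structure is taken from the announcement.

```python
# Sketch under assumptions: the wss:// endpoint, auth header, and event names
# are illustrative guesses, not OpenAI's documented WebSocket contract.
import asyncio
import json
import os

import websockets  # pip install "websockets>=14" (for additional_headers)


async def main():
    url = "wss://api.openai.com/v1/responses"  # assumed endpoint for WebSocket mode
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    # One connection for the whole agent session instead of one HTTPS request per step.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # The familiar `response.create` shape, sent as a message on the open socket.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"model": "gpt-5.3-codex-spark",  # placeholder model name
                         "input": "Summarize the failing test output."},
        }))
        # Stream server events until the response is done (event and field names assumed).
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.completed":
                print(event["response"]["id"])
                break


asyncio.run(main())
```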

When the API Became the Bottleneck

With the introduction of specialized hardware and faster models like GPT-5.3-Codex-Spark, inference speed jumped dramatically. The traditional request-per-call API architecture, however, couldn't keep pace, and API service overhead came to dominate the workflow.


Previous optimizations focused on single-request latency, improving Time To First Token (TTFT) by nearly 45%. Yet, the underlying issue of processing full conversation context for every request persisted, creating a structural inefficiency.
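That structural cost is easy to see with a toy calculation; the turn and token counts below are assumed purely for illustration.

```python
# Toy calculation: if every request resends the whole conversation so far,
# total prompt tokens processed grow roughly quadratically with turn count.
tokens_per_turn = 500   # assumed average tokens added by each step of the rollout
turns = 30

resend_everything = sum(t * tokens_per_turn for t in range(1, turns + 1))
send_only_new = turns * tokens_per_turn

print(resend_everything)  # 232,500 prompt tokens processed across the rollout
print(send_only_new)      # 15,000 tokens if cached state lets you send only deltas
```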

Building a Persistent Connection

The solution involved rethinking the transport protocol. Instead of establishing new connections and resending full histories, OpenAI explored persistent connections. This would enable caching of conversational state and reduce redundant processing.

While options like gRPC were considered, WebSockets emerged as the preferred transport: the protocol is familiar to developers and required minimal disruption to the existing API's input and output shapes.

Early prototypes demonstrated significant potential by modeling agentic rollouts as single, long-running responses. This allowed the API to pause for tool execution and resume, effectively treating local tool calls like hosted services.
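A rough sketch of that pause-and-resume loop is below. The event names (`response.tool_call`, `tool_output`) and payload fields are hypothetical stand-ins for whatever the API actually emits, and `ws` is assumed to be any blocking WebSocket client connection (for example, one from the `websocket-client` package).

```python
import json

# Hypothetical client-side loop: event names and payload fields are assumptions
# used to illustrate the pause/resume pattern, not OpenAI's documented schema.
def run_rollout(ws, first_message, tools):
    """Drive one long-running response over an already-open WebSocket `ws`."""
    ws.send(json.dumps({"type": "response.create",
                        "response": {"input": first_message}}))
    while True:
        event = json.loads(ws.recv())
        if event["type"] == "response.tool_call":        # model paused for a local tool
            result = tools[event["name"]](**event["arguments"])
            ws.send(json.dumps({"type": "tool_output",   # resume with the tool's result
                                "call_id": event["call_id"],
                                "output": result}))
        elif event["type"] == "response.completed":      # rollout finished
            return event["response"]
```

The point of the pattern is that a local tool call behaves like a hosted one: the response simply waits on the open connection instead of being torn down and reconstructed around each tool invocation.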

Keeping the API Familiar

To ensure a smooth developer experience, the launched version retained the familiar `response.create` structure. The server now maintains a connection-scoped cache of previous response states, accessible via `previous_response_id`. This cache includes the prior response object, tool definitions, and rendered tokens, avoiding the need to rebuild context from scratch.
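A follow-up turn can then reference that cached state instead of replaying the transcript. A minimal sketch, reusing the assumed message envelope from the earlier example; `previous_response_id` is the real Responses API parameter the article describes, here carried over the open socket.

```python
import json

# Sketch only: the message envelope is assumed; `previous_response_id` points the
# server at its connection-scoped cache so the full history isn't resent.
def send_followup(ws, previous_response_id, new_input):
    ws.send(json.dumps({
        "type": "response.create",
        "response": {
            "input": new_input,                            # only the new turn's input
            "previous_response_id": previous_response_id,  # reference to cached state
        },
    }))
```

Because the server already holds the prior response object, tool definitions, and rendered tokens for that ID, the client no longer pays to re-upload and re-tokenize them on every call.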

This cached state allows for optimizations such as processing only new input for safety classifiers and validators, appending to cached tokens, and reusing model routing logic.

Setting a New Bar for Speed

An alpha program with coding agent startups validated the approach, reporting end-to-end speedups of up to 40% in agentic workflows. The production rollout saw immediate impact.

OpenAI's Codex platform rapidly migrated traffic to the WebSocket mode, achieving the target of over 1,000 tokens per second for GPT-5.3-Codex-Spark, with bursts reaching 4,000 TPS. Companies like Vercel, Cursor, and Cline have reported substantial latency reductions, with Vercel seeing up to a 40% decrease.

This development marks a significant enhancement to the Responses API, demonstrating the critical need for surrounding infrastructure to match the accelerating pace of AI model inference.
