Market research

Anthropic Wins TTFT, But OpenAI Dominates LLM Benchmarks

StartupHub Team
Dec 17, 2025 at 8:25 PM · 4 min read

The latest LLM benchmarks reveal a critical trade-off between responsiveness and raw speed, defining how developers choose their AI coding backends.

The era of AI coding tools acting as a "force multiplier" is officially here, according to new cross-industry data. A Greptile study on the state of AI software development shows that developer output—measured in lines of code per developer—has surged 76% in 2025, rising from 4,450 to 7,839 lines.

This productivity boom isn't just about writing more boilerplate; it’s changing the shape of development work. The median pull request (PR) size increased 33% between March and November 2025, suggesting developers are tackling larger, denser changes with AI assistance. Medium teams (6–15 developers) saw the most dramatic increase, boosting output by 89%.

While the productivity numbers are staggering, the underlying platform war remains a fierce contest defined by performance and cost.

OpenAI still holds the commanding lead in the infrastructure layer, with its SDK downloads reaching 130 million. However, the gap is closing rapidly. Greptile data shows the OpenAI-to-Anthropic SDK download ratio dropped from a staggering 47:1 in January 2024 to 4.2:1 by November 2025, confirming Anthropic’s aggressive push into the enterprise and developer market is paying off.

The Critical LLM Benchmarks: Speed vs. Responsiveness

The most revealing data comes from the head-to-head LLM benchmarks, which tested models like GPT-5.1, Claude Sonnet 4.5, GPT-5-Codex, Claude Opus 4.5, and Gemini 3 Pro across key metrics for coding agents: Time-to-First-Token (TTFT) and sustained throughput.
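Neither metric requires special tooling to reproduce. As a rough illustration of how TTFT and sustained throughput are typically measured, here is a minimal Python sketch against a streaming chat API, using the OpenAI SDK's streaming interface; the model id is a placeholder, and Greptile's actual harness is not published:

```python
import time
from openai import OpenAI

client = OpenAI()

def measure_stream(model: str, prompt: str) -> tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed completion."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first visible output
            chunks += 1
    end = time.perf_counter()

    ttft = first_token_at - start
    # Chunk count is a rough proxy for token count; good enough for tok/s.
    throughput = chunks / max(end - first_token_at, 1e-9)
    return ttft, throughput

# "gpt-5.1" is a placeholder model id, not confirmed from the report.
ttft, tps = measure_stream("gpt-5.1", "Write a binary search in Python.")
print(f"TTFT: {ttft:.2f}s, sustained: {tps:.1f} tok/s")
```

The published figures are percentiles (p50 for TTFT, p75 for throughput), so a real harness would repeat this measurement many times per model and aggregate.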

The results reveal a clear, expensive trade-off that dictates model choice:

Interactive Speed (TTFT): Anthropic Wins. For interactive coding sessions—where a developer needs instant feedback—Anthropic’s models are the undisputed champions. Claude Sonnet 4.5 and Opus 4.5 both delivered the first token in under 2.5 seconds (p50). This responsiveness is crucial for maintaining developer flow.

In stark contrast, OpenAI’s GPT-5-Codex and GPT-5.1 took more than twice as long, hovering around 5.0 to 5.5 seconds (p50). Google’s Gemini 3 Pro lagged significantly, requiring over 13 seconds for the first token.

Raw Throughput: OpenAI Dominates. When it comes to sustained generation speed—the metric that matters for batch jobs, CI/CD pipelines, and large-scale agent operations—OpenAI maintains a massive lead.

GPT-5-Codex and GPT-5.1 delivered the highest sustained throughput, peaking between 62 and 73 tokens per second (p75). Anthropic’s models, despite their quick start, settled into the middle tier, generating only 19 to 21 tokens per second. Gemini 3 Pro was relegated to the back, managing only 4 to 5 tokens per second.

This means that while Anthropic feels faster in a chat window, OpenAI can complete long code generations or complex agent tasks three to four times faster overall.
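The arithmetic behind that claim: end-to-end time is roughly TTFT plus output length divided by throughput. A back-of-the-envelope check with midpoints of the figures above (illustrative numbers, not the report's raw data):

```python
def total_seconds(ttft: float, output_tokens: int, tok_per_s: float) -> float:
    """Approximate wall-clock time: first-token latency plus decode time."""
    return ttft + output_tokens / tok_per_s

TOKENS = 2_000  # a long code generation

claude = total_seconds(ttft=2.5, output_tokens=TOKENS, tok_per_s=20)  # ~102.5 s
gpt = total_seconds(ttft=5.5, output_tokens=TOKENS, tok_per_s=70)     # ~34.1 s
print(f"Claude: {claude:.0f}s, GPT: {gpt:.0f}s, ratio: {claude / gpt:.1f}x")  # ~3.0x
```

The TTFT advantage is a fixed few seconds, while the throughput gap scales with output length, so the longer the generation, the more OpenAI's lead compounds.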

The Cost Penalty: This performance disparity is compounded by cost. When normalized to GPT-5-Codex (1.00×), Anthropic models carry a significant premium. Claude Sonnet 4.5 is 2.00× the cost, and the top-tier Claude Opus 4.5 is 3.30× the cost for an 8k-input / 1k-output workload. Gemini 3 Pro sits in the middle at 1.40×.
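To make those multipliers concrete, here is a small sketch applying them to the 8k-input / 1k-output workload the report normalizes against (relative units only; absolute per-token prices are not given, and the daily volume is a made-up example):

```python
# Relative cost per 8k-in / 1k-out request, normalized to GPT-5-Codex = 1.00x.
COST_MULTIPLIER = {
    "GPT-5-Codex": 1.00,
    "Gemini 3 Pro": 1.40,
    "Claude Sonnet 4.5": 2.00,
    "Claude Opus 4.5": 3.30,
}

REQUESTS_PER_DAY = 10_000  # hypothetical agent-fleet volume

for model, mult in sorted(COST_MULTIPLIER.items(), key=lambda kv: kv[1]):
    print(f"{model:<18} {REQUESTS_PER_DAY * mult:>8,.0f} cost units/day")
```

At fleet scale, the 3.30× multiplier becomes the dominant line item, which is exactly the economics the next paragraph describes.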

For teams running high-volume, high-throughput coding agents, the combination of OpenAI’s superior tokens-per-second rate and lower unit cost makes it the clear economic choice, even if the initial latency is higher. Anthropic is positioning itself as the premium, low-latency option for user-facing applications where instant response is paramount.

Beyond the immediate performance metrics, the Greptile report also highlights how the industry is moving past raw parameter counts. Recent research focuses heavily on efficiency (DeepSeek-V3’s Sparse MoE), context management (RetroLM’s KV-level retrieval), and learned internal state (MEM1), suggesting that the next generation of LLM benchmarks will focus less on simple speed and more on complex, long-horizon agent capabilities.

#AI
#AI Agents
#Anthropic
#Benchmarks
#Generative AI
#LLM
#Market Analysis
#OpenAI
