#Benchmarking

18 articles with this tag

FrontierCode: AI Coding Benchmark Goes Beyond Correctness
Artificial Intelligence

Cognition's FrontierCode benchmark redefines AI code evaluation, measuring real-world 'mergeability' and finding current models fall short of production standards.

12 days ago
Claude Code Benchmarking: Semantic Search vs. Grep
AI Research

Turbopuffer's Kuba Rogut benchmarks semantic code retrieval on Claude Code, revealing how semantic search enhances AI agent precision and efficiency compared to grep.

17 days ago
Bertrand Charpentier on AI Benchmarking Challenges
AI Research

Bertrand Charpentier of Pruna AI discusses the challenges in AI benchmarking, the limitations of public leaderboards, and the importance of considering both quality and efficiency.

19 days ago
Coding Agent Inference Benchmark Revealed
Technology

Together AI unveils a new benchmark for coding agent inference, highlighting performance under real-world load and significant cost advantages.

about 1 month ago
Agentic RLHF Needs New Benchmarks
AI Research

The new Plan-RewardBench benchmark shows that current reward models (RMs) struggle with agentic tool use and long-horizon tasks, highlighting the need for specialized trajectory-level reward modeling.

2 months ago
ClawBench: Testing Real-World AI Agents
AI Research

ClawBench, a new evaluation framework, tests AI agents on real-world online tasks across live platforms, revealing significant performance gaps in current frontier models.

2 months ago
Medical VLMs Fail Critical Input Sanity Checks
AI Research

Medical VLMs fail critical input validation tests, as revealed by the new MedObvious benchmark, highlighting a significant safety risk.

3 months ago
Supermemory CEO on AI Memory: "We need to get this right"
Artificial Intelligence

Supermemory CEO Dhravya Shah discusses the evolution of AI memory, the company's innovative approach to personalizing AI experiences, and the critical importance of getting memory systems right for the future of AI.

3 months ago
Standardizing Survival HTE Evaluation
AI Research

Introducing SurvHTE-Bench, the first comprehensive benchmark for evaluating heterogeneous treatment effects in survival data, promoting reproducible and rigorous research.

4 months ago
TetrisBench: LLMs Conquer Tetris, Differently
Investor News

Yoko Li's TetrisBench project reveals how LLMs, initially struggling with direct play, develop surprising, distinct strategies when tasked with generating game logic, outperforming most humans but faltering against top players' adaptive chaos.

4 months ago
AI Coding Tests Flawed by Infrastructure Noise
Artificial Intelligence

The infrastructure powering AI coding tests can significantly inflate or deflate model scores, potentially masking true capabilities and misleading deployment decisions.

4 months ago
AI Research

FACTS Benchmark Suite Elevates LLM Factuality Scrutiny

6 months ago
AI hits a wall on FrontierMath performance
AI Research

8 months ago
NVIDIA Blackwell benchmarks show staggering AI economics
AI Research

8 months ago
Salesforce AI, Berkeley Unveil BFCL Audio Benchmark for Voice AI Precision
AI Research

Salesforce AI Research and UC Berkeley have unveiled BFCL Audio, a new benchmark designed to rigorously evaluate the precision of AI models in handling audio-native function calls. In an announcement on its blog, the collaboration...

10 months ago
MoNaCo Benchmark: A New Standard for Complex Question Answering
AI Research

The benchmark exposed weaknesses in today’s most advanced models. Researchers tested 15 frontier LLMs, including GPT-5, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI’s reasoning-focused o3.

10 months ago
The Shifting Sands of AI: Benchmarks, Open Source, and Infrastructure Wars
Artificial Intelligence

11 months ago
Press Release

Temenos sets new benchmark for scalability of AI-powered banking with Microsoft

about 1 year ago