#Benchmarking

14 articles with this tag

AI Research

Agentic RLHF Needs New Benchmarks

Plan-RewardBench, a new benchmark, reveals that current reward models (RMs) struggle with agentic tool use and long-horizon tasks, highlighting the need for specialized trajectory-level reward modeling.

25 days ago
AI Research

ClawBench: Testing Real-World AI Agents

ClawBench, a new evaluation framework, tests AI agents on real-world online tasks across live platforms, revealing significant performance gaps in current frontier models.

26 days ago
AI Research

Medical VLMs Fail Critical Input Sanity Checks

Medical VLMs fail critical input validation tests, as revealed by the new MedObvious benchmark, highlighting a significant safety risk.

about 1 month ago
Artificial Intelligence

Supermemory CEO on AI Memory: "We need to get this right"

Supermemory CEO Dhravya Shah discusses the evolution of AI memory, the company's innovative approach to personalizing AI experiences, and the critical importance of getting memory systems right for the future of AI.

about 2 months ago
AI Research

Standardizing Survival HTE Evaluation

Introducing SurvHTE-Bench, the first comprehensive benchmark for evaluating heterogeneous treatment effects in survival data, promoting reproducible and rigorous research.

2 months ago
Investor News

TetrisBench: LLMs Conquer Tetris, Differently

Yoko Li's TetrisBench project reveals how LLMs, initially struggling with direct play, develop surprising, distinct strategies when tasked with generating game logic, outperforming most humans but faltering against top players' adaptive chaos.

2 months ago
Artificial Intelligence

AI Coding Tests Flawed by Infrastructure Noise

The infrastructure powering AI coding tests can significantly inflate or deflate model scores, potentially masking true capabilities and misleading deployment decisions.

3 months ago
AI Research

FACTS Benchmark Suite Elevates LLM Factuality Scrutiny

5 months ago
AI Research

AI hits a wall on FrontierMath performance

7 months ago
AI Research

NVIDIA Blackwell benchmarks show staggering AI economics

7 months ago
AI Research

Salesforce AI, Berkeley Unveil BFCL Audio Benchmark for Voice AI Precision

Salesforce AI Research and UC Berkeley have unveiled BFCL Audio, a new benchmark designed to rigorously evaluate the precision of AI models in handling audio-native function calls. In an announcement on its blog, the collaboration...

8 months ago
AI Research

MoNaCo Benchmark: A New Standard for Complex Question Answering

The benchmark exposed weaknesses in today's most advanced models. Researchers tested 15 frontier LLMs, including GPT-5, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI's reasoning-focused o3.

9 months ago
Artificial Intelligence

The Shifting Sands of AI: Benchmarks, Open Source, and Infrastructure Wars

10 months ago
Press Release

Temenos sets new benchmark for scalability of AI-powered banking with Microsoft

12 months ago