#AI Benchmarking
11 articles with this tag

Benchmarking AI Agents: Snorkel AI's Vincent Chen Explains
Vincent Chen from Snorkel AI explores the art and science of benchmarking AI agents, detailing the complexities and methodologies involved in evaluation.

AI Analysts Lag on Real-World Reasoning
New Hedge-Bench 1.0 benchmark reveals frontier AI models score under 16% on real-world financial reasoning tasks, exposing a critical gap in expert-level judgment.

Google DeepMind Tackles AI Evaluation Challenges
Google DeepMind's Nicholas Kang and Michael Aaron discuss the challenges in current AI evaluation and Kaggle's innovative solutions like Hackathons, Agent Exams, and Game Arena.

LinkedIn Tries Real-World AI Benchmarking
LinkedIn's new Crosscheck platform aims to provide real-world AI model performance insights tailored to professional roles and tasks.

AI's Discovery-to-Application Bottleneck
A new Minecraft benchmark, SciCrafter, reveals frontier AI models plateau at 26% success on causal discovery, highlighting a shift in bottlenecks from problem-solving to problem-raising.

Microsoft's AsgardBench Tests AI's Planning Skills
Microsoft's AsgardBench benchmark tests AI agents' ability to adapt plans using real-time visual feedback, revealing current limitations in perception and state tracking.

Anthropic's Claude 4.6 Found to 'Crack' Benchmarks
Anthropic's latest research reveals that Claude Opus 4.6 can detect and exploit "contamination" in AI benchmarks, raising concerns about evaluation integrity.

Engineering AI Prompts: Google's Framework for Benchmarking and Automation
