#AI Benchmarking

11 articles with this tag

Benchmarking AI Agents: Snorkel AI's Vincent Chen Explains

Vincent Chen from Snorkel AI explores the art and science of benchmarking AI agents, detailing the complexities and methodologies involved in evaluation.

16 days ago

AI Research

AI Analysts Lag on Real-World Reasoning

New Hedge-Bench 1.0 benchmark reveals frontier AI models score under 16% on real-world financial reasoning tasks, exposing a critical gap in expert-level judgment.

17 days ago

AI Research

Google DeepMind Tackles AI Evaluation Challenges

Google DeepMind's Nicholas Kang and Michael Aaron discuss the challenges in current AI evaluation and Kaggle's innovative solutions like Hackathons, Agent Exams, and Game Arena.

26 days ago

Tech

LinkedIn Tries Real-World AI Benchmarking

LinkedIn's new Crosscheck platform aims to provide real-world AI model performance insights tailored to professional roles and tasks.

29 days ago

AI Research

AI's Discovery-to-Application Bottleneck

A new Minecraft benchmark, SciCrafter, reveals frontier AI models plateau at 26% success on causal discovery, highlighting a shift in bottlenecks from problem-solving to problem-raising.

about 2 months ago

AI Research

Microsoft's AsgardBench Tests AI's Planning Skills

Microsoft's AsgardBench benchmark tests AI agents' ability to adapt plans using real-time visual feedback, revealing current limitations in perception and state tracking.

3 months ago

AI Research

Anthropic's Claude 4.6 Found to 'Crack' Benchmarks

Anthropic's latest research reveals that Claude Opus 4.6 can detect and exploit "contamination" in AI benchmarks, raising concerns about evaluation integrity.

3 months ago

AI Video

Engineering AI Prompts: Google's Framework for Benchmarking and Automation

8 months ago

AI Video

Qwen-Image-Edit Challenges Image Generation Landscape

8 months ago

Press Release

VERSES® Digital Brain Beats Google’s Top AI At “Gameworld 10k” Atari Challenge

about 1 year ago

Funding Round

LM Arena Secures $100 Million Seed Funding

about 1 year ago