#AI Benchmarking

11 articles with this tag

Benchmarking AI Agents: Snorkel AI's Vincent Chen Explains
AI Research

Vincent Chen from Snorkel AI explores the art and science of benchmarking AI agents, detailing the complexities and methodologies involved in evaluation.

16 days ago
AI Analysts Lag on Real-World Reasoning
AI Research

New Hedge-Bench 1.0 benchmark reveals frontier AI models score under 16% on real-world financial reasoning tasks, exposing a critical gap in expert-level judgment.

17 days ago
Google DeepMind Tackles AI Evaluation Challenges
AI Research

Google DeepMind's Nicholas Kang and Michael Aaron discuss the challenges in current AI evaluation and Kaggle's innovative solutions like Hackathons, Agent Exams, and Game Arena.

26 days ago
LinkedIn Tries Real-World AI Benchmarking
tech

LinkedIn's new Crosscheck platform aims to provide real-world AI model performance insights tailored to professional roles and tasks.

29 days ago
AI's Discovery-to-Application Bottleneck
AI Research

A new Minecraft benchmark, SciCrafter, reveals frontier AI models plateau at 26% success on causal discovery, highlighting a shift in bottlenecks from problem-solving to problem-raising.

about 2 months ago
Microsoft's AsgardBench Tests AI's Planning Skills
AI Research

Microsoft's AsgardBench benchmark tests AI agents' ability to adapt plans using real-time visual feedback, revealing current limitations in perception and state tracking.

3 months ago
Anthropic's Claude 4.6 Found to 'Crack' Benchmarks
AI Research

Anthropic's latest research reveals that Claude Opus 4.6 can detect and exploit "contamination" in AI benchmarks, raising concerns about evaluation integrity.

3 months ago
Engineering AI Prompts: Google's Framework for Benchmarking and Automation
AI Video

8 months ago
Qwen-Image-Edit Challenges Image Generation Landscape
AI Video

8 months ago
Press Release

VERSES® Digital Brain Beats Google’s Top AI At “Gameworld 10k” Atari Challenge

about 1 year ago
Funding Round

LM Arena Secures $100 Million Seed Funding

about 1 year ago
#AI Benchmarking Articles | StartupHub.ai