#Benchmarks

9 articles with this tag

Personalized AI Agents Now Have a Benchmark
AI Research

A new iOSWorld benchmark reveals AI agents' struggles with personalized, multi-app tasks, highlighting the need for richer context and advanced reasoning capabilities.

11 days ago
Evaluating Coding Agents: Lessons from SWE-rebench
AI Research

Ibragim Badertdinov from Nebius shares key lessons from evaluating coding agents using the SWE-rebench benchmark, highlighting the importance of real-world tasks, reliable verification, and cost-effectiveness.

16 days ago
Active Exploration Unlocks Spatial AI
AI Research

The new ESI-BENCH benchmark shows that active exploration is key to embodied spatial intelligence, exposing AI's 'action blindness' and metacognitive gaps.

about 1 month ago
Unlocking AI Agents with Gym-Anything
AI Research

Gym-Anything enables scalable creation of complex AI agent environments, leading to the vast CUA-World benchmark and more efficient VLM agents.

2 months ago
François Chollet on ARC-AGI-3: The Future of AI Reasoning
AI Research

François Chollet discusses ARC-AGI-3, a new benchmark for AI reasoning, highlighting current AI's limitations and the path toward general intelligence.

3 months ago
AI Coding Benchmark Scores Skewed by Infrastructure
Artificial Intelligence

Infrastructure configuration, not just AI model prowess, can significantly skew benchmark results, complicating deployment decisions.

3 months ago
Funding Round

LMArena Series A lands $150M to standardize AI evaluation

5 months ago
Anthropic Wins TTFT, But OpenAI Dominates LLM Benchmarks
Market Research

6 months ago
NeuroDiscoveryBench Sets New Standard for Neuroscience AI Benchmarks
AI Research

6 months ago