#Benchmarks
9 articles with this tag

Personalized AI Agents Now Have a Benchmark
A new iOSWorld benchmark reveals AI agents' struggles with personalized, multi-app tasks, highlighting the need for richer context and advanced reasoning capabilities.

Evaluating Coding Agents: Lessons from SWE-rebench
Ibragim Badertdinov from Nebius shares key lessons from evaluating coding agents using the SWE-rebench benchmark, highlighting the importance of real-world tasks, reliable verification, and cost-effectiveness.

Active Exploration Unlocks Spatial AI
New benchmark ESI-BENCH reveals active exploration is key to embodied spatial intelligence, exposing AI's 'action blindness' and metacognitive gaps.

Unlocking AI Agents with Gym-Anything
Gym-Anything enables scalable creation of complex AI agent environments, leading to the vast CUA-World benchmark and more efficient VLM agents.

François Chollet on ARC-AGI-3: The Future of AI Reasoning
François Chollet discusses ARC-AGI-3, a new benchmark for AI reasoning, highlighting current AI's limitations and the path toward general intelligence.

AI Coding Benchmark Scores Skewed by Infrastructure
Infrastructure configuration, not just AI model prowess, can significantly skew benchmark results, complicating deployment decisions.

LMArena Series A lands $150M to standardize AI evaluation

Anthropic Wins TTFT, But OpenAI Dominates LLM Benchmarks
