#Benchmarks

9 articles with this tag

Personalized AI Agents Now Have a Benchmark

A new iOSWorld benchmark reveals AI agents' struggles with personalized, multi-app tasks, highlighting the need for richer context and advanced reasoning capabilities.

11 days ago

AI Research

Evaluating Coding Agents: Lessons from SWE-rebench

Ibragim Badertdinov from Nebius shares key lessons from evaluating coding agents using the SWE-rebench benchmark, highlighting the importance of real-world tasks, reliable verification, and cost-effectiveness.

16 days ago

AI Research

Active Exploration Unlocks Spatial AI

The new ESI-BENCH benchmark shows that active exploration is key to embodied spatial intelligence, exposing AI's 'action blindness' and metacognitive gaps.

about 1 month ago

AI Research

Unlocking AI Agents with Gym-Anything

Gym-Anything enables scalable creation of complex AI agent environments, leading to the vast CUA-World benchmark and more efficient VLM agents.

2 months ago

AI Research

François Chollet on ARC-AGI-3: The Future of AI Reasoning

François Chollet discusses ARC-AGI-3, a new benchmark for AI reasoning, highlighting current AI's limitations and the path toward general intelligence.

3 months ago