Personalized AI Agents Now Have a Benchmark

Useful personal AI agents must move beyond stateless instruction following and reason over a user's identity, history, and preferences. Current benchmarks fall short: they operate in impersonal sandboxes that do not reflect the rich, interconnected data residing on a user's device.

Visual TL;DR. AI agents struggle because existing benchmarks are impersonal sandboxes, which reveals a need for user context. iOSWorld addresses that need with realistic user data, which enables new benchmark tasks and is intended to lead to improved AI agents.

AI Agents Struggle: current AI agents fail at personalized, multi-app tasks
Impersonal Sandboxes: existing benchmarks lack rich, interconnected user data
Need for Context: AI needs to understand user identity, history, preferences
Introducing iOSWorld: first interactive, native iOS simulator benchmark
Realistic User Data: simulates digital life with 26 interconnected apps
New Benchmark Tasks: 133 tasks across single, multi-app, and memory tiers
Improved AI Agents: enables better reasoning over personalized user context

Bridging the Personalization Chasm with iOSWorld

To address this gap, researchers have introduced iOSWorld, the first interactive, native iOS simulator benchmark. The environment is built around a persistent user identity and comprises 26 newly developed iOS apps. These apps share interconnected data, including transactions, messages, travel records, social connections, and financial activity, creating a realistic simulation of a user's digital life. iOSWorld contains 133 tasks across three difficulty tiers: single-app tasks (27), multi-app tasks spanning 2 to 8 apps (60), and memory and personalization tasks requiring inference from personal data (46).
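The tier breakdown above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the `Tier` class and the function names are hypothetical and used only for illustration, and only the task counts come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One difficulty tier (hypothetical helper type, not part of iOSWorld)."""
    name: str
    task_count: int


# Task counts as reported in the article.
TIERS = [
    Tier("single-app", 27),
    Tier("multi-app (2 to 8 apps)", 60),
    Tier("memory and personalization", 46),
]


def total_tasks(tiers):
    """Sum the task counts over all tiers."""
    return sum(t.task_count for t in tiers)


def tier_shares(tiers):
    """Return each tier's fraction of the full task set."""
    total = total_tasks(tiers)
    return {t.name: t.task_count / total for t in tiers}


if __name__ == "__main__":
    print("total tasks:", total_tasks(TIERS))
    for name, share in tier_shares(TIERS).items():
        print(f"{name}: {share:.1%}")
```

The three counts sum to the reported 133 tasks, and multi-app tasks make up roughly 45% of the benchmark, so a weak multi-app score weighs heavily on any task-weighted overall figure.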

Performance Realities and the Power of Context

Evaluations on iOSWorld using frontier and open-source models highlight significant challenges. The best-performing configuration achieved only 52% overall accuracy, dropping to 37% on multi-app tasks. Privileged vision+XML access, which adds the accessibility tree to the visual input, provided gains of up to 26 percentage points for frontier models, underscoring the importance of richer input modalities. Smaller models did not show similar benefits from the accessibility-tree input, suggesting architectural or training limitations.
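To see how the overall and multi-app figures relate, here is a rough back-of-the-envelope sketch. It rests on assumptions the article does not confirm: that the overall score is a task-weighted average in which each of the 133 tasks counts equally, that both percentages come from the same configuration, and that they are rounded. The function name is hypothetical, and its output is an implied estimate, not a reported result.

```python
def implied_accuracy_elsewhere(overall_acc, subset_acc, total_tasks, subset_tasks):
    """Accuracy implied for tasks outside a subset.

    Assumes overall_acc is a task-weighted average over total_tasks,
    which is an assumption, not something the article states.
    """
    if not 0 < subset_tasks < total_tasks:
        raise ValueError("subset_tasks must be between 0 and total_tasks")
    solved_overall = overall_acc * total_tasks   # expected tasks solved in total
    solved_subset = subset_acc * subset_tasks    # expected tasks solved in the subset
    remaining_tasks = total_tasks - subset_tasks
    return (solved_overall - solved_subset) / remaining_tasks


if __name__ == "__main__":
    # Inputs from the article: 52% overall, 37% on the 60 multi-app tasks, 133 tasks.
    estimate = implied_accuracy_elsewhere(0.52, 0.37, 133, 60)
    print(f"implied accuracy on non-multi-app tasks: {estimate:.1%}")
```

Under those assumptions the single-app and personalization tiers together would sit noticeably above the overall average, which is consistent with the article's point that multi-app tasks are where agents struggle most.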

iOSWorld is released as an open-source benchmark, complete with apps, seeded data, tasks, rubrics, and evaluation code. It gives the community a tool to rigorously assess and advance the capabilities of AI agents in realistic, personalized contexts.
