Personalized AI Agents Now Have a Benchmark

A new iOSWorld benchmark reveals AI agents' struggles with personalized, multi-app tasks, highlighting the need for richer context and advanced reasoning capabilities.

The iOSWorld benchmark provides a realistic simulation environment for evaluating AI agents' ability to interact with personalized user data across multiple iOS applications.

Truly intelligent personal AI agents must move beyond stateless instruction following and reason over a user's unique identity, history, and preferences. Current benchmarks fall short: they operate in impersonal sandboxes that do not reflect the rich, interconnected data residing on a user's device.

Visual TL;DR. AI agents struggle because existing benchmarks are impersonal sandboxes, which exposes the need for richer user context. iOSWorld addresses that need with realistic user data and new benchmark tasks, with the goal of enabling better AI agents.

  1. AI Agents Struggle: current AI agents fail at personalized, multi-app tasks
  2. Impersonal Sandboxes: existing benchmarks lack rich, interconnected user data
  3. Need for Context: AI needs to understand user identity, history, preferences
  4. Introducing iOSWorld: first interactive, native iOS simulator benchmark
  5. Realistic User Data: simulates digital life with 26 interconnected apps
  6. New Benchmark Tasks: 133 tasks across single, multi-app, and memory tiers
  7. Improved AI Agents: enables better reasoning over personalized user context

Bridging the Personalization Chasm with iOSWorld

To address this critical gap, researchers have introduced iOSWorld, the first interactive, native iOS simulator benchmark. This novel environment is built around a persistent user identity and encompasses 26 newly developed iOS apps. These apps feature interconnected data streams, including transactions, messages, travel records, social connections, and financial activity, creating a realistic simulation of a user's digital life. iOSWorld is structured with 133 tasks across three difficulty tiers: single-app tasks (27), multi-app tasks spanning 2 to 8 apps (60), and memory and personalization tasks requiring inference from personal data (46).
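The tier breakdown above can be checked with a few lines of Python. This is only an illustrative sketch: the `Tier` class and `tier_shares` function are hypothetical names, not part of the iOSWorld release, and only the task counts and descriptions come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One difficulty tier (hypothetical structure, not the benchmark's own schema)."""
    name: str
    task_count: int
    note: str


# Counts and descriptions as reported in the article.
TIERS = [
    Tier("single-app", 27, "one app per task"),
    Tier("multi-app", 60, "spans 2 to 8 apps"),
    Tier("memory and personalization", 46, "requires inference from personal data"),
]


def tier_shares(tiers):
    """Return the total task count and each tier's fraction of that total."""
    total = sum(t.task_count for t in tiers)
    return total, {t.name: t.task_count / total for t in tiers}


total, shares = tier_shares(TIERS)
print(f"total tasks: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three counts add up to the reported 133 tasks, and multi-app tasks make up the largest share of the benchmark.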

Performance Realities and the Power of Context

Evaluations of frontier and open-source models on iOSWorld show how difficult these tasks remain. The best-performing configuration reached only 52% overall accuracy, falling to 37% on multi-app tasks. Privileged vision+XML (accessibility-tree) access improved frontier models by up to 26 percentage points, underscoring the value of richer input modalities. Smaller models did not benefit comparably from this accessibility-tree input, suggesting architectural or training limitations.
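To put the 52% and 37% figures in perspective, the following back-of-the-envelope sketch works out what the remaining tiers would have to average. It rests on two assumptions the article does not confirm: that both figures come from the same configuration, and that the overall score is a plain per-task average. The function name `implied_rest_accuracy` is hypothetical.

```python
# Task counts per tier, as reported in the article.
TIER_COUNTS = {"single_app": 27, "multi_app": 60, "memory_personalization": 46}


def implied_rest_accuracy(overall, tier, tier_accuracy, counts=TIER_COUNTS):
    """Accuracy the other tiers must average for the overall figure to hold.

    Assumes the overall score is a plain per-task average (an assumption;
    the article does not say how the overall number is aggregated).
    """
    total = sum(counts.values())
    rest = total - counts[tier]
    solved_overall = overall * total               # expected solved tasks overall
    solved_tier = tier_accuracy * counts[tier]     # expected solved tasks in the tier
    return (solved_overall - solved_tier) / rest


rest_accuracy = implied_rest_accuracy(overall=0.52, tier="multi_app", tier_accuracy=0.37)
print(f"Implied accuracy on the other tasks: {rest_accuracy:.1%}")
```

Under those assumptions, the 73 single-app and memory tasks would need to average roughly 64%. The article does not report that number, and because the published percentages are rounded, it should be read as an estimate only.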

The release of iOSWorld as an open-source benchmark, complete with apps, seeded data, tasks, rubrics, and evaluation code, marks a pivotal moment for AI research. It provides the community with a vital tool to rigorously assess and advance the capabilities of AI agents in realistic, personalized contexts.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.