The independent verification of model performance has become one of the most critical challenges facing the global AI ecosystem. As George Cameron and Micah Hill-Smith, co-founders of Artificial Analysis (AA), explained in a recent interview, relying solely on proprietary labs to report their own performance metrics introduces a fundamental conflict of interest that distorts the competitive landscape. Cameron and Hill-Smith spoke with Swyx of Latent Space about the rapid evolution of their company, which has quickly become the independent gold standard for evaluating large language models (LLMs). The Australian-founded firm was born out of a simple necessity: a pervasive lack of reliable, objective data regarding model performance, speed, and cost.
The founders realized quickly that relying on self-reported metrics from the labs building these models was a fool’s errand. This systemic bias meant that developers trying to build reliable applications faced an impossible decision-making landscape. Labs often manipulate evaluations by prompting models differently or cherry-picking examples, leading to inflated, non-reproducible scores. Cameron points to one particularly egregious example: "Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU." To counteract this, Artificial Analysis adopted a rigorous methodology, including running all evaluations themselves and implementing a "mystery shopper policy," registering accounts not on their own domain to prevent labs from serving different models on private, optimized endpoints.
The mission of providing free, public data remains central to AA’s identity, allowing developers and companies to navigate the increasingly complex AI stack. This public transparency is supported by two commercial arms. The first is an enterprise benchmarking insights subscription, offering standardized reports on critical deployment decisions, such as choosing between serverless inference, managed solutions, or leasing chips for self-hosting. The second revenue stream is private custom benchmarking for AI companies themselves, helping them understand their own models' performance against specialized criteria. Critically, the founders maintain a strict firewall: "No one pays to be on the public leaderboard."
