The most acute challenge facing developers and enterprise leaders navigating the sprawling frontier of Large Language Models (LLMs) is the lack of trusted, independent performance metrics. Every major lab releases models accompanied by self-reported benchmarks, often cherry-picked or optimized for specific evaluation sets, creating a pervasive trust deficit. Artificial Analysis (AA), founded by George Cameron and Micah Hill-Smith, was born from this necessity, rapidly establishing itself as the independent gold standard for benchmarking and providing the objective data required to make sound deployment decisions across the entire AI stack.
The genesis of AA was less a grand strategic plan and more a necessary side project. Micah Hill-Smith, while building a legal AI assistant in 2023, repeatedly ran into unreliable performance claims and quickly realized that benchmarking was itself a core prerequisite for reliable development. As Smith noted during the interview, "the more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem." This initial frustration morphed into a public-facing comparison site launched in January 2024, which quickly gained traction after a viral retweet, validating the market's hunger for credible data.
The core of AA’s value proposition lies in its rigorous, independent methodology, designed specifically to circumvent the systemic biases of vendor self-reporting. They determined early on that they “had to run the evals ourselves and just run them in the same way across all the models.” This approach revealed stark disparities. Smith highlighted the extreme measures labs take to inflate scores, recalling that for early benchmarks, "Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU." To ensure their results reflect the real-world performance delivered to average users, AA instituted a "mystery shopper policy," registering accounts under domains other than their own so that labs cannot serve different, optimized models on private endpoints.
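To make the "same way across all the models" principle concrete, the sketch below shows what a standardized multiple-choice eval runner might look like: one prompt template and one decoding configuration applied identically to every model. The `query_model` helper, prompt format, and scoring details are illustrative assumptions, not AA's actual harness.

```python
# Minimal sketch of a standardized cross-model eval loop: one prompt template,
# one decoding configuration, applied identically to every model.
# `query_model` and the prompt format are illustrative assumptions,
# not Artificial Analysis's actual harness.
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    choices: list[str]   # e.g. ["Paris", "Rome", "Madrid", "Berlin"]
    answer: str          # correct letter, e.g. "A"

PROMPT_TEMPLATE = (
    "Answer the multiple-choice question with a single letter.\n"
    "{question}\n{choices}\nAnswer:"
)

def run_eval(model_id: str, items: list[EvalItem], query_model) -> float:
    """Score one model on one dataset, using settings shared by all models."""
    correct = 0
    for item in items:
        choices = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(item.choices))
        prompt = PROMPT_TEMPLATE.format(question=item.question, choices=choices)
        reply = query_model(model_id, prompt, temperature=0.0)  # identical decoding settings
        if reply.strip().upper().startswith(item.answer.upper()):
            correct += 1
    return correct / len(items)
```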
AA’s business model balances widespread public utility with financial sustainability. A wealth of data is offered freely on their website to support the developer ecosystem, while enterprise clients subscribe to proprietary benchmarking insights covering complex decisions such as deployment strategy: serverless inference versus managed solutions, or leasing chips outright.
This commitment to independent evaluation extends to AA's comprehensive metrics, which are constantly evolving beyond saturated academic tests. The cornerstone is the Artificial Analysis Intelligence Index (AA-II) V3, which synthesizes results from ten evaluation datasets, ranging from traditional MMLU and GPQA to advanced agentic and long-context reasoning benchmarks, and reports scores with tight 95% confidence intervals derived from repeated runs.
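As an illustration of how repeated runs translate into a confidence interval, the sketch below averages each full run into a composite score and applies a normal approximation across runs. The equal weighting, the specific evals, and the numbers are illustrative assumptions and do not reproduce AA's actual AA-II weighting scheme.

```python
# Illustrative aggregation of repeated eval runs into a composite index with a
# 95% confidence interval. Equal weighting, the chosen evals, and the scores
# are assumptions, not AA's published methodology.
import statistics

def composite_index(run_scores: list[dict[str, float]]) -> tuple[float, float]:
    """Each element of run_scores is one full repeated run (eval name -> score).
    Returns (mean composite score, 95% CI half-width) via a normal approximation."""
    composites = [statistics.mean(run.values()) for run in run_scores]
    mean = statistics.mean(composites)
    half_width = 1.96 * statistics.stdev(composites) / len(composites) ** 0.5
    return mean, half_width

# Example: three repeated runs over a subset of evals
runs = [
    {"MMLU-Pro": 81.2, "GPQA Diamond": 58.4, "LiveCodeBench": 64.0},
    {"MMLU-Pro": 80.7, "GPQA Diamond": 59.1, "LiveCodeBench": 63.2},
    {"MMLU-Pro": 81.5, "GPQA Diamond": 58.0, "LiveCodeBench": 64.5},
]
score, ci = composite_index(runs)
print(f"Index: {score:.1f} ± {ci:.1f} (95% CI)")
```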
In response to the industry's struggle with factual reliability, AA introduced the Omniscience Index, a novel metric focusing explicitly on knowledge reliability and hallucination rates. The benchmark rewards accuracy and punishes confident wrong answers, so a model scores better by admitting "I don't know" than by guessing. This methodology revealed fascinating training trade-offs: while Claude models may not always be the smartest overall, they consistently demonstrate the lowest hallucination rates. Omniscience Index scores also correlate strongly with total parameter count, pointing toward an industry shift toward large, sparse models.
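A minimal sketch of such a scoring rule follows; the exact weights (+1 for a correct answer, -1 for a wrong one, 0 for an abstention) are assumptions used for illustration rather than AA's published formula.

```python
# Sketch of an Omniscience-style scoring rule: correct answers earn credit,
# wrong answers are penalized, and abstaining ("I don't know") is neutral.
# The exact weights (+1 / -1 / 0) are illustrative assumptions.
def omniscience_score(responses: list[tuple[str, str]]) -> float:
    """responses holds (model_answer, reference_answer) pairs."""
    total = 0
    for answer, reference in responses:
        normalized = answer.strip().lower()
        if normalized in {"i don't know", "unknown"}:
            total += 0   # abstention: no reward, no penalty
        elif normalized == reference.strip().lower():
            total += 1   # correct answer
        else:
            total -= 1   # hallucinated or wrong answer is punished
    return total / len(responses)  # ranges from -1 (all wrong) to 1 (all correct)
```

Under weights like these, guessing only pays off when the model is right more than half the time, so calibrated abstention becomes the dominant strategy on uncertain questions.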
For evaluating real-world economic value, AA developed GDPval-AA, their independently run version of OpenAI's GDPval, evaluating AI models on 44 white-collar tasks involving spreadsheets, PDFs, and presentations. This agentic benchmark leverages their open-source Stirrup agent harness, enabling multi-turn interactions and code execution. The results, graded by Gemini 3 Pro as an LLM judge (after extensive testing to confirm no self-preference bias), provide a critical view of which models deliver actual economic utility in complex workflows.
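To show the general shape of LLM-as-judge grading in this kind of pipeline, the sketch below feeds a task description, a rubric, and a candidate deliverable to a judge model and parses a structured score. The prompt, rubric format, and `call_judge` wrapper are hypothetical; this is not the Stirrup harness or AA's actual grading setup.

```python
# Minimal sketch of LLM-as-judge grading for an agentic deliverable, loosely in
# the spirit of GDPval-AA. The prompt, rubric format, and `call_judge` wrapper
# are hypothetical placeholders.
import json

JUDGE_PROMPT = """You are grading a work deliverable.
Task description:
{task}

Grading rubric:
{rubric}

Candidate deliverable:
{deliverable}

Respond with JSON only: {{"score": <integer 0-10>, "justification": "<one sentence>"}}"""

def grade_deliverable(task: str, rubric: str, deliverable: str, call_judge) -> dict:
    """call_judge(prompt) -> str is a thin wrapper around the judge model
    (e.g. Gemini 3 Pro in AA's setup), injected so the grader stays model-agnostic."""
    raw = call_judge(JUDGE_PROMPT.format(task=task, rubric=rubric, deliverable=deliverable))
    result = json.loads(raw)
    if not 0 <= result["score"] <= 10:
        raise ValueError("judge returned an out-of-range score")
    return result
```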
Analyzing the market dynamics through their data, AA observes a "smiling curve" in AI costs. GPT-4-level intelligence is now, in their words, "100-1000x cheaper than at launch," yet expenditure on frontier reasoning models used in advanced agentic workflows is simultaneously rising. This dichotomy is driven by the increasing cost of serving highly complex, sparse models requiring long context windows and multi-turn capabilities. Token efficiency is also shifting the cost landscape: a model like GPT-5 might cost more per token but solve complex tasks in fewer turns, proving cheaper overall.
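The arithmetic behind that last point is simple enough to sketch: what matters is cost per completed task, not cost per token. The prices and token counts below are illustrative, not measured figures from AA's data.

```python
# Worked example of why per-token price alone misleads: a pricier-per-token model
# that finishes a task in far fewer tokens can be cheaper per completed task.
# All prices and token counts are illustrative, not measured figures.
def cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    return price_per_million_tokens * tokens_per_task / 1_000_000

frontier = cost_per_task(price_per_million_tokens=10.0, tokens_per_task=40_000)  # $0.40
budget = cost_per_task(price_per_million_tokens=2.0, tokens_per_task=250_000)    # $0.50

print(f"Higher price per token, fewer tokens: ${frontier:.2f} per task")
print(f"Lower price per token, more tokens:   ${budget:.2f} per task")
```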
Artificial Analysis is actively preparing for the next wave of capability shifts, planning the Intelligence Index V4 update to incorporate GDPval-AA, Critical Point (hard physics reasoning), and hallucination rates, while retiring coding benchmarks that have become trivial for even smaller models. By continuously building and validating new evaluation frameworks, George Cameron and Micah Hill-Smith ensure that AA remains the critical, unbiased source of truth for the rapidly advancing capabilities and costs defining the AI ecosystem.

