Winner-take-all isn’t panning out for Agentic AI. What’s emerging instead is a symbiotic ecosystem of interconnected pieces working in harmony. The top spot keeps changing hands at a dizzying pace; the reign is so transient that the crown itself is a construct, a fallacy rather.
‘Cambrian explosion’ is certainly an apt characterization of the field nowadays, with the small but catalytic rise in oxygen levels played by the introduction of ChatGPT, powered by the GPT-3.5 foundation model, back in November of 2022. Changing the course of human evolution ever since, it spurred a race of blow-for-blow breakthroughs, with LLM developers releasing better models by the week. We can practically hold our breath waiting for the latest and greatest model to hit our x.com feeds.
As for OpenAI’s head start, it has been eclipsed.
Pilloried over the past 18 months for their meagre competitive offering in the LLM space, Google, until now offering the lowest cost for the least useful model, opened the largest fissure yet with their new model’s release. They had been the most perplexing, underachieving technology player in the ecosystem: they developed key AI tools like TensorFlow and TPUs, acquired DeepMind, and their researchers invented the attention mechanism, no less, which gave rise to LLMs before many of those researchers were hired away by OpenAI. Google’s reprieve came this week with the debut of Gemini 2.0 Flash, which surpasses the performance of GPT-4o. They quickly iterated and released another killer fine-tuned offering: Deep Research.
This pace is accelerating. According to Artificial Analysis, the authoritative body on all aspects of the LLM ecosystem, new chasms open each week between the leader and the pack, with newer labs like Meta and Mistral starting to pull ahead, a feat accomplished in a mere 18 months. They might as well take up measuring innovation velocity. In the meantime, they measure the key performance qualities of LLMs: speed, cost, context window, and accuracy, among others. As a collective, the industry has adopted benchmarks for these qualities, feeding models into standardized tests that give us a sense of how well each performs compared to the others.
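In practice, such a standardized test boils down to running every model over the same graded dataset and tallying accuracy and latency. Here is a minimal sketch of that loop; the two-question dataset and the `ask_model` helper are hypothetical stand-ins for a real benchmark and a provider’s SDK call.

```python
# Toy benchmark harness: run each model over a shared, graded dataset.
import time

DATASET = [  # hypothetical stand-in for a real benchmark like MMLU
    ("What is 17 * 3?", "51"),
    ("What is the capital of France?", "Paris"),
]

def ask_model(model: str, question: str) -> str:
    """Hypothetical stand-in for a provider SDK call; returns canned answers here."""
    canned = {"What is 17 * 3?": "51", "What is the capital of France?": "Paris"}
    return canned[question]

def evaluate(model: str) -> dict:
    """Score one model on accuracy and wall-clock time over the dataset."""
    correct, start = 0, time.perf_counter()
    for question, expected in DATASET:
        if expected.lower() in ask_model(model, question).lower():
            correct += 1
    return {
        "model": model,
        "accuracy": correct / len(DATASET),      # quality
        "seconds": time.perf_counter() - start,  # speed, the other key metric
    }

print(evaluate("hypothetical-model-a"))
```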
These often include datasets that test reasoning, math, coding, and general knowledge. Popular benchmarks include MMLU (Massive Multitask Language Understanding), which assesses general knowledge; GPQA (Graduate-Level Google-Proof Q&A), which tests for higher-order reasoning; and SWE-bench for coding proficiency. More recently, Elo scores, borrowed from chess rankings, are used to compare model performance based on head-to-head matchups (sketched below). Beyond these, more nuanced benchmarks are becoming increasingly important: BIG-Bench Hard, which explores complex reasoning and problem-solving; HellaSwag, which probes for common-sense reasoning; and ARC-AGI, which tests the ability to reason in novel situations. Together these scores provide a more comprehensive view of relative strengths and weaknesses among models, rather than absolute performance on a single task.
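For the curious, the Elo update itself is simple. A minimal sketch follows; the 1000-point starting rating and the K-factor of 32 are conventional chess-style assumptions, not values from any particular LLM leaderboard.

```python
# Standard Elo rating update, applied to a head-to-head model matchup.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one matchup; k sets the step size."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b + k * (e_a - s_a)

# Two models start at an assumed baseline of 1000; A wins this matchup.
a, b = update(1000.0, 1000.0, a_won=True)
print(round(a), round(b))  # 1016 984: rating flows from loser to winner
```

Repeated over thousands of matchups, these per-game updates converge to a ranking that reflects win probabilities rather than a score on any single test.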
But suffice it to say, there aren’t enough benchmarks out there. Each new LLM gains new capabilities, new nuances, new specializations at esoteric tasks. Some offer cheaper memory, while others excel at fast-clip arithmetic tasks. Some are open-sourced, while others are closed, or quasi-open with open weights. They’re a strategic national interest for every country. The hottest capability now is reasoning, extrapolated further by leading Agentic AI investor Elad Gil as paying for ‘units of cognition.’ The latest ecosystem survey by Artificial Analysis corroborates all of this. Context windows jumped 32x over the last year and are now standard at 128k tokens. Moreover, 72% of developers now opt for off-the-shelf models, leveraging the rapid iteration cycles of LLM capabilities instead of costly fine-tuning. Adoption is in hyperdrive. Not to mention multimodal models (image, speech, and video), which are in a league of their own. OpenAI’s Sora and Google’s Veo 2 just set new benchmarks. Cartesia and ElevenLabs outperform the hyperscalers’ dated text-to-speech models. And transcription tools have slashed costs by 72x with speed gains of 200x: an hour of audio transcribed in 10 seconds. Speed is now table stakes; cost now drives decisions.
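Those multipliers are easy to sanity-check with back-of-the-envelope arithmetic. In the sketch below, the legacy transcription price is a hypothetical illustration; the 32x, 72x, and 10-second figures come from the survey claims above.

```python
# Back-of-the-envelope checks on the survey figures (legacy price is hypothetical).
standard_context = 128_000
print(f"{standard_context / 32:.0f} tokens")  # ~4000: the standard a year ago, per the 32x jump

audio_seconds, transcribe_seconds = 3600, 10
print(f"{audio_seconds / transcribe_seconds:.0f}x real time")  # 360x faster than real time
# Note: the quoted 200x is a speedup over older tools, not a real-time factor.

legacy_cost_per_hour = 1.44                      # hypothetical legacy $/audio-hour
print(f"${legacy_cost_per_hour / 72:.2f}/hour")  # $0.02 after the 72x cost cut
```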
Welcome to choice overload. The “best” model for a given task is actually a moving target, and the pendulum swings quickly. Just a few weeks ago, OpenAI announced o3, the successor to their reasoning model o1 (which some critics say is in fact an AI Agent tapping multiple LLMs under the hood). o3 has achieved record-breaking performance on coding and math benchmarks, scoring 96.7% on AIME and 71.7% on SWE-bench, roughly 20 points better than o1. It also triples o1’s score on the ARC-AGI benchmark’s novel reasoning questions. The model is available for public safety testing. Pricing? A fortune.
