Winner-takes-all isn’t panning out for Agentic AI. It’s going to be a symbiotic relationship of interconnected pieces working in harmony. The top spot on the mantle keeps being dethroned, and it’s dizzying us all. That spot is transient; a construct, a fallacy rather.
‘Cambrian explosion’ is certainly an apt characterization of the field nowadays, with the catalytic rise in oxygen levels playing the part of ChatGPT’s introduction, powered by the GPT-3.5 foundation model, back in November 2022. It has been changing the course of the industry ever since, spurring a race of blow-for-blow breakthroughs, with LLM developers releasing better models by the week. We can all but hold our breath for the latest and better model to make it to our x.com feeds.
As for OpenAI’s head start, it has been eclipsed.
Pilloried over the past 18 months for its meagre competitive offering in the LLM space, where aside from the lowest cost for the least useful model it had little to show, Google had become the most perplexing, underachieving technology player in the ecosystem. This is the company that developed key AI tools like TensorFlow and TPUs, and acquired DeepMind. Its researchers created the attention mechanism, no less, giving way to LLMs, only to be poached by OpenAI later. Google’s reprieve came this week with the debut of Gemini 2.0 Flash, surpassing the performance of GPT-4o, and it quickly iterated with another killer release: Deep Research.

This pace is accelerating. According to Artificial Analysis, the authoritative body on all aspects of the LLM ecosystem, chasms open each week separating the leader from the pack, and newer labs like Meta and Mistral are starting to pull ahead, a feat accomplished in a mere 18 months. They might as well take up measuring innovation velocity. In the meantime, they measure the key performance qualities of LLMs: speed, cost, context window, and accuracy, among others. As a collective, the industry has adopted benchmarks to measure these, feeding them into standardized tests that give us a sense of how well each model performs compared to the others.
These benchmarks often include datasets that test reasoning, math, coding, and general knowledge. Popular ones include MMLU (Massive Multitask Language Understanding), which assesses general knowledge; GPQA (Graduate-Level Google-Proof Q&A), which tests for higher-order reasoning; and SWE-bench for coding proficiency. More recently, Elo scores, borrowed from chess rankings, are used to compare model performance based on head-to-head matchups. Beyond these, more nuanced benchmarks are becoming increasingly important: BIG-Bench Hard, which explores complex reasoning and problem-solving; HellaSwag, which probes common-sense reasoning; and ARC-AGI, which tests the ability to reason in novel situations. Together, these scores provide a view of relative strengths and weaknesses among models, rather than absolute performance on a single task.
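To make the Elo comparison concrete, here is a minimal sketch of how head-to-head votes translate into ratings, using the standard Elo update that chatbot leaderboards build on (the K-factor and 1500 starting rating are conventional choices, not values from any specific leaderboard):

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one head-to-head matchup."""
    expected_a = elo_expected(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1500; model A wins one matchup.
a, b = elo_update(1500.0, 1500.0, a_won=True)
print(a, b)  # 1516.0 1484.0
```

Thousands of such pairwise votes, aggregated this way, yield a ranking that keeps adjusting as new models enter the arena.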
But suffice it to say, there aren’t enough benchmarks out there. Each new LLM gains new capabilities, new nuances, new specializations at esoteric tasks. Some offer cheaper memory, while others excel at fast-clip arithmetic tasks. Some are open-sourced, others closed, or quasi-open with open weights. They’re a strategic national interest of every country. The hottest capability now is reasoning, extrapolated further by leading Agentic AI investor Elad Gil as paying for ‘units of cognition.’ The latest ecosystem survey by Artificial Analysis corroborates all of this. Context windows jumped 32x over the last year and are now standard at 128k tokens. Moreover, 72% of developers now opt for off-the-shelf models, leveraging the rapid iteration cycles of LLM capabilities instead of costly fine-tuning. Adoption is in hyperdrive. Not to mention multimodal models for image, speech, and video, which are in a league of their own. OpenAI’s Sora and Google’s Veo 2 just set new benchmarks. Cartesia and ElevenLabs outperform the hyperscalers’ dated text-to-speech models. And transcription tools have slashed costs by 72x with speed gains of 200x; an hour of audio transcribed in 10 seconds. Speed is now table stakes; cost now drives decisions.
Welcome to choice overload. The “best” model for a given task is a moving target, and this pendulum swings quickly. Just a few weeks ago, OpenAI unveiled o3, the successor to its o1 reasoning model (which some critics say is in fact an AI Agent tapping multiple LLMs). o3 has achieved record-breaking performance on coding and math benchmarks, scoring 96.7% on AIME and 71.7% on SWE-bench, 20% better than o1. It also roughly triples o1’s score on the ARC-AGI benchmark of novel reasoning questions. The model is currently available for public safety testing. Pricing: a fortune.
Google reveled in its glory for all of one week. Lesson learned: you simply can’t pick a winner and side with one model or LLM developer; we know better than to be so cavalier. We should diversify and allocate our ‘investments’ appropriately.
Jeff Dean, Chief Scientist across Google DeepMind and Google Research, backed a startup that does just that. Not Diamond is wagering that the future of AI isn't about finding the single “god model,” but about intelligently navigating the increasingly complex constellation of available options. Their bet? LLM routing. "We bring order to this chaos, ensuring developers and businesses can consistently access the optimal model for their specific needs, regardless of the shifting landscape," said COO and co-founder Jeffrey Akiki. They achieve this by ensembling the available models into a meta-model that learns when to call each LLM.
“Our out-of-the-box router is trained on robust cross-domain evaluation datasets, ranging from code generation to summarization, medicine, and law—allowing us to learn a mapping from inputs to model rankings. This lets us predict the ranking of models for any new prompt.”
Teams can leverage Not Diamond’s framework to efficiently train custom routers using their own data, and to improve routing in real time based on feedback.
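Conceptually, the approach looks something like the sketch below: a lightweight meta-model learns, from per-model evaluation scores, to map a prompt to a predicted ranking of models, then dispatches to the top one. This is an illustrative toy under assumed interfaces, not Not Diamond’s actual implementation; the embedding stand-in, model names, and training loop are all assumptions.

```python
import numpy as np

# Candidate models the router can dispatch to (illustrative names).
MODELS = ["gpt-4o", "claude-3-5-sonnet", "gemini-2.0-flash", "llama-3.1-70b"]

def embed(prompt: str) -> np.ndarray:
    """Stand-in for a real text embedding model (e.g., a sentence encoder)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.normal(size=64)

class ToyRouter:
    """Learns one linear scoring head per model from (prompt, eval-score) pairs."""

    def __init__(self, dim: int = 64):
        self.weights = {m: np.zeros(dim) for m in MODELS}

    def fit(self, prompts, quality_scores, lr=0.05, epochs=20):
        # quality_scores[i][m] is an evaluation score for model m on prompt i,
        # e.g. drawn from cross-domain eval datasets (code, summarization, law...).
        for _ in range(epochs):
            for prompt, scores in zip(prompts, quality_scores):
                x = embed(prompt)
                for m in MODELS:
                    pred = self.weights[m] @ x
                    self.weights[m] += lr * (scores[m] - pred) * x

    def route(self, prompt: str) -> list:
        """Return models ranked best-first for this prompt."""
        x = embed(prompt)
        return sorted(MODELS, key=lambda m: self.weights[m] @ x, reverse=True)

router = ToyRouter()
# router.fit(train_prompts, train_scores)  # fit on labeled evals first
print(router.route("Summarize this contract clause...")[0])
```

In production, the labels would come from robust cross-domain evaluation datasets, and live feedback would keep updating the scoring heads, which is roughly the loop the quote above describes.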

Their journey began in late 2023 with notdiamond-0001, a solution focused on intelligently dispatching queries between GPT-3.5 and GPT-4, promising tangible gains in accuracy, cost, and latency. The platform has since evolved dramatically, now offering a selection of over 40 pre-configured LLMs, including the latest models like o1-preview, as well as support for custom models and any arbitrary inference endpoint. Beyond basic routing, it provides a robust set of features: RAG auto-optimization, tools for training custom routers, cost and latency tradeoffs, function calling, structured outputs, and personalized routing with feedback. It’s now a comprehensive platform designed to intelligently manage the complex, ever-growing landscape of LLMs.

The benefits of their technology are clear: superior performance combined with cost efficiency. Their router delivers state-of-the-art results, consistently outperforming individual foundation models. For instance, on the MMLU benchmark, Not Diamond achieves higher accuracy than GPT-4o while reducing costs by approximately 30%.
Beyond performance gains, they also pitch the technology as a potential safety mechanism: it allows for a more distributed, specialized approach to AI, rather than reliance on monolithic, general-purpose models.
And as the OpenAI outage earlier this month made plain, it also serves as a guardrail to maintain uptime for applications in production.
The startup provides its routing technology through an API and a chat application. In the chat app, users can refine routing preferences by giving feedback, which improves their experience in real time. The app also offers an “Arena Mode” for comparing model responses or regenerating outputs with other models. Additionally, its “Smart Tradeoffs” feature optimizes performance by routing simpler queries through free-tier models to maintain efficiency and control costs. Built on the Not Diamond API, the chat app dynamically identifies the best model for each query and enhances recommendations through continuous feedback.
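To illustrate the spirit of “Smart Tradeoffs,” here is a minimal sketch of cost-aware routing: estimate a query’s difficulty and fall back to a free-tier model when the expected quality gain doesn’t justify a premium call. The difficulty heuristic, threshold, and model labels are invented for illustration; a real system would use a learned difficulty predictor.

```python
def estimate_difficulty(query: str) -> float:
    """Crude stand-in for a learned difficulty predictor: longer queries
    and ones with hard-task markers score as more difficult (0.0 to 1.0)."""
    hard_markers = ("prove", "debug", "optimize", "derive", "refactor")
    score = min(len(query) / 500, 1.0)
    score += 0.5 * sum(marker in query.lower() for marker in hard_markers)
    return min(score, 1.0)

def smart_tradeoff_route(query: str, cost_sensitivity: float = 0.5) -> str:
    """Route easy queries to the free tier; escalate hard ones to a paid model.
    Raising cost_sensitivity raises the bar for paying the premium."""
    if estimate_difficulty(query) > cost_sensitivity:
        return "premium-model"
    return "free-tier-model"

print(smart_tradeoff_route("What's the capital of France?"))  # free-tier-model
print(smart_tradeoff_route("Debug this race condition in my async scheduler, "
                           "then derive the invariant that prevents it."))  # premium-model
```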
Launched only in August 2024, Not Diamond has amassed thousands of developers who have made millions of API calls.
Not Diamond’s LLM routing technology is already fully compatible with Agentic workflows. Their documentation includes an example that demonstrates how to create a custom router directing queries between two RAG agents, dynamically selecting between weak and strong models within each agent to balance accuracy, cost, and latency. As autonomous LLM-powered applications rapidly move beyond their initial stages, Not Diamond is at the forefront, enabling users to optimize agent selection based on specific parameters—ensuring optimal performance.
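A minimal sketch of that two-stage pattern, under assumed interfaces (the class names, keyword heuristics, and weak/strong model pairings are illustrative, not Not Diamond’s actual SDK):

```python
from dataclasses import dataclass

@dataclass
class RAGAgent:
    """A retrieval-augmented agent paired with a weak (cheap) and strong (accurate) model."""
    name: str
    weak_model: str
    strong_model: str

    def answer(self, query: str, use_strong: bool) -> str:
        model = self.strong_model if use_strong else self.weak_model
        # A real agent would retrieve documents here and call the chosen LLM.
        return f"[{self.name}/{model}] answer to: {query}"

docs_agent = RAGAgent("docs", weak_model="gemini-2.0-flash", strong_model="gpt-4o")
code_agent = RAGAgent("code", weak_model="llama-3.1-70b", strong_model="claude-3-5-sonnet")

def route(query: str) -> str:
    # Stage 1: pick the agent whose corpus fits the query (toy keyword heuristic).
    agent = code_agent if any(w in query.lower() for w in ("code", "bug", "api")) else docs_agent
    # Stage 2: escalate to the strong model only when the query looks hard,
    # trading accuracy against cost and latency.
    use_strong = len(query) > 200 or "why" in query.lower()
    return agent.answer(query, use_strong=use_strong)

print(route("Where is the billing API documented?"))
```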
Surveys from LangChain, Langbase, and CrewAI all confirm that Agentic AI workflows are moving from research to production, with the share of respondents reporting working, implemented systems ranging from 16% to 51%.
Anthropic recently penned a meticulous piece that echoes the value of Not Diamond’s offering in such a setting. Defining Agentic AI workflows as “LLMs and tools orchestrated through predefined code paths,” and AI Agents as “systems where LLMs dynamically direct their own processes and tool usage while maintaining control,” they explicitly advise using routing within Agentic AI workflows, especially for specialized, more complex tasks.
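Anthropic’s routing workflow boils down to: classify the input first, then dispatch along a predefined path to a specialized handler. A minimal sketch, assuming a hypothetical `llm()` completion helper (any chat-completion client would slot in):

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire up your LLM client here")

# Specialized handlers: each pairs a task type with its own prompt
# (and, in practice, its own model choice and tools).
HANDLERS = {
    "refund": lambda q: llm(f"You are a refunds specialist. Handle: {q}"),
    "technical": lambda q: llm(f"You are a support engineer. Debug: {q}"),
    "general": lambda q: llm(f"Answer this customer question: {q}"),
}

def route_and_handle(query: str) -> str:
    # Step 1: a cheap classification call picks the category.
    category = llm(
        "Classify this customer query as exactly one of "
        f"refund, technical, or general:\n{query}"
    ).strip().lower()
    # Step 2: dispatch along the predefined code path (default: general).
    return HANDLERS.get(category, HANDLERS["general"])(query)
```

The LLM decides the branch, but the code paths themselves are fixed in advance, which is exactly what distinguishes a workflow from a fully autonomous agent.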
The sector is exploding, with close to 800 venture-backed AI Agent startups focused on building these applications, and countless more bootstrapped participants; the number of agents themselves is widely expected to reach the millions. Not Diamond is gearing up to be the smart navigator, matching tasks to the most appropriate AI Agent. “Errors in agentic workflows can escalate rapidly, particularly in complex systems where each step builds on the previous one. This highlights the importance of selecting the right model or agent at every stage, enabling enterprises to maintain high accuracy and prevent failures.”
Akiki sees a future where routing technology becomes the central nervous system for AI workflows.
The startup is jostling with a small group of competitors, including Aurelio AI, Martian, Storytell.ai, Unify, and Aguru. While the concept of model routing isn’t entirely novel, Not Diamond sets itself apart with an adaptable framework that can take any evaluation data, over any set of models, for any set of inputs, and learn an optimal recommendation algorithm tailored to virtually any use case.
Nevertheless, in these waters, the riptide is vicious. DeepSeek just released its open-source DeepSeek-R1 model, at parity with OpenAI’s o1 in performance but at roughly 95% lower cost per input/output token. While others swim against the full force of the resistance, Not Diamond is betting with the current. The only caveat: a platform play by an incumbent alliance might make some waves of its own. Case in point, Amazon, which poses as an ecosystem player while introducing competitive applications every other week. Indeed, it now offers ‘prompt routing,’ though critics call it smart shuffling within the same model family. It has also thrown its hat into the multi-agent orchestration ring. And remember when it released Nova on Bedrock, claimed to be a state-of-the-art foundation model to compete with benchmark leaders like OpenAI and Anthropic (hyperscaling needs a new definition)? We don’t either.
Competition isn’t giving way. Until the dust settles, we’ll tap Not Diamond’s brain for a balanced portfolio and proven outperformance.
Not Diamond previously raised $2.3 million in pre-seed funding, led by defy.vc, with participation from Jeff Dean, Tom Preston-Werner, Ion Stoica, Zack Kass, Julien Chaumond, and former LinkedIn CEO Jeff Weiner. It was founded by serial entrepreneurs and AI experts Tomás Hernando Kofman (CEO), Tze-Yang Tung (CTO), and Jeffrey Akiki (COO).

