We’re not at the foothills. That would imply there’s a direction up the mountain. We’re in the inchoate state of the Agentic AI industry: 99% of it is daily fodder, swirling star-dust that will eventually come together to form the pillars of the industry.
Everything is up in the air, orbiting the central need for autonomous, personalized Large Language Models (LLMs) that are affordable and capable of rapidly learning, or even evolving based on their own internal decisions. Infrastructure tools, AI Agents, LLM developers releasing newer models and newer modalities, AI Agent builder platforms, a rapidly expanding gamut. I could enumerate the list for you, but it grows by the hour.
My two cents: nobody has luxurious stability, including the LLM developers who form the first foundational layer of the industry. The structural makeup of the Agentic AI industry will likely confound us all.
What is clear is the importance of data. Indeed, data is the lifeblood of pre-training. The venerated Ilya Sutskever echoed just that at NeurIPS 2024, hinting that SSI’s foundation model is being trained on data beyond what the open internet has to offer. But data is also the lifeblood of fine-tuning and of inference, where the technique of Retrieval Augmented Generation (RAG) gives LLMs task-specific utility. That is where LLMs and their applications become useful to us, beyond the representations they learned during pre-training.
Access to premium, fresh data is now the fulcrum on which LLMs ride or die. It’s one of the last miles of LLM value, and it has become the difference between a trial and a production deployment.
LLMs can provide information or context from what they’ve learned, but their knowledge is limited to their training corpus. This inherent limitation means they cannot directly access and respond with up-to-date, contextually relevant information in real time. An Agentic AI workflow that trades stocks at high frequency isn’t useful if it can’t retrieve real-time data points and then learn from its performance. A simple prompt to ChatGPT is futile if it can’t access the latest article or research paper to substantiate its response.
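To make the RAG pattern concrete, here is a minimal, illustrative sketch: retrieve the freshest, most relevant documents first, then fold them into the prompt before it ever reaches the model. The corpus, the bag-of-words scoring, and the prompt format are toy assumptions of mine, not any vendor’s actual API.

```python
# Toy sketch of Retrieval Augmented Generation (RAG), assuming a tiny
# in-memory corpus and a bag-of-words similarity score. A real system
# would use an embedding model and a vector database instead.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved, up-to-date context to the user's question."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical fresh documents the base model has never seen.
corpus = [
    "AAPL closed at 229.87 today, up 1.2 percent on earnings news.",
    "A new paper on retrieval augmented generation was published this week.",
    "The weather in Paris is mild with light rain expected.",
]
prompt = build_prompt("What did AAPL stock do today?", corpus)
```

Without the retrieval step, the model could only answer from its frozen training corpus; with it, the prompt itself carries today’s closing price.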
So the future of artificial intelligence is increasingly defined by gated access, where data, once freely available, is becoming a resource to be compensated for with each byte of access.
The process started with OpenAI’s content-deal partnerships. Of course, they scraped key web sources to pre-train their models, including Wikipedia, Stack Overflow, and various news outlets, and user prompts drew on the model’s learned representation of that data. But industry dissatisfaction pressured them to throw in the towel and play fair on data copyright. They began a long run of content partnerships to secure high-quality data for model pre-training and to provide access to reliable, up-to-date information for AI outputs.
To date, OpenAI has signed over 30 partnerships and counting. The largest was the News Corp agreement, reportedly valued at over $250 million over five years. Their three latest announced deals were with Future, Hearst, and GEDI, following earlier agreements with Condé Nast, Los Alamos National Lab, TIME, Apple, Reddit, Vox Media, WAN-IFRA, Dotdash Meredith, Stack Overflow, Financial Times, Le Monde, PRISA, and Axel Springer. Google also has a reported $60 million annual licensing deal with Reddit for training data, and Perplexity AI is actively building content partnerships that likely serve both pre-training and RAG purposes as well.
