We’re not at the foothills. That would imply there’s a direction up the mountain. The Agentic AI industry is still inchoate: 99% of today’s activity is daily fodder, stray stardust that will eventually coalesce into the industry’s pillars.
Everything is up in the air, orbiting the central need for autonomous, personalized Large Language Models (LLMs) that are affordable and capable of rapidly learning, or even evolving based on their own internal decisions. Infrastructure tooling, AI Agents, LLM developers releasing newer models and modalities, AI Agent builder platforms, a rapidly expanding gamut. I could enumerate the list for you, but it grows by the hour.
My two cents: nobody has the luxury of stability, including the LLM developers who are the industry’s first foundation. The eventual structure of the Agentic AI industry will likely confound us all.
What is clear is the importance of data. Data is the lifeblood of pre-training; the venerated Ilya Sutskever echoed as much at NeurIPS 2024, hinting that SSI’s foundation model is being trained on data beyond what the open internet has to offer. But data is also the lifeblood of fine-tuning and of inference, where techniques like Retrieval Augmented Generation (RAG) give LLMs task-specific utility. That’s where LLMs and their applications become useful to us, beyond the representations learned during pre-training.
Access to premium, fresh data is now where LLMs ride or die. It’s one of the last miles of LLM value, and it has become the difference between a trial and a production deployment.
LLMs can provide information or context from what they’ve learned, but their knowledge is limited to their training corpus. This inherent limitation means they cannot directly access and respond with up-to-date, contextually relevant information in real time. An Agentic AI workflow that trades stocks at high frequency isn’t useful if it can’t retrieve real-time data points and then learn from its own performance. A simple prompt to ChatGPT is futile if it can’t access the latest article or research paper to substantiate its response.
So the future of artificial intelligence is increasingly defined by gated access, where data, once freely available, becomes a resource to be compensated for with each byte of access.
The process started with OpenAI’s content partnerships. They had, of course, scraped key web sources to pre-train their models, including Wikipedia, Stack Overflow, and various news outlets; user prompts drew on the model’s internal representation of that data. But industry dissatisfaction pressured them to throw in the towel and conform to fair play on data copyright. They began a long run of content partnerships to secure high-quality data for model pre-training and to provide access to reliable, up-to-date information for AI outputs.
To date, OpenAI has signed over 30 partnerships and counting. The largest was the News Corp agreement, reportedly valued at over $250 million over five years. Its latest announced deals were with Future, Hearst, and GEDI, preceded by Condé Nast, Los Alamos National Lab, TIME, Apple, Reddit, Vox Media, WAN-IFRA, Dotdash Meredith, Stack Overflow, Financial Times, Le Monde, PRISA, and Axel Springer. Google has a reported $60 million annual licensing deal with Reddit for training data, and Perplexity AI is actively building content partnerships that likely serve both pre-training and RAG purposes.
To consolidate its latest RAG capabilities, OpenAI launched SearchGPT, a product that integrates fresh content from its partners into ChatGPT models; it was soon repackaged as a simple ‘web search’ toggle within the standard ChatGPT interface. Competitors offer the same capability: Google’s AI Studio calls the process “grounding,” while Perplexity offers it natively, along with a beta API for developers. HuggingFace Chat also includes web search, as well as tools to call other functions directly within a chat prompt.
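For developers, tapping one of these grounded modes is a one-call affair. Here’s a minimal sketch against Perplexity’s OpenAI-compatible API; the model name is illustrative and may have changed since writing.

```python
# Minimal sketch: web-grounded answers via Perplexity's OpenAI-compatible
# chat API. The model name is illustrative and subject to change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PERPLEXITY_API_KEY",
    base_url="https://api.perplexity.ai",  # Perplexity's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="llama-3.1-sonar-small-128k-online",  # an "online" model that searches the web
    messages=[{"role": "user", "content": "What did OpenAI announce this week?"}],
)
print(response.choices[0].message.content)
```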
When RAG and web search do not meet user expectations, function calling becomes an alternative. This method lets the developer declare which API sources the LLM may call; the model requests the data, the application executes the call, and the result is folded into the final output. It relies on the user already knowing the right source.
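A minimal sketch of that loop with the OpenAI chat API, where the declared source is a hypothetical stock-quote wrapper of my own naming:

```python
# Function-calling sketch: the developer declares which data source
# (here, a hypothetical stock-quote API) the model may call.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_quote",  # hypothetical wrapper around a market-data API
        "description": "Fetch the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's NVDA trading at right now?"}],
    tools=tools,
)

# The model responds with a structured call; the app executes it and
# feeds the result back in a follow-up message for the grounded answer.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```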
A new France-based startup is tackling these drawbacks with a nuanced approach that aims to bring the long tail of content and premium data to LLM applications and AI Agents. Linkup, founded in November this year, has developed a RAG API that LLMs can tap to furnish their users with access to its network of content. “We augment LLMs with fresh, premium information in a legal way,” explained CEO and co-founder Philippe Mizrahi.
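In practice, calling such a RAG API could look something like the sketch below; the endpoint and field names are assumptions for illustration, not Linkup’s documented contract.

```python
# Hypothetical sketch of querying a RAG API like Linkup's. The endpoint
# and field names are assumptions, not the documented API.
import requests

resp = requests.post(
    "https://api.linkup.so/v1/search",             # assumed endpoint
    headers={"Authorization": "Bearer YOUR_LINKUP_API_KEY"},
    json={
        "q": "Latest ECB rate decision coverage",  # natural-language query
        "depth": "standard",                       # assumed depth parameter
        "outputType": "sourcedAnswer",             # answer plus source attribution
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```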
While function calling and web search offer partial solutions, they have limitations. Web search depends on search engine indexing and retrieval techniques, which may not fully capture the breadth of available information; the very existence of multiple engines like Bing, DuckDuckGo, and Brave, each addressing disparities in results, attests to that.
For AI Agents, techniques such as web scraping are employed, using tools like Firecrawl and SerpAPI, or agentic workflows built with frameworks like LangGraph. However, these methods may not always surface the specific data a user requires, especially when the information sits inside sources accessible through Linkup’s growing network. Web scraping can also be slow, making it less feasible for real-time applications.
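For context, the scraping route looks roughly like this sketch: use a SERP API to find candidate pages, then fetch and clean them yourself, the slow and lossy path described above.

```python
# Sketch of the scraping-style alternative: find candidate pages via
# SerpAPI, then fetch one. The raw HTML still needs cleaning and chunking.
import requests
from serpapi import GoogleSearch  # pip install google-search-results

results = GoogleSearch({
    "q": "ECB rate decision analysis",
    "api_key": "YOUR_SERPAPI_KEY",
}).get_dict()

top_url = results["organic_results"][0]["link"]
html = requests.get(top_url, timeout=30).text  # extraction and dedup come after this
print(top_url, len(html))
```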
“Linkup is 15x faster than web scraping methods to retrieve data,” commented Mizrahi. “We integrated directly with our partners’ backend systems, to serve data in the most clean and efficient way for our consumers.”
Linkup fashions itself as a search-and-access engine designed specifically for LLMs, providing a marketplace where these models can reach premium content through legal and ethical means. It inks direct partnerships with premium content providers across domains such as news, research, and specialist data publications, including AFP and various news outlets. And it does so at scale. “Most of that [partnership] money actually goes to the source,” added Mizrahi.
Of course, there’s Grok, natively integrated with X.com. And X.com is indeed the fastest news source available today, which serves Grok well on news-related queries, although fact-checking isn’t its forte.
Linkup’s offering becomes increasingly relevant as the incumbents move at a snail’s pace in bringing the long tail of content into the hands of LLM applications. “The Le Monde x OpenAI partnership took one year to sign, and so many resources,” explained Mizrahi, pointing to the inefficiency of bilateral agreements. “Our licensing platform is also where we’re doing the hard work, relieving content providers of their battle with negotiations.”
Anthropic is also in the mix, through a more universal approach. It released the Model Context Protocol (MCP), an open standard for connecting AI to any data source. Linkup recently announced a native integration with the protocol, giving Anthropic’s desktop app users, and developers building on its APIs, a direct, scalable channel to Linkup’s curated content.
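To make that concrete, here’s a minimal sketch of an MCP server exposing a fresh-content search tool, built with the MCP Python SDK’s FastMCP helper; the retrieval call inside is a hypothetical stand-in, not Linkup’s actual client.

```python
# Minimal MCP server sketch exposing one search tool. The retrieval
# logic is a hypothetical stub standing in for a real content API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("fresh-content")

@mcp.tool()
def search_premium_content(query: str) -> str:
    """Retrieve fresh, licensed content relevant to the query."""
    # Hypothetical; swap in the provider's real client here.
    return f"(stub) licensed results for: {query}"

if __name__ == "__main__":
    mcp.run()  # speaks over stdio, so any MCP client can attach
```

An MCP client such as Anthropic’s desktop app registers a server like this in its configuration and can then invoke the tool from any conversation.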

“This augmentation with fresh information, now more easily accessible through MCP’s open architecture, allows LLMs to provide better overall responses.”
Revisiting Agentic AI workflows: an MCP architecture powered by Linkup gives an AI Agent, or a swarm of Agents, the ability to retrieve relevant data without the source being preset. That dynamic retrieval is what epitomizes Agentic AI.
The MCP integration, which makes the connection more direct and scalable, comes on the heels of an integration with LangChain that simplifies things for AI Agent developers seeking to leverage Linkup’s capabilities. More recently, Linkup integrated with Phospho (Tak), an LLM chat web-search startup going head-to-head with ChatGPT, Perplexity, You.com, Tavily, Liner, and the like.
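On the LangChain side, exposing a retrieval source to an agent is a matter of wrapping it as a tool. A sketch, with the inner call again a hypothetical stand-in for Linkup’s SDK:

```python
# Sketch: wiring a retrieval source into a LangChain/LangGraph agent
# as a tool. The body is a hypothetical stand-in, not Linkup's SDK.
from langchain_core.tools import tool

@tool
def linkup_search(query: str) -> str:
    """Search a licensed content network for fresh information."""
    # Hypothetical; replace with the official integration or an HTTP call.
    return f"(stub) results for: {query}"

# Any LangChain/LangGraph agent can now select this tool at runtime,
# without the user pre-specifying the source.
print(linkup_search.invoke({"query": "Banque de France latest bulletin"}))
```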
Moving in the opposite direction from Linkup is Cloudflare, which is nearing the launch of a security network that lets websites and data providers charge LLM applications for each byte of access to their data. LLMs reach external data through bots that crawl a given site, and Cloudflare is girding to let website owners monetize that traffic and “set a fair price for the right to scan content and transact seamlessly.” The rollout is expected in Q1 2025.
A wild fix would be daily retraining of foundation LLMs. That could eliminate the need for complex grounding and function calls altogether by keeping knowledge inherently current rather than frozen at the pre-training cutoff. But the cost would be astronomical: with single training runs of models like Llama 3 costing tens of millions of dollars and GPT-4 potentially reaching hundreds of millions, even a conservative $50 million per run repeated 365 times a year lands around $18 billion annually for a single model. That’s why RAG is so economical today.
Bypassing the limitations of static training data, RAG enhances LLMs by combining the strengths of pre-trained models with external knowledge sources. It’s an area of intense innovation. Pleias, for example, introduced a method that analyzes LLMs’ internal attention scores to better understand how they process retrieved sources, aiming to improve the reliability and accuracy of citations and overall information usage. There are also graph-based RAG techniques, query expansion, re-ranking, multi-hop retrieval, hybrid retrieval, and iterative retrieval. All of them aim to raise RAG performance by making better use of available knowledge through more effective retrieval, moving beyond simple keyword search over chunked source data.
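For readers new to the mechanics, here’s a toy illustration of the baseline retrieval step those techniques improve on: chunk the sources, embed them, rank by cosine similarity, and prepend the top chunk to the prompt. A bag-of-words count stands in for a real embedding model purely to keep the sketch self-contained.

```python
# Toy RAG retrieval: rank chunks by cosine similarity to the query,
# then ground the prompt with the best match. Bag-of-words counts
# stand in for a learned embedding model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # stand-in for a real embedding

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "The ECB held rates steady in its December meeting.",
    "Graph-based RAG links entities across documents.",
    "Re-ranking reorders retrieved chunks with a cross-encoder.",
]

query = "What did the ECB decide on rates?"
ranked = sorted(chunks, key=lambda c: cosine(embed(query), embed(c)), reverse=True)
prompt = f"Context:\n{ranked[0]}\n\nQuestion: {query}"  # top chunk grounds the LLM
print(prompt)
```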
Building better foundation models is increasingly ineffective when blithely tapping the open internet for training data. Much of the web’s content, once considered the ultimate source of truth, is now degraded: it’s often filled with low-quality information, a growing share of it generated by AI itself. It’s this rapid saturation of the internet with “poor quality data,” as Linkup CEO Philippe Mizrahi puts it, that further sharpens the need for access to high-quality information.
Thus far, its network of content partners includes renowned institutions like the Centre Français de la Copie (CFC), Cour de Cassation, Banque de France, and INSEE, alongside prominent media outlets such as VSD, Société, HEROES, TÊTU, Public, and Gomantak Times. Access to these sources lends LLM applications high-quality content spanning legal documents, statistical and economic data, news articles, and culturally relevant publications.
By Mizrahi’s estimates, this market will reach between $50 billion and $100 billion annually in the coming years. He reckons that figure excludes existing AI models’ retrieval of ‘stolen data’, a nod to Linkup’s efforts to ensure ethically sourced access.
Linkup is funded with $3.2 million from Seedcamp, Axeleo Capital, Motier Ventures, Financière Saint James, OPRTRS Club, Kima Ventures and several media angels.
“We’ll see so many Agents that are specialized and need to work together, and it’s exciting to see who builds what,” said Mizrahi.
Content providers can join Linkup’s network by applying on its website.

