AI's Data Problem: More Isn't Always Better

Janusz Marecki, CEO of Fractal Brain, discusses the limitations of current AI models and the shift towards data quality and specialized techniques like synthetic data.

Janusz Marecki, CEO and Founder of Fractal Brain, speaking on a podcast.
Image credit: Merryn Talks Money

In the rapidly evolving world of artificial intelligence, the drive for bigger and better models often centers on the sheer volume of data used for training. However, a recent discussion featuring Janusz Marecki, CEO and Founder of Fractal Brain and an AI Partner at Ahren Innovation Capital, highlighted a critical nuance: the quality and diversity of data are paramount, and simply adding more may not solve AI's persistent challenges.

The Data Dilemma in AI

Merryn Somerset Webb, host of the "Merryn Talks Money" podcast, initiated the conversation by probing Marecki on the current state of AI development. Marecki, an expert with a background in AI research and investment, pointed out a significant hurdle: the phenomenon of 'hallucinations' in AI models. These are instances where models confidently generate incorrect or nonsensical information, a problem closely tied to the data they are trained on.

Marecki elaborated on the common approach of throwing more data at the problem. "We keep pouring money into bigger data centers, knowing that we've used all the data already," he stated, emphasizing the potential futility of this strategy. He likened the current situation to a calculator that is 95-99% accurate. While impressive, the remaining error margin can be critical, leading to the generation of incorrect outputs that are difficult to distinguish from correct ones.


The full discussion can be found on Bloomberg Podcast's YouTube channel.

Why You Should Wait Out AI’s Super-Spending False Start | Merryn Talks Money - Bloomberg Podcast

The Limits of Scale

A key point raised was the concept of 'data ceiling' or 'diminishing returns' in AI training. Marecki explained that while models like GPT-4, released in 2023, were trained on vast amounts of publicly available internet data, the training process for such models, completed in late 2022, essentially exhausted the readily available, high-quality data. The subsequent data generated, often by AI models themselves, can be less reliable and even lead to a degradation in performance, a phenomenon known as 'model collapse'.
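The feedback loop behind model collapse can be illustrated with a toy simulation (not from the podcast): repeatedly fit a simple model, here just a Gaussian's mean and standard deviation, to samples drawn from the previous generation's model. Estimation noise compounds across generations, and the learned distribution drifts away from the original data.

```python
import random
import statistics

# Toy illustration of 'model collapse': each generation is trained
# (fitted) only on samples produced by the previous generation,
# rather than on fresh real-world data.
random.seed(42)
mu, sigma = 0.0, 1.0  # the "real world" data distribution
for generation in range(10):
    # Draw synthetic samples from the current model...
    samples = [random.gauss(mu, sigma) for _ in range(50)]
    # ...and refit the model on those synthetic samples alone.
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    print(f"gen {generation}: mu={mu:.3f}, sigma={sigma:.3f}")
```

Because no generation ever sees the original distribution again, the fitted parameters wander with each refit instead of staying anchored to the true values, a simplified stand-in for the degradation Marecki describes.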

He illustrated this with an analogy: "It's like having access to a calculator that claims to be 99% accurate. But if that 1% error happens when you're trying to solve a critical problem, it's a disaster." Similarly, AI models trained on their own outputs, which may contain subtle errors or biases, can perpetuate and amplify these flaws.
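The calculator analogy can be made concrete with a back-of-envelope calculation (my own, not from the podcast): even a 1% per-query error rate compounds quickly, so over many independent queries at least one error becomes near-certain.

```python
# Probability of at least one error across n independent queries,
# given a per-query accuracy rate (the "99% accurate calculator").
def p_at_least_one_error(accuracy: float, n_queries: int) -> float:
    return 1 - accuracy ** n_queries

# At 99% per-query accuracy, a single query rarely fails, but a
# workflow of hundreds of queries almost certainly contains errors.
for n in (1, 100, 1000):
    print(f"{n} queries: {p_at_least_one_error(0.99, n):.4f}")
```

With 100 queries the chance of at least one error already exceeds 60%, which is why a small residual error rate matters so much when the wrong answers are hard to tell apart from the right ones.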

New Frontiers: Synthetic Data and Retrieval-Augmented Generation

To overcome these limitations, Marecki highlighted emerging techniques. One such approach is the use of synthetic data. Instead of relying solely on existing, potentially flawed data, companies are exploring ways to generate new, high-quality data that is specifically tailored to train AI models more effectively. This involves creating artificial datasets that mimic real-world scenarios but are free from the biases and errors inherent in naturally occurring data.
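One common way to build such datasets, sketched below with entirely illustrative templates (this is a generic technique, not Fractal Brain's specific method), is rule-based generation: question-answer pairs are produced programmatically, so every label is correct by construction rather than scraped from noisy sources.

```python
import random

# Template-based synthetic data generation: each template pairs a
# question pattern with a function that computes the correct answer.
TEMPLATES = [
    ("What is {a} plus {b}?", lambda a, b: str(a + b)),
    ("What is {a} times {b}?", lambda a, b: str(a * b)),
]

def make_synthetic_pairs(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate n (question, answer) training pairs deterministically."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        template, answer_fn = rng.choice(TEMPLATES)
        a, b = rng.randint(1, 99), rng.randint(1, 99)
        pairs.append((template.format(a=a, b=b), answer_fn(a, b)))
    return pairs

for question, answer in make_synthetic_pairs(3):
    print(question, "->", answer)
```

Real pipelines use far richer generators (simulators, formal solvers, or stronger models with verification), but the principle is the same: the label quality is guaranteed by the generating rule, not inherited from whatever happened to be on the internet.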

Another promising technique is retrieval-augmented generation (RAG). This method combines the power of large language models with external knowledge bases. When a model needs to answer a question or generate text, it first retrieves relevant information from a trusted source and then uses that information to formulate a more accurate and grounded response. This approach helps to mitigate hallucinations by grounding the AI's output in verified facts.
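The RAG flow described above can be sketched in a few lines. This is a deliberately minimal toy, with a made-up knowledge base, word-overlap retrieval in place of embeddings, and a template in place of an LLM, but it shows the two-step shape: retrieve first, then answer conditioned on the retrieved text.

```python
# Toy retrieval-augmented generation (RAG) sketch.
KNOWLEDGE_BASE = [
    "GPT-4 was released by OpenAI in 2023.",
    "Retrieval-augmented generation grounds model outputs in external documents.",
    "Model collapse can occur when models train on their own outputs.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(docs, key=lambda d: len(query_words & set(d.lower().split())))

def answer(query: str) -> str:
    context = retrieve(query, KNOWLEDGE_BASE)
    # In a real pipeline, `context` would be inserted into the LLM's
    # prompt so the generated answer is grounded in a trusted source.
    return f"Based on the source: '{context}'"

print(answer("When was GPT-4 released?"))
```

Production systems replace each piece (vector search over embeddings for `retrieve`, a language model for `answer`), but the grounding mechanism, answering from retrieved documents rather than from parametric memory alone, is what curbs hallucinations.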

The Path Forward: Efficiency and Reliability

Marecki stressed that the future of AI development lies not just in scaling up existing models but in creating more efficient, reliable, and understandable systems. This involves finding ways to train models that are more data-efficient, require less computational power, and are less prone to generating erroneous outputs.

He also touched upon the importance of specialized AI models. While general-purpose models like GPT-4 are impressive, there's a growing need for AI systems that are fine-tuned for specific tasks and domains. These specialized models, trained on curated datasets relevant to their intended applications, can often achieve higher accuracy and reliability than their more general counterparts.

The conversation underscored a critical shift in the AI industry: from a focus on sheer scale to a more nuanced approach that prioritizes data quality, model efficiency, and the development of systems that are both powerful and trustworthy. As Marecki aptly put it, "We're moving from a gold rush to a more sustainable, engineering-driven phase of AI development."
