Red Hat's Cedric Clyburn on Structuring Unstructured Data for AI

Red Hat's Cedric Clyburn discusses the challenges of unstructured data for AI and introduces Docling, an open-source solution for advanced document parsing.

AI Engineer

Processing and structuring unstructured data is a core challenge for AI applications. Cedric Clyburn, a Senior Developer Advocate at Red Hat, recently addressed this challenge and introduced an open-source solution called Docling. In his presentation, "Structuring the Unstructured: Advanced Document Parsing for AI Workflows," Clyburn described how pervasive unstructured data is, outlined the limitations of current tools, and argued for a more robust and efficient approach.

Red Hat's Cedric Clyburn on Structuring Unstructured Data for AI — from AI Engineer

Visual TL;DR: The volume of unstructured data exposes the limits of current solutions, which motivates Docling. Docling's capabilities enable AI data structuring, the Docling ecosystem grows around the project, and structured data in turn leads to future AI workflows.

  1. Unstructured Data Volume: 85% of the world's data is unstructured and needs AI processing
  2. Current Solutions Limited: Existing tools struggle to transform data for LLMs
  3. Introducing Docling: Open-source solution for advanced document parsing
  4. Docling's Capabilities: Demonstrates advanced parsing for AI workflows
  5. Docling Ecosystem: Building a community around document parsing
  6. AI Data Structuring: Enables LLMs to understand and utilize diverse data
  7. Future AI Workflows: More efficient and robust AI processing

The Challenge of Unstructured Data

Clyburn began by emphasizing the sheer volume of unstructured data, stating that 85% of the world's data is unstructured. This data, ranging from PDFs and presentations to contracts and technical documents, needs to be transformed into a format that Large Language Models (LLMs) can readily understand and use. He pointed out that while AI applications and agents are popular for their ability to extract value from data, they often struggle with the raw, unstructured formats that dominate enterprise data.

Limitations of Existing Solutions

The presentation touched upon the shortcomings of current document parsing methods. Simple PDF parsers, while fast and cheap, often fail to capture the nuances of document structure, leading to incomplete or jumbled output. Tables become unreadable, images disappear entirely, and the overall document structure is destroyed. On the other hand, more powerful frontier models, while offering better quality and robustness, come with significant costs and can sometimes suffer from hallucinations, making their output less reliable and consistent.
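The failure mode of simple parsers can be illustrated with a small sketch. It does not use any real PDF library: the table contents and both helper functions are hypothetical, and `naive_flatten` only mimics what a text-only extractor tends to produce.

```python
# Hypothetical table, as a layout-aware parser might recover it: rows of cells.
table = [
    ["Region", "Q1 revenue", "Q2 revenue"],
    ["North America", "1.2M", "1.4M"],
    ["Europe", "", "0.9M"],  # empty cell: the Q1 value is missing
]


def naive_flatten(rows):
    """Mimic a text-only extractor: emit cell text in order, dropping layout and empty cells."""
    return " ".join(cell for row in rows for cell in row if cell)


def to_markdown(rows):
    """Render the rows as a Markdown table, keeping every cell position."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


flat = naive_flatten(table)
structured = to_markdown(table)
print(flat)
print(structured)
```

In the flattened string, nothing indicates whether Europe's "0.9M" belongs to Q1 or Q2, while the Markdown version keeps the empty Q1 cell in place. That lost column alignment is the kind of structure the talk describes simple parsers destroying.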

Introducing Docling: A Middle Ground

To address these challenges, Clyburn introduced Docling, an open-source document processor that uses advanced vision models and OCR to prepare documents for generative AI. Docling aims to strike a balance by being cost-effective at scale, offering consistent representation, and maintaining high quality. It can parse multiple document formats, including PDF, DOCX, XLSX, HTML, and images, and provides advanced understanding of page layout, reading order, table structure, code, formulas, and image classification.
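As a rough illustration of what this looks like in practice, the sketch below converts a document to Markdown with Docling's Python package. It assumes `docling` is installed. The helper names, the extension set, and the choice of image extensions are assumptions made for this example and are not taken from the talk.

```python
from pathlib import Path

# Extensions for the formats named in the talk. The image extensions are an
# assumption; Docling's own list of supported inputs may differ.
FORMATS_NAMED_IN_TALK = {".pdf", ".docx", ".xlsx", ".html", ".png", ".jpg", ".jpeg"}


def is_named_format(path: str) -> bool:
    """Return True if the file extension matches a format named in the talk."""
    return Path(path).suffix.lower() in FORMATS_NAMED_IN_TALK


def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL to Markdown using Docling."""
    # Imported lazily so the helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

A caller might check `is_named_format("report.pdf")` first and then pass the same path to `convert_to_markdown`. The first conversion can be slow, because Docling may need to download its models.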

A key advantage highlighted was Docling's ability to run locally, making it suitable for sensitive data and air-gapped environments, a critical consideration for many enterprises. Furthermore, its plug-and-play nature allows for integration with common AI workflows and components such as retrieval-augmented generation (RAG), agentic systems, dataset generation, and vector databases.
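To make the RAG integration concrete, here is a minimal sketch of one step that usually sits between parsing and a vector database: splitting parsed Markdown into chunks by heading. It is a simplified stand-in written for this article, not Docling's own chunking, and it ignores edge cases such as `#` characters inside code fences.

```python
def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading as metadata."""
    chunks = []
    heading, lines = "", []

    def flush():
        # Only emit a chunk if the section has some non-blank content.
        if any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = "# Report\n\nSummary text.\n\n## Results\n\n| a | b |\n| --- | --- |\n| 1 | 2 |\n"
chunks = chunk_by_heading(sample)
```

Each chunk could then be embedded and stored with its heading as metadata. Because the table stays intact inside its own chunk, a retriever returns it as a unit, not as scattered cell values.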

Demonstrating Docling's Capabilities

Clyburn showcased Docling's capabilities through several demonstrations. He illustrated how a simple PDF parser struggles with complex layouts, losing table structure and image content. In contrast, he demonstrated how Docling successfully extracts tables, images, and text, converting them into structured formats like Markdown or JSON. The presentation also highlighted Docling's ability to extract image captions and descriptions, enriching the data for LLM consumption.
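The structured output described here can be inspected programmatically. The sketch below assumes a Docling document object that exposes `texts`, `tables`, and `pictures` collections and an `export_to_dict()` method whose result is JSON-serializable. The stub at the end is a stand-in so the example runs without Docling; in real use the object would come from a converter's `result.document`.

```python
import json
from types import SimpleNamespace


def summarize(document) -> dict:
    """Count the text, table, and picture items found in a parsed document."""
    return {
        "texts": len(document.texts),
        "tables": len(document.tables),
        "pictures": len(document.pictures),
    }


def export_json(document) -> str:
    """Serialize the document's dictionary export as indented JSON."""
    return json.dumps(document.export_to_dict(), indent=2)


# Stand-in object for illustration only; not a real Docling document.
stub = SimpleNamespace(
    texts=["Intro", "Results"],
    tables=["table-1"],
    pictures=[],
    export_to_dict=lambda: {"name": "stub"},
)
summary = summarize(stub)
as_json = export_json(stub)
```

A count like this is a quick sanity check after conversion: if a PDF visibly contains tables and `summary["tables"]` is zero, the parse likely lost structure.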

The comparison of Docling with other methods, particularly in terms of cost-effectiveness, was striking. Clyburn presented data suggesting Docling is 50 times more cost-effective than Vision-Language Models (VLMs) for certain tasks, emphasizing its efficiency for large-scale document processing.
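A back-of-the-envelope calculation shows what a 50x difference means at scale. The per-page price and page count below are hypothetical, not figures from the talk, and the sketch reads "50 times more cost-effective" as "one fiftieth of the cost for the same pages," which is one plausible interpretation.

```python
COST_RATIO = 50            # ratio cited in the talk
VLM_CENTS_PER_PAGE = 1     # hypothetical price, for illustration only
PAGES = 1_000_000          # hypothetical corpus size


def total_cents(pages: int, cents_per_page: int) -> int:
    """Total processing cost in cents for a batch of pages."""
    return pages * cents_per_page


vlm_cents = total_cents(PAGES, VLM_CENTS_PER_PAGE)
docling_cents = vlm_cents // COST_RATIO

print(f"VLM:     ${vlm_cents / 100:,.2f}")
print(f"Docling: ${docling_cents / 100:,.2f}")
```

Under these assumed numbers, a job costing $10,000 with a VLM would cost $200, and the gap widens linearly as the corpus grows.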

The Docling Ecosystem and Future

The session concluded by emphasizing Docling's role as a tool that empowers AI agents by providing them with the ability to discover, access, and utilize document data. By leveraging the Model Context Protocol (MCP), Docling enables LLMs to interact with document tools autonomously, facilitating more sophisticated AI workflows. Clyburn encouraged the audience to explore the Docling project further, highlighting its open-source nature and the vibrant community around it.
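For readers unfamiliar with MCP, the sketch below builds the kind of message a client sends when an LLM decides to call a tool. MCP uses JSON-RPC 2.0, and tool invocations use the `tools/call` method. The tool name `convert_document` and its `source` argument are hypothetical placeholders; the actual tools exposed by a Docling MCP server may be named differently.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request that invokes an MCP tool."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool name and argument, for illustration only.
message = build_tool_call(1, "convert_document", {"source": "report.pdf"})
print(message)
```

In a real session the client would first ask the server which tools it offers, then send a request like this over the configured transport and read the tool's result from the response.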

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.