Red Hat's Cedric Clyburn on Structuring Unstructured Data for AI

Red Hat's Cedric Clyburn discusses the challenges of unstructured data for AI and introduces Docling, an open-source solution for advanced document parsing.

Jun 28 at 5:02 PM · 7 min read

Image: presentation slide titled 'Structuring the Unstructured: Advanced Document Parsing for AI Workflows' with speaker information (credit: AI Engineer).

Processing and structuring unstructured data has become a central problem for AI applications. Cedric Clyburn, a Senior Developer Advocate at Red Hat, recently addressed this challenge and introduced Docling, an open-source tool for document parsing. In his presentation, "Structuring the Unstructured: Advanced Document Parsing for AI Workflows," Clyburn described how pervasive unstructured data is, outlined the limitations of current tools, and argued for a more robust and efficient approach.

TL;DR

Unstructured Data Volume: 85% of world's data is unstructured, needs AI processing
Current Solutions Limited: Existing tools struggle to transform data for LLMs
Introducing Docling: Open-source solution for advanced document parsing
Docling's Capabilities: Demonstrates advanced parsing for AI workflows
Docling Ecosystem: Building a community around document parsing
AI Data Structuring: Enables LLMs to understand and utilize diverse data
Future AI Workflows: More efficient and robust AI processing

The Challenge of Unstructured Data

Clyburn began by emphasizing the sheer volume of unstructured data, stating that 85% of the world's data is unstructured. This data, ranging from PDFs and presentations to contracts and technical documents, needs to be transformed into a format that Large Language Models (LLMs) can readily understand and use. He pointed out that although AI applications and agents are popular for their ability to extract value from data, they often struggle with the raw, unstructured formats that dominate enterprise data.

Limitations of Existing Solutions

The presentation touched upon the shortcomings of current document parsing methods. Simple PDF parsers, while fast and cheap, often fail to capture the nuances of document structure, leading to incomplete or jumbled output. Tables become unreadable, images disappear entirely, and the overall document structure is destroyed. On the other hand, more powerful frontier models, while offering better quality and robustness, come with significant costs and can sometimes suffer from hallucinations, making their output less reliable and consistent.

Introducing Docling: A Middle Ground

To address these challenges, Clyburn introduced Docling, an open-source document processor that uses vision models and OCR to prepare documents for generative AI. Docling aims to strike a balance by being cost-effective at scale, producing a consistent representation, and maintaining high quality. It can parse multiple document formats, including PDF, DOCX, XLSX, HTML, and images, and it understands page layout, reading order, table structure, code, formulas, and image classification.
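As a rough illustration of what a conversion call can look like, the sketch below wraps Docling's `DocumentConverter` in a small helper. The suffix list is an assumption derived from the formats named above (the image suffixes in particular are illustrative), and `is_supported` and `convert_to_markdown` are hypothetical helper names, not part of Docling. The Docling import is deferred so the format check runs even where the library is not installed.

```python
from pathlib import Path

# Assumption: suffixes derived from the formats named in the article.
# The image suffixes are illustrative, not an official Docling list.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".html", ".png", ".jpg", ".jpeg"}


def is_supported(path: str) -> bool:
    """Return True if the file suffix is one of the formats listed above."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def convert_to_markdown(source: str) -> str:
    """Convert one document to Markdown with Docling (needs `pip install docling`)."""
    if not is_supported(source):
        raise ValueError(f"Unsupported format: {source}")
    # Imported lazily so the format check above works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown("report.pdf")` on a local file would return the document as a Markdown string; the file name here is only a placeholder.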

A key advantage highlighted was Docling's ability to run locally, making it suitable for sensitive data and air-gapped environments, a critical consideration for many enterprises. Furthermore, its plug-and-play design allows it to feed common AI workflows and components such as RAG pipelines, agentic systems, dataset generation, and vector databases.
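To make the RAG and vector database connection concrete, here is a minimal sketch of one common preparation step: splitting parsed Markdown into heading-scoped chunks that could then be embedded and indexed. This is a generic, standard-library illustration and not Docling's own chunking API; `Chunk` and `chunk_markdown` are hypothetical names, and the sample text is made up.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text, "" if none
    text: str     # body text belonging to that heading


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading section, skipping empty sections."""
    chunks: list[Chunk] = []
    heading = ""
    lines: list[str] = []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            # A new heading closes the previous section.
            body = "\n".join(lines).strip()
            if body:
                chunks.append(Chunk(heading, body))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    # Flush the final section.
    body = "\n".join(lines).strip()
    if body:
        chunks.append(Chunk(heading, body))
    return chunks


# Made-up sample standing in for parser output.
sample = "# Contract\n\nParties and dates.\n\n## Fees\n\nPayment is due in 30 days.\n"
for chunk in chunk_markdown(sample):
    print(f"[{chunk.heading}] {chunk.text}")
```

Keeping the heading alongside each chunk preserves some of the document structure that a retrieval step can later use as context.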

Demonstrating Docling's Capabilities

Clyburn showcased Docling's capabilities through several demonstrations. He illustrated how a simple PDF parser struggles with complex layouts, losing table structure and image content. In contrast, he demonstrated how Docling successfully extracts tables, images, and text, converting them into structured formats like Markdown or JSON. The presentation also highlighted Docling's ability to extract image captions and descriptions, enriching the data for LLM consumption.
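The difference between the two kinds of output can be sketched with a toy table. The code below does not call Docling; it uses a hypothetical, made-up table to contrast a flattened cell stream, similar in spirit to what a naive extractor produces, with Markdown and JSON renderings that keep rows and columns intact.

```python
import json

# Hypothetical table as a layout-aware parser might recover it:
# one header row plus data rows. The values are made up.
header = ["Region", "Q1", "Q2"]
rows = [["EMEA", "120", "135"], ["APAC", "98", "110"]]


def flatten(header, rows):
    """Mimic a naive extractor: one cell per line, with row and column links lost."""
    return "\n".join(cell for row in [header, *rows] for cell in row)


def to_markdown(header, rows):
    """Render the table as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_json(header, rows):
    """Render the table as a JSON array of records keyed by column name."""
    return json.dumps([dict(zip(header, row)) for row in rows], indent=2)


print(flatten(header, rows))
print(to_markdown(header, rows))
print(to_json(header, rows))
```

In the flattened form, nothing tells a downstream model that "135" belongs to EMEA and Q2; the Markdown and JSON forms carry that association explicitly.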

The comparison of Docling with other methods, particularly in terms of cost-effectiveness, was striking. Clyburn presented data suggesting Docling is 50 times more cost-effective than Vision-Language Models (VLMs) for certain tasks, emphasizing its efficiency for large-scale document processing.
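To see what a 50x ratio means at scale, the following back-of-the-envelope sketch assumes that "cost-effective" translates directly into per-page cost. The page count and the per-page VLM price are hypothetical inputs chosen for illustration, not figures from the talk, and `cost_comparison` is a hypothetical helper.

```python
def cost_comparison(pages: int, vlm_cost_per_page: float, ratio: float = 50.0) -> dict:
    """Compare total cost, assuming the cheaper pipeline costs 1/ratio as much per page."""
    vlm_total = pages * vlm_cost_per_page
    pipeline_total = vlm_total / ratio
    return {
        "vlm": vlm_total,
        "pipeline": pipeline_total,
        "savings": vlm_total - pipeline_total,
    }


# Hypothetical inputs: 1,000,000 pages at $0.01 per page through a VLM.
totals = cost_comparison(1_000_000, 0.01)
for name, value in totals.items():
    print(f"{name}: ${value:,.2f}")
```

Under these assumed inputs the gap grows linearly with page count, which is why the ratio matters most for large document collections.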

The Docling Ecosystem and Future

The session concluded by emphasizing Docling's role as a tool that empowers AI agents by providing them with the ability to discover, access, and utilize document data. By leveraging the Model Context Protocol (MCP), Docling enables LLMs to interact with document tools autonomously, facilitating more sophisticated AI workflows. Clyburn encouraged the audience to explore the Docling project further, highlighting its open-source nature and the vibrant community around it.
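MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and arguments. The sketch below builds such a request and routes it through a toy in-process dispatcher. The tool name `convert_document`, its `source` argument, and the stub implementation are hypothetical placeholders and are not taken from Docling's MCP server; a real setup would use an MCP client and server over a transport.

```python
import json
from itertools import count

_ids = count(1)  # simple request id generator


def tools_call_request(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def handle(request: dict, tools: dict) -> dict:
    """Toy dispatcher: look up the named tool and wrap its text output in a result."""
    params = request["params"]
    tool = tools.get(params["name"])
    if tool is None:
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": -32602, "message": f"Unknown tool: {params['name']}"},
        }
    text = tool(**params["arguments"])
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": text}], "isError": False},
    }


def convert_document(source: str) -> str:
    """Hypothetical stub standing in for a real document-conversion tool."""
    return f"# Converted from {source}\n\n(placeholder Markdown)"


TOOLS = {"convert_document": convert_document}

request = tools_call_request("convert_document", {"source": "report.pdf"})
response = handle(request, TOOLS)
print(json.dumps(response, indent=2))
```

The point of the protocol is that the model only needs the tool's name and argument schema to issue the call; the parsing itself stays behind the server boundary.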

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
