AI Archives: Water Data Gets Searchable

Databricks uses multimodal AI to turn Sudan's scanned water archives into a searchable database for critical groundwater discovery.

May 11 at 9:01 PM · 3 min read

[Image: map with water droplet icons and AI nodes connecting data points. Caption: AI is making historical water data accessible for critical discovery.]

Groundwater discovery is a complex challenge, especially in regions like Sudan where communities rely on it for survival. Decades of geological surveys and field reports hold vital data, but remain locked away in unorganized archives. MapAid, a non-profit focused on AI-enhanced mapping for humanitarian aid, partnered with Databricks for Good to unlock this information.

The initiative transformed nearly 700 scanned hydrogeological documents into a searchable database, a crucial step for MapAid's WellMapr app, which guides low-cost well drilling. The project leveraged multimodal AI for document analysis, turning static archives into an actionable search engine.


Visualizing Old Documents

The archive presented significant hurdles: the documents were scans, some decades old, with no embedded text. Pages were skewed, mixed English and Arabic, and included handwritten notes. Traditional OCR could not handle that combination reliably.

The team reframed the problem as visual understanding. Scanned page images were fed directly into multimodal AI models. This approach, detailed on the Databricks blog, allowed the AI to interpret content visually.

Pages were rendered as images and stored in Unity Catalog Volumes. A sampling strategy that focused on key sections of longer documents reduced processing costs by over 70%.
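The article does not say which pages the sampling strategy selects. As a minimal sketch, assume it keeps the opening pages (title page, contents), the last page, and a few evenly spaced interior pages. The function name and the `head`, `tail`, and `middle` parameters below are hypothetical.

```python
def sample_pages(page_count: int, head: int = 3, tail: int = 1, middle: int = 2) -> list[int]:
    """Return sorted 0-based page indices to send to the model.

    Assumed rule (not from the source): keep the first `head` pages,
    the last `tail` pages, and `middle` evenly spaced interior pages.
    Short documents are kept whole.
    """
    if page_count <= head + tail + middle:
        return list(range(page_count))

    picked = set(range(head))
    picked.update(range(page_count - tail, page_count))

    # Spread the remaining picks evenly across the interior of the document.
    interior_start = head
    span = (page_count - tail) - interior_start
    for i in range(1, middle + 1):
        picked.add(interior_start + (i * span) // (middle + 1))

    return sorted(picked)


if __name__ == "__main__":
    pages = sample_pages(30)
    print(pages)                      # [0, 1, 2, 11, 20, 29]
    print(1 - len(pages) / 30)        # fraction of pages skipped: 0.8
```

Under these assumed defaults a 30-page document sends 6 pages to the model, while documents of 6 pages or fewer are processed in full.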

Databricks AI Functions were used to analyze each sampled page. The model identified Dewey Decimal codes, extracted the Sudanese geographies referenced on the page, and flagged pages relevant to water resources.

The result was a structured, searchable catalog in which each document is tagged by subject and location.
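How page-level results roll up into one catalog entry per document is not described in the article. One plausible sketch: take the most frequent Dewey code, the union of places mentioned, and mark the document water-relevant if any sampled page was flagged. The `PageResult` and `CatalogEntry` types, their fields, and the sample values are illustrative assumptions, not the project's schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    """Hypothetical per-page output of the classification step."""
    dewey_code: Optional[str]
    places: list[str]
    water_relevant: bool


@dataclass
class CatalogEntry:
    """Hypothetical document-level catalog row."""
    doc_id: str
    dewey_code: Optional[str]
    places: list[str]
    water_relevant: bool


def build_entry(doc_id: str, pages: list[PageResult]) -> CatalogEntry:
    # Most frequent non-empty Dewey code across the sampled pages.
    codes = Counter(p.dewey_code for p in pages if p.dewey_code)
    top_code = codes.most_common(1)[0][0] if codes else None
    # Union of every place named on any sampled page, sorted for stable output.
    places = sorted({place for p in pages for place in p.places})
    # A single flagged page is enough to send the document to full extraction.
    relevant = any(p.water_relevant for p in pages)
    return CatalogEntry(doc_id, top_code, places, relevant)


if __name__ == "__main__":
    sampled = [
        PageResult("551.49", ["Darfur"], True),
        PageResult("551.49", ["Kordofan", "Darfur"], False),
        PageResult(None, [], False),
    ]
    print(build_entry("doc-001", sampled))
```

Treating any flagged page as sufficient errs on the side of recall: a document wrongly marked water-relevant costs extra processing, while a missed one loses well data.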

Extracting Structured Well Data

For water-relevant documents, the pipeline processed every page to extract structured well and borehole records. This data is critical for MapAid's groundwater prediction models.

OCR was performed using a multimodal model capable of handling English, Arabic, complex layouts, and even handwritten notes. Entity recognition identified well identifiers to link records across multiple pages.
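Linking records across pages depends on recognizing that differently written identifiers refer to the same well. The sketch below assumes a simple normalization (uppercase, strip everything except letters and digits); the article does not describe the actual matching rule, and the function names and sample identifiers are hypothetical.

```python
import re
from collections import defaultdict


def normalize_well_id(raw: str) -> str:
    """Collapse spelling variants such as 'BH-12', 'bh 12', 'BH12' to 'BH12'."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def link_pages(mentions: list[tuple[int, str]]) -> dict[str, list[int]]:
    """Group (page_number, raw_identifier) mentions by normalized well ID."""
    linked: dict[str, list[int]] = defaultdict(list)
    for page, raw in mentions:
        key = normalize_well_id(raw)
        # Skip identifiers that normalize to nothing, and avoid duplicate pages.
        if key and page not in linked[key]:
            linked[key].append(page)
    return {key: sorted(pages) for key, pages in linked.items()}


if __name__ == "__main__":
    found = [(3, "BH-12"), (7, "bh 12"), (7, "BH12"), (4, "W/5"), (9, "--")]
    print(link_pages(found))  # {'BH12': [3, 7], 'W5': [4]}
```

Normalization this aggressive can merge identifiers that are genuinely distinct in a given archive, so a real pipeline would likely need rules tuned to the source documents.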

The extracted text was unified, and a second pass extracted structured data like site names, GPS coordinates, drilling depths, and pump test yields. Databricks AI Functions ensured consistent schema enforcement.
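Schema enforcement can be illustrated with a small validator that coerces model output into a fixed record type and blanks out values that fail basic checks. This is a sketch only: the field names, the units (metres, cubic metres per hour), and the validation rules are assumptions, and it does not use or represent the Databricks AI Functions API.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WellRecord:
    """Hypothetical target schema; field names and units are assumed."""
    site_name: str
    latitude: Optional[float]
    longitude: Optional[float]
    depth_m: Optional[float]
    yield_m3_per_h: Optional[float]


def _to_float(value) -> Optional[float]:
    """Return a finite float, or None when the value cannot be parsed."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def _positive(value) -> Optional[float]:
    number = _to_float(value)
    return number if number is not None and number > 0 else None


def parse_record(raw: dict) -> Optional[WellRecord]:
    """Coerce one extracted dict into a WellRecord, or None if it has no site name."""
    name = str(raw.get("site_name") or "").strip()
    if not name:
        return None
    lat = _to_float(raw.get("latitude"))
    lon = _to_float(raw.get("longitude"))
    # Keep coordinates only as a valid pair; a lone or out-of-range value is dropped.
    if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
        lat = lon = None
    return WellRecord(name, lat, lon, _positive(raw.get("depth_m")),
                      _positive(raw.get("yield_m3_per_h")))


if __name__ == "__main__":
    # Made-up input for illustration; not data from the archive.
    print(parse_record({"site_name": " Example Site ", "latitude": "13.2",
                        "longitude": 30.2, "depth_m": "85", "yield_m3_per_h": "n/a"}))
```

Unparseable values become `None` rather than raising, so one bad field does not discard an otherwise usable record.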

The result is a dataset ready for direct integration into MapAid's WellMapr app.

Automated Quality Control

Manually validating thousands of classifications would be prohibitively slow, so the pipeline incorporated automated quality evaluation as a core stage.

A separate AI model scored each classification on accuracy, completeness, and consistency. It compared assigned tags against page content, providing a categorical rating and a written justification.

Documents below a confidence threshold were flagged for manual review, so human effort went only where the evaluator had doubts. In the initial run, only a small fraction of documents needed human attention.
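The review triage can be sketched as a threshold over the evaluator's categorical rating. The rating labels, their ordering, and the default threshold here are hypothetical; the article states only that a categorical rating and a written justification were produced.

```python
from dataclasses import dataclass

# Hypothetical rating scale, ordered from worst to best.
RATING_ORDER = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}


@dataclass
class Evaluation:
    doc_id: str
    rating: str
    justification: str


def needs_review(evaluation: Evaluation, minimum: str = "good") -> bool:
    """Flag a document when its rating is below `minimum` or is not a known label."""
    rank = RATING_ORDER.get(evaluation.rating.strip().lower())
    return rank is None or rank < RATING_ORDER[minimum]


def review_queue(evaluations: list[Evaluation], minimum: str = "good") -> list[str]:
    """Return the IDs of documents a human should check, in input order."""
    return [e.doc_id for e in evaluations if needs_review(e, minimum)]


if __name__ == "__main__":
    results = [
        Evaluation("doc-001", "excellent", "Tags match the page content."),
        Evaluation("doc-002", "fair", "Location tag not supported by the text."),
        Evaluation("doc-003", "unclear", "Evaluator returned an unexpected label."),
    ]
    print(review_queue(results))  # ['doc-002', 'doc-003']
```

Unknown labels are routed to review rather than silently accepted, which keeps malformed evaluator output from passing as a high rating.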

A Unified Solution on Databricks

The entire process, from file storage to AI inference and governance, was managed within the Databricks platform. Raw files were stored in Unity Catalog Volumes, and outputs were written to Delta Lake tables.

The pipeline runs on serverless compute, with costs tied to consumption. The system is packaged as a Databricks Asset Bundle, allowing single-command deployment and updates.

This self-contained solution simplifies maintenance, especially for organizations without extensive cloud expertise.

The methodology is designed to be adaptable to other water archives, regions, or domains dealing with unstructured document analysis.

The initial run classified 654 documents (5,570 pages) in under three hours, with 95% of classifications rated highly by the automated evaluator. Approximately 50% of the archive contained water data, yielding 299 structured well and borehole records.
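For a sense of scale, the reported figures imply the following averages. The calculation uses only the numbers above and treats "under three hours" as an upper bound of 180 minutes, so the throughput values are lower bounds.

```python
documents = 654
pages = 5570
max_minutes = 3 * 60  # "under three hours", taken as an upper bound

pages_per_document = pages / documents
min_pages_per_minute = pages / max_minutes
min_documents_per_hour = documents / 3

print(f"{pages_per_document:.1f} pages per document on average")   # 8.5
print(f"at least {min_pages_per_minute:.1f} pages per minute")     # 30.9
print(f"at least {min_documents_per_hour:.0f} documents per hour") # 218
```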

This significantly reduced the time required for data access, transforming weeks of manual work into hours of automated processing.

The extracted data directly feeds MapAid's groundwater predictions, improving drilling success rates and accelerating water delivery to communities. As new documents are digitized, the pipeline can process them efficiently, keeping the catalog current.

MapAid plans to scale this approach across East Africa, where similar unclassified archives exist. The project shows how AI-based analysis of unstructured documents can serve critical humanitarian needs.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
