As model architectures and compute become commoditized, the frontier of machine learning advancement shifts to data. Yet the critical process of data engineering remains a manual, iterative, and often ad-hoc endeavor. This paper introduces task-conditioned autonomous data engineering: a paradigm in which a system autonomously optimizes the data side of the ML pipeline (from discovery and selection to cleaning and transformation) without altering the core learning algorithm. The approach aims to yield superior downstream solutions by treating data as a dynamic, optimizable component. To tackle the inherent challenges of this domain, namely open-ended search, complex dependencies, and delayed validation, the authors propose DataMaster, a novel agent framework. Published on arXiv, this work details how DataMaster integrates a tree-structured search, a shared candidate data pool, and cumulative memory to navigate the complexities of autonomous data engineering.
DataMaster: A Framework for Intelligent Data Curation
DataMaster is built around three core components: a DataTree that organizes and explores diverse data-engineering pathways; a shared Data Pool that centralizes discovered external data sources for efficient reuse across branches of exploration; and a Global Memory that records node outcomes, generated artifacts, and reusable insights. Together, these components enable the autonomous agent to discover candidate data, construct executable training inputs, and leverage downstream feedback for continuous improvement. By carrying evidence across branches, DataMaster avoids redundant effort and accelerates the search for optimal data configurations.
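To make the interaction of the three components concrete, here is a minimal sketch of the search loop in Python. All class and function names (`Node`, `DataPool`, `GlobalMemory`, `expand`) are illustrative assumptions, not the paper's actual API, and the downstream evaluation is stubbed with fixed scores where a real system would train and validate a model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of DataMaster's loop: a DataTree of data-engineering
# actions, a shared DataPool of discovered sources, and a GlobalMemory of
# node outcomes carried across branches. Names are illustrative only.

@dataclass
class Node:
    action: str                  # e.g. "dedupe", "merge:external_weather"
    score: float = 0.0           # downstream validation feedback
    children: list = field(default_factory=list)

@dataclass
class DataPool:
    sources: dict = field(default_factory=dict)  # name -> dataset handle

    def add(self, name, handle):
        # Register once; every tree branch can reuse the same source.
        self.sources.setdefault(name, handle)

@dataclass
class GlobalMemory:
    records: list = field(default_factory=list)  # (action, score) pairs

    def best(self):
        # Recall the highest-scoring outcome seen anywhere in the tree.
        return max(self.records, key=lambda r: r[1], default=None)

def expand(node, actions, evaluate, pool, memory):
    """Expand one tree node: try each candidate data action, score it
    with downstream feedback, and record the outcome for reuse."""
    for action in actions:
        child = Node(action=action, score=evaluate(action, pool))
        node.children.append(child)
        memory.records.append((action, child.score))
    return max(node.children, key=lambda n: n.score)

# Toy run with stubbed validation scores.
pool = DataPool()
pool.add("external_weather", object())
memory = GlobalMemory()
root = Node(action="root")
stub_scores = {"dedupe": 0.71, "merge:external_weather": 0.78, "drop_nulls": 0.69}
best = expand(root, list(stub_scores), lambda a, _: stub_scores[a], pool, memory)
print(best.action)          # best child of this expansion
print(memory.best())        # best (action, score) recorded globally
```

In this toy run, the `merge:external_weather` branch wins because the pooled external source improves the stubbed validation score, illustrating how shared evidence steers later expansions.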