As model architectures and compute become commoditized, the frontier of machine learning advancement shifts to data. Yet the critical process of data engineering remains a manual, iterative, and often ad-hoc endeavor. This paper introduces task-conditioned autonomous data engineering: a paradigm in which a system autonomously optimizes the data side of the ML pipeline (from discovery and selection to cleaning and transformation) without altering the core learning algorithm. The approach aims to yield superior downstream solutions by treating data as a dynamic, optimizable component. To tackle the inherent challenges of this domain, namely open-ended search, complex dependencies, and delayed validation, the authors propose DataMaster, a novel agent framework. Published on arXiv, this work details how DataMaster integrates a tree-structured search, a shared candidate data pool, and cumulative memory to navigate the complexities of autonomous data engineering.
DataMaster: A Framework for Intelligent Data Curation
DataMaster is built around three core components: a DataTree that organizes and explores diverse data-engineering pathways; a shared Data Pool that centralizes discovered external data sources for efficient reuse across branches of exploration; and a Global Memory that records node outcomes, generated artifacts, and reusable insights. Together, these components enable the autonomous agent to discover candidate data, construct executable training inputs, and leverage downstream feedback for continuous improvement. By carrying evidence across branches, DataMaster avoids redundant effort and accelerates the search for optimal data configurations.
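To make the interaction of the three components concrete, here is a minimal sketch of the search loop in Python. All class and function names (`Node`, `DataPool`, `GlobalMemory`, `expand`) are illustrative assumptions, not the paper's actual API, and the downstream evaluation is stubbed with fixed scores where a real system would train and validate a model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of DataMaster's loop: a DataTree of data-engineering
# actions, a shared DataPool of discovered sources, and a GlobalMemory of
# node outcomes carried across branches. Names are illustrative only.

@dataclass
class Node:
    action: str                  # e.g. "dedupe", "merge:external_weather"
    score: float = 0.0           # downstream validation feedback
    children: list = field(default_factory=list)

@dataclass
class DataPool:
    sources: dict = field(default_factory=dict)  # name -> dataset handle

    def add(self, name, handle):
        # Register once; every tree branch can reuse the same source.
        self.sources.setdefault(name, handle)

@dataclass
class GlobalMemory:
    records: list = field(default_factory=list)  # (action, score) pairs

    def best(self):
        # Recall the highest-scoring outcome seen anywhere in the tree.
        return max(self.records, key=lambda r: r[1], default=None)

def expand(node, actions, evaluate, pool, memory):
    """Expand one tree node: try each candidate data action, score it
    with downstream feedback, and record the outcome for reuse."""
    for action in actions:
        child = Node(action=action, score=evaluate(action, pool))
        node.children.append(child)
        memory.records.append((action, child.score))
    return max(node.children, key=lambda n: n.score)

# Toy run with stubbed validation scores.
pool = DataPool()
pool.add("external_weather", object())
memory = GlobalMemory()
root = Node(action="root")
stub_scores = {"dedupe": 0.71, "merge:external_weather": 0.78, "drop_nulls": 0.69}
best = expand(root, list(stub_scores), lambda a, _: stub_scores[a], pool, memory)
print(best.action)          # best child of this expansion
print(memory.best())        # best (action, score) recorded globally
```

In this toy run, the `merge:external_weather` branch wins because the pooled external source improves the stubbed validation score, illustrating how shared evidence steers later expansions.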