Anna Marie Benzon, a PhD candidate in Artificial Intelligence from the University of the Philippines Diliman, presented a novel approach to automating the detection and remediation of ETL pipeline failures. Her work introduces a Reinforcement Learning (RL) agent designed to make data processing workflows more reliable and efficient. Traditional approaches rely on extensive manual log inspection, schema tracing, and delayed dashboards, which leads to a Mean Time To Recovery (MTTR) of approximately 2.5 working days. Benzon's solution aims to replace this reactive debugging process with intelligent recovery, making routine ETL failures diagnosable, explainable, and recoverable in minutes.
The Problem: The Cost of ETL Failures
Cloud ETL jobs frequently break for a variety of reasons, including late or unavailable source data, schema drift, datetime parsing incompatibilities, null-rate spikes, type changes, and unknown runtime errors. Identifying, diagnosing, and fixing these failures by hand is time-consuming and resource-intensive. Benzon highlighted that the manual workflow typically runs through the same cycle each time: the job fails, an engineer inspects the logs, diagnoses the cause, applies a repair, reruns the job, and validates the output. The core challenge is not just detecting a failure, but detecting it effectively and remediating it within operational boundaries that a human team can trust.
The Solution: An End-to-End RL Pipeline Health Agent
Benzon detailed the architecture of an end-to-end RL pipeline health agent. The system monitors ETL jobs, diagnoses failures, scores their operational risk, decides on an appropriate action, checks that the action is safe, acts to remediate, and validates the outcome. The architecture includes AWS CloudWatch Logs, Amazon EventBridge, AWS Lambda, a data catalog, and the RL agent itself, which interact with services such as AWS Glue and Amazon S3. The RL agent receives state information such as failure category, risk level, retry count, drift severity, and data quality conditions. Based on this state, it selects one action from a set that includes retry, coerce schema, rollback, quarantine, escalate, and log. The agent uses tabular Q-learning over a small, interpretable state space, with low-memory inference and inspectable Q-values for each decision, which makes it a practical and understandable solution.
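To make the decision step concrete, the following is a minimal sketch of a tabular Q-learning agent over the state fields and actions named above. It is not Benzon's implementation. The state encoding, the hyperparameters, the `safe_action` gate, and the `toy_reward` function are all assumptions introduced for illustration; only the action names and the state fields come from the talk.

```python
import random
from collections import defaultdict

# Action set as listed in the talk summary.
ACTIONS = ["retry", "coerce_schema", "rollback", "quarantine", "escalate", "log"]


class TabularQAgent:
    """Tabular Q-learning agent; hyperparameter defaults are illustrative."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)
        # One row of Q-values per state, created lazily and initialised to 0.
        self.q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

    def choose(self, state, explore=True):
        """Epsilon-greedy selection; greedy only when explore is False."""
        if explore and self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        values = self.q[state]
        return max(ACTIONS, key=values.get)

    def update(self, state, action, reward, next_state=None):
        """Standard Q-learning update; next_state=None marks a terminal step."""
        future = 0.0 if next_state is None else max(self.q[next_state].values())
        target = reward + self.gamma * future
        self.q[state][action] += self.alpha * (target - self.q[state][action])

    def explain(self, state):
        """Return the Q-values behind a decision so it can be inspected."""
        return dict(self.q[state])


def safe_action(state, action):
    """Hypothetical safety gate: at high risk, disruptive actions go to a human."""
    risk_level = state[1]
    if risk_level == "high" and action in {"coerce_schema", "rollback"}:
        return "escalate"
    return action


def toy_reward(state, action):
    """Made-up reward for this one toy state: only schema coercion helps."""
    return 1.0 if action == "coerce_schema" else -1.0


# Assumed state layout:
# (failure_category, risk_level, retry_count, drift_severity, data_quality)
state = ("schema_drift", "low", 0, "minor", "ok")
agent = TabularQAgent()
for _ in range(500):  # single-step episodes on the toy state
    action = agent.choose(state)
    agent.update(state, action, toy_reward(state, action))

best = agent.choose(state, explore=False)
print(best, agent.explain(state))
```

In this toy setup the greedy choice converges to `coerce_schema`, and `explain` exposes the full row of Q-values for the state, which is the kind of inspectability the tabular approach offers. A real reward signal would presumably come from the validation step after remediation, which the sketch does not model.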
