Databricks is tackling the challenge of accurately evaluating AI-generated machine learning code with a new approach leveraging MemAlign, an open-source framework integrated into MLflow. This is crucial for ensuring the quality and reliability of outputs from tools like Databricks' Genie Code, which generates full ML notebooks from natural language prompts.
Evaluating traditional ML notebooks is complex: it requires assessing code quality, adherence to best practices, and how well the code is tailored to the underlying data. Databricks initially created nine AI judges, each trained to score notebooks on one of nine dimensions such as library installation, data imputation, and model training. However, human experts found significant discrepancies: judge scores deviated from human evaluations by up to 0.68 mean absolute error (MAE) on a 3-point scale.
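As a rough illustration of what one such judge could look like (the article does not detail the exact MemAlign or Genie Code configuration), MLflow's built-in make_genai_metric can define an LLM-as-judge metric for a single dimension. The dimension name, rubric wording, and judge model below are illustrative assumptions, not Databricks' actual setup.

```python
# Minimal sketch: one LLM-as-judge metric defined with MLflow's
# make_genai_metric. The dimension, rubric text, and judge model are
# illustrative assumptions, not the actual MemAlign / Genie Code config.
from mlflow.metrics.genai import make_genai_metric

data_imputation_judge = make_genai_metric(
    name="data_imputation",
    definition=(
        "Evaluates whether the generated notebook detects and handles "
        "missing values appropriately for its target dataset."
    ),
    grading_prompt=(
        "Assign 1 if missing values are ignored, 2 if a generic strategy "
        "(e.g., dropping rows) is applied, and 3 if the imputation is "
        "tailored to the columns and their distributions."
    ),
    model="openai:/gpt-4o",  # assumed judge model
    greater_is_better=True,
)
```

A metric defined this way could then be passed to an MLflow evaluation run (for example via the extra_metrics argument of mlflow.evaluate) to score a batch of generated notebooks on that dimension.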
Bridging the Expert Gap
The core issue was misalignment: AI judges and human experts interpreted the scoring rubrics differently. The LLM judges often missed subtle technical nuances and exhibited a positivity bias, rating notebooks more favorably than the experts did, which undermined objective assessment.
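To make that disagreement concrete, the gap between judges and experts can be quantified per dimension as mean absolute error on the 3-point scale. The scores in this sketch are invented placeholders, not Databricks' evaluation data; they simply show how a consistent upward skew in judge scores (the positivity bias described above) shows up as MAE.

```python
# Sketch: quantify judge-human misalignment as mean absolute error (MAE)
# on a 3-point scale. All scores below are invented placeholders.
human_scores = {
    "data_imputation": [3, 2, 1, 3, 2],
    "model_training": [2, 3, 3, 1, 2],
}
# Judge scores skew higher than the human labels -- a positivity bias.
judge_scores = {
    "data_imputation": [3, 3, 2, 3, 3],
    "model_training": [3, 3, 3, 2, 3],
}

def mae(expected, predicted):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)

for dimension in human_scores:
    error = mae(human_scores[dimension], judge_scores[dimension])
    print(f"{dimension}: MAE = {error:.2f}")
```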