"You can only manage what you can measure," a timeless dictum, resonates profoundly in the rapidly evolving landscape of artificial intelligence. This core principle underpinned the insightful discussion between Jeff Huber, CEO and co-founder of Chroma, and Jason Liu, a seasoned machine learning engineer, at the AI Engineer World's Fair in San Francisco. Their joint presentation illuminated practical strategies for AI practitioners to systematically understand and improve their applications by meticulously examining both input data and system outputs.
Huber began by addressing the critical challenge of evaluating retrieval systems, cautioning against reliance on guesswork or on expensive, time-consuming evaluations that use a Large Language Model (LLM) as a judge. He championed "Fast Evals," a method centered on creating "query and document pairs: if this query, then this document." This approach yields a "golden dataset" that enables rapid, inexpensive experimentation. Crucially, Huber noted that LLMs can even be employed to generate the synthetic queries themselves, though he warned that overly clean, unrealistic queries can mislead developers. He shared a compelling finding from a joint report with Weights & Biases: "The original embedding model used for this application was actually text-embedding-3-small. This actually performed the worst." The lesson: popular public benchmarks and models widely discussed on platforms like X (formerly Twitter) may not be optimal for a team's specific, real-world data, underscoring the necessity of custom evaluation.
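The mechanics are simple enough to sketch. Below is a minimal illustration of a fast eval in Python, assuming Chroma's default embedding function; the `GOLDEN_PAIRS`, `DOCUMENTS`, and `recall_at_k` names are hypothetical placeholders, not from the talk.

```python
import chromadb

# Hypothetical golden dataset of "if this query, then this document" pairs.
# In practice these are hand-labeled or LLM-generated and then reviewed so
# the queries stay realistic rather than suspiciously clean.
GOLDEN_PAIRS = [
    ("how do I rotate an api key", "doc-security-001"),
    ("which regions is the service deployed in", "doc-infra-007"),
    ("can I export my data as csv", "doc-export-003"),
]

DOCUMENTS = {
    "doc-security-001": "API keys can be rotated at any time from the security settings page.",
    "doc-infra-007": "The service is deployed in us-east-1, eu-west-2, and ap-southeast-1.",
    "doc-export-003": "Account data can be exported in CSV or JSON format from the dashboard.",
}

def recall_at_k(collection, pairs, k=1):
    """Fraction of queries whose expected document appears in the top k results."""
    hits = 0
    for query, expected_id in pairs:
        results = collection.query(query_texts=[query], n_results=k)
        if expected_id in results["ids"][0]:
            hits += 1
    return hits / len(pairs)

if __name__ == "__main__":
    # In-memory client: cheap to tear down and re-run, which is the point
    # of a fast eval compared to an LLM-as-a-judge pass.
    client = chromadb.Client()
    collection = client.create_collection(name="golden_eval")
    collection.add(ids=list(DOCUMENTS.keys()), documents=list(DOCUMENTS.values()))
    print(f"recall@1 = {recall_at_k(collection, GOLDEN_PAIRS, k=1):.2f}")
```

Because the golden set is fixed, comparing embedding models reduces to re-running the same script with a different `embedding_function` passed to `create_collection` and watching the recall number move; this is the kind of cheap, repeatable experiment that can surface a result like text-embedding-3-small underperforming on one's own data.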
