LLM Evaluators: Beyond Naive Judgments

Mahmoud Malaeb of Argenta discusses the limitations of naive LLM judges and introduces GEPA, an optimization framework for building more accurate LLM evaluators using a data flywheel approach.

AI Engineer

In the rapidly evolving world of AI agents, ensuring their reliability and adherence to complex policies is paramount. This challenge was the focus of a recent presentation by Mahmoud Malaeb, co-founder and CEO of Argenta, an LLM Ops platform. Malaeb discussed the critical need for robust LLM evaluation, particularly for agents operating in customer-facing roles, using the example of an airline customer service agent.


The Problem with Naive LLM Judges

Malaeb began by highlighting the shortcomings of 'naive' LLM judges: systems that return a simple 'compliant' or 'non-compliant' verdict without any deep understanding of the specific policies or context involved. He illustrated this with a scenario in which a judge incorrectly labels an agent as compliant simply because the interaction appears polite and handles basic details correctly, while missing subtle policy violations.

A key issue is the difficulty LLMs have in understanding and applying nuanced policy rules. For instance, an agent might offer compensation before the customer explicitly requests it, a policy violation that a naive judge can overlook if the interaction otherwise seems smooth. Similarly, agents may act on unverified customer claims, a behavior that requires specific policy knowledge to flag as problematic.
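A naive judge of this kind often amounts to a single zero-shot prompt. The sketch below is illustrative only; the prompt text and the `call_llm` parameter are assumptions, not Argenta's implementation:

```python
# A naive judge: one zero-shot prompt, no policy grounding. The prompt
# text and the `call_llm` helper are illustrative assumptions.
NAIVE_JUDGE_PROMPT = """You are reviewing a customer-service transcript.
Reply with exactly one word: COMPLIANT or NON-COMPLIANT.

Transcript:
{transcript}
"""

def parse_verdict(raw: str) -> str:
    """Normalize the model's free-text reply to a binary label."""
    text = raw.strip().upper()
    # Check the longer token first, since it contains "COMPLIANT".
    return "non-compliant" if "NON-COMPLIANT" in text else "compliant"

def naive_judge(transcript: str, call_llm) -> str:
    """call_llm: any function mapping a prompt string to a completion."""
    prompt = NAIVE_JUDGE_PROMPT.format(transcript=transcript)
    return parse_verdict(call_llm(prompt))
```

Nothing in this prompt mentions the airline's actual policies, which is exactly why such a judge defaults to surface cues like politeness.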

The GEPA Approach: Optimizing LLM Judges

To address these limitations, Malaeb introduced GEPA (Genetic-Pareto), an open-source prompt-optimization framework that can be applied to LLM judges. GEPA operates through a cyclical process:

  • Evaluating the current rubric: The process begins by evaluating existing prompts and rubrics against a batch of training examples.
  • Showing failures with annotations: Failures are identified and annotated to provide specific feedback.
  • Reflection and new prompt proposal: The LLM uses this feedback to propose an improved rubric or prompt.
  • Testing and iteration: The new rubric is tested, and if improvements are seen, it's kept; otherwise, it's discarded. This iterative process continues to refine the judge's performance.
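The loop above can be sketched in a few lines of Python. This is a hedged simplification: `evaluate` and `reflect_and_propose` stand in for GEPA's actual scoring and LLM-reflection machinery, which the framework implements internally.

```python
import random

def optimize_rubric(rubric, trainset, evaluate, reflect_and_propose, rounds=10):
    """Evaluate a judge rubric, reflect on annotated failures, propose a
    revision, and keep it only if it scores better on the full train set.
    `evaluate` returns (score, failures); `reflect_and_propose` maps a
    rubric plus failures to a new candidate rubric. Both are stand-ins
    for GEPA's internal steps, not its real API."""
    best_score, _ = evaluate(rubric, trainset)
    for _ in range(rounds):
        # Steps 1-2: evaluate on a small batch, collect annotated failures.
        batch = random.sample(trainset, min(8, len(trainset)))
        _, failures = evaluate(rubric, batch)
        if not failures:
            continue  # nothing to learn from this batch
        # Step 3: reflection proposes an improved rubric.
        candidate = reflect_and_propose(rubric, failures)
        # Step 4: keep the candidate only if it improves the full-set score.
        cand_score, _ = evaluate(candidate, trainset)
        if cand_score > best_score:
            rubric, best_score = candidate, cand_score
    return rubric, best_score
```

The accept-only-on-improvement rule is what makes the loop monotone: a bad proposal is discarded rather than compounding.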

The core idea is to create a data flywheel where the annotations serve as training signals, allowing the LLM to learn and adapt its evaluation criteria over time. This iterative refinement is crucial for building LLM judges that can accurately assess complex agent behaviors.

The Data and Workflow

Malaeb showcased the application of GEPA using the Tau-2 benchmark dataset, which contains simulated customer service interactions. The dataset includes conversation traces, tool usage, and ground truth assertions about agent behavior. For the specific use case of an airline customer support agent, the data involved interactions related to managing reservations, accessing flight information, and handling complex policies.
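To make the data concrete, a single annotated trace might look roughly like the record below. The field names are assumptions for illustration, not the literal Tau-2 schema:

```python
# Illustrative shape of one annotated trace: a conversation, the tools
# the agent invoked, and a ground-truth compliance assertion.
trace = {
    "conversation": [
        {"role": "user", "content": "My flight was delayed. I want a refund."},
        {"role": "agent", "content": "I'm sorry to hear that. I've added a voucher."},
    ],
    "tool_calls": [
        {"name": "get_reservation", "args": {"reservation_id": "ABC123"}},
    ],
    "ground_truth": {
        "label": "non-compliant",
        "reason": "Offered compensation before the customer requested it.",
    },
}
```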

The workflow involves several key steps:

  • Metric Design: Defining relevant metrics is critical, with the best metrics often derived from the business use case itself.
  • Annotation: Creating detailed annotations for each trace, specifying compliance and the reasoning behind it. This step is crucial for training the LLM judge.
  • Optimization: Applying the GEPA algorithm to iteratively refine the prompt or rubric based on the annotated data.
  • Validation: Evaluating the performance of the optimized LLM judge on unseen data to ensure its effectiveness.
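The validation step depends on holding out traces the optimizer never saw. A minimal split helper, assuming the annotated traces live in a plain list (the function name and split scheme are illustrative, not part of GEPA):

```python
import random

def split_traces(traces, val_frac=0.2, seed=0):
    """Shuffle annotated traces and hold out a fraction for validating
    the optimized judge on unseen data."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(traces)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_frac))
    return shuffled[:cut], shuffled[cut:]
```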

Baseline Performance and What Went Wrong

Malaeb presented baseline results from a naive judge, showing a low accuracy of 61.6% and a significant bias towards 'compliant' verdicts, with non-compliant recall at a mere 2.3%. This highlights the inadequacy of simple, uncalibrated judges.
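The non-compliant recall figure is what exposes the bias: on a dataset skewed toward compliant traces, a judge that labels nearly everything 'compliant' can still post passable accuracy. A minimal computation of both metrics:

```python
def judge_metrics(y_true, y_pred, positive="non-compliant"):
    """Overall accuracy plus recall on the non-compliant class, the
    number that reveals a judge biased toward 'compliant' verdicts."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Recall: of the truly non-compliant traces, how many were caught?
    positives = [(t, p) for t, p in pairs if t == positive]
    recall = (sum(t == p for t, p in positives) / len(positives)
              if positives else 0.0)
    return accuracy, recall
```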

The analysis revealed that the naive judge's failures stemmed from its inability to understand policy nuances. It could not identify violations such as cancelling a reservation without verifying criteria or offering compensation without being asked. The system's lack of specific policy knowledge led to misclassifications, underscoring the need for a more sophisticated approach.

The Power of GEPA

The GEPA framework, by incorporating reflection and reasoning into the optimization process, aims to overcome these limitations. It allows for the systematic improvement of LLM judges by learning from failures and iteratively refining the evaluation criteria. This approach is not limited to prompt optimization but can also be applied to other algorithmic parameters, making it a versatile tool for enhancing LLM performance in complex, real-world scenarios.

Malaeb concluded by emphasizing that building effective LLM evaluators is an ongoing process, driven by data and iterative refinement. GEPA provides a structured way to achieve this, moving beyond naive judgments to create more reliable and accurate AI agents.
