In the burgeoning era of artificial intelligence, a crucial question arises: "How can you evaluate all of the text that AI spits out?" IBM's Zahra Ashktorab tackles this question in a recent video, exploring how Large Language Models (LLMs) can be leveraged to judge the outputs of other LLMs, a concept known as "LLM-as-a-judge."
Ashktorab spoke about LLM evaluation strategies at IBM's Think Series, focusing on the benefits and drawbacks of using AI to assess AI. This approach, she argues, offers a scalable alternative to traditional metrics and manual labeling, which can be time-consuming and may not always be suitable for the task at hand.
A core insight is the efficiency gained through automation. As Ashktorab notes, if you've ever tried manually labeling hundreds of outputs, whether chatbot replies or summaries, you know it's a lot of work. The "LLM-as-a-judge" paradigm offers two primary strategies for reference-free evaluation: direct assessment and pairwise comparison. Direct assessment involves designing a rubric against which outputs are evaluated, while pairwise comparison asks the model to choose the better option between two outputs.
The video notes that roughly half of the participants preferred direct assessment because of its clarity and the control it gave them over the rubric. Everything hinges on how that rubric is designed. When evaluating summaries, for instance, one might ask, "Is the summary clear and coherent?" and provide answer options such as "yes" or "no." Each output is then evaluated against this rubric.
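To make the mechanics concrete, here is a minimal sketch of what a rubric-driven direct-assessment loop could look like in Python. The rubric questions, function names, and the `call_judge_llm` placeholder are illustrative assumptions, not code from the video; you would swap in whatever judge model API you actually use.

```python
# Hypothetical sketch of rubric-based direct assessment (illustrative only).

RUBRIC = {
    "clarity": "Is the summary clear and coherent? Answer 'yes' or 'no'.",
    "faithfulness": "Does the summary stay faithful to the source text? Answer 'yes' or 'no'.",
}

def call_judge_llm(prompt: str) -> str:
    """Placeholder: replace with a real call to your judge model."""
    raise NotImplementedError

def direct_assessment(source_text: str, summary: str) -> dict:
    """Ask the judge model one rubric question per criterion and collect the answers."""
    scores = {}
    for criterion, question in RUBRIC.items():
        prompt = (
            "You are evaluating a summary.\n\n"
            f"Source text:\n{source_text}\n\n"
            f"Summary:\n{summary}\n\n"
            f"{question}"
        )
        scores[criterion] = call_judge_llm(prompt).strip().lower()
    return scores
```

Running a loop like this over a batch of outputs produces per-criterion labels that can be aggregated automatically, which is what makes the approach scale beyond manual labeling.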
Flexibility emerges as another key advantage. Traditional evaluation metrics tend to be rigid, whereas LLM-as-a-judge lets evaluators refine their prompts as they go. This matters because evaluation criteria often shift as more data is reviewed.
However, Ashktorab cautions against potential biases, emphasizing that, just like humans, LLMs have their blind spots. She highlights positional bias, where an LLM may favor an output simply because of where it appears in the prompt, and verbosity bias, where longer outputs are favored regardless of content quality. "There's also the case where a model might favor an output because it recognizes that it created the output," she adds, a phenomenon known as self-enhancement bias.
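One common mitigation for positional bias, not described in the video but widely used in practice, is to ask the judge twice with the candidate order swapped and only accept verdicts that agree. Below is a hedged sketch of a pairwise comparison with that check, again using a hypothetical `call_judge_llm` placeholder.

```python
# Hypothetical sketch of pairwise comparison with a positional-bias check.
# The judge is queried twice with the candidate order swapped; a winner is
# declared only when both orderings agree, otherwise the result is a tie.

def call_judge_llm(prompt: str) -> str:
    """Placeholder: replace with a real call to your judge model."""
    raise NotImplementedError

def pairwise_vote(task: str, output_a: str, output_b: str) -> str:
    """Ask the judge which of two responses is better, returning 'A' or 'B'."""
    prompt = (
        f"Task:\n{task}\n\n"
        f"Response A:\n{output_a}\n\n"
        f"Response B:\n{output_b}\n\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    )
    return call_judge_llm(prompt).strip().upper()

def pairwise_compare(task: str, output_1: str, output_2: str) -> str:
    """Return '1', '2', or 'tie', swapping order to detect positional bias."""
    first = pairwise_vote(task, output_1, output_2)   # output_1 shown as A
    second = pairwise_vote(task, output_2, output_1)  # output_1 shown as B
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"  # inconsistent verdicts suggest positional bias
```

Requiring agreement across both orderings trades some throughput for verdicts that are less sensitive to where an output happens to appear in the prompt.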
Despite these challenges, Ashktorab concludes that the approach is a promising one: "If you're tired of manually evaluating output, LLM as a judge might be a good option for scalable, transparent and nuanced evaluation."

