The current race to build ever more capable artificial intelligence models often prioritizes technical benchmarks over the nuanced reality of human interaction. Andrew Gordon, Staff Researcher in Behavioral Science, and Nora Petrova, AI Researcher, both from Prolific, contend that this focus creates a significant disconnect. In a recent interview, they meticulously dissected the flaws in conventional AI evaluation, advocating for a more "humane" approach to truly measure a model's utility, safety, and relatability for real people.
Gordon vividly illustrates this disconnect with an analogy: "A car that wins a Formula 1 race [is not] the best choice for your morning commute." Similarly, an AI model that achieves an "incredibly good" score on academic benchmarks such as Humanity's Last Exam or MMLU "might be [an] absolute nightmare to use day-to-day." This encapsulates the core problem: technical prowess does not automatically translate into a practical, beneficial human experience.
The landscape of AI evaluation is, as Gordon describes, "incredibly nascent" and "fractured." Labs have no standardized method for reporting performance, so each publishes its own selection of metrics. Some emphasize MMLU scores, others highlight different benchmarks, and some offer no public data at all. This heterogeneity makes meaningful comparison difficult and risks creating a "leaderboard illusion," in which models are optimized for narrow, technical tests rather than broad human utility.
