"We've been using metrics designed for a different era, for signal fidelity, not for subjective human experience," remarked Diego Rodriguez, co-founder of Krea.ai, shedding light on a critical bottleneck in the advancement of generative AI. His observation cuts to the core of the challenge facing developers and researchers: how do you objectively measure something as inherently subjective as aesthetic quality or human perception in AI-generated media? This question formed the backbone of his recent discussion on perceptual evaluations.
Diego Rodriguez spoke in a special session about how Krea.ai is tackling the hardest kinds of evaluations: those for aesthetics and generative images and video. He highlighted that current AI evaluation methods often fall short, failing to capture the nuances of human perception. While metrics like PSNR (peak signal-to-noise ratio) or FID (Fréchet Inception Distance) have served their purpose in assessing objective fidelity, they are poor proxies for how a human actually experiences an image or video.
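To make the gap concrete, here is a minimal sketch of PSNR, the kind of pixel-level fidelity measure Rodriguez is contrasting with perceptual judgment. The example images named in the comments are hypothetical; the point is only that a metric comparing pixels at fixed positions can penalize changes a viewer would never notice.

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio: a purely pixel-level fidelity measure."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_value ** 2) / mse)

# Hypothetical comparison of two edits of the same photo:
#   - a slightly blurred copy, perceptually almost identical to the original
#   - a copy shifted by a few pixels, perceptually identical to the original
# PSNR can score the shifted copy far worse than the blurred one, because it
# measures pixel-wise error at fixed positions rather than what a viewer sees.
```

Metrics of this kind answer "how much information was lost?", not "does this look good?", which is exactly the mismatch the talk addressed.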
The historical context of evaluation metrics, often rooted in image compression, explains this disconnect. Early metrics were developed to quantify how well a compressed image retained its original information, focusing on pixel-level accuracy. But as Rodriguez pointed out, "The challenge isn't just generating beautiful images; it's knowing *why* they're beautiful, and building systems that understand that." This distinction is paramount for generative models that aim to create novel, compelling content rather than merely reproduce existing data.
The limitations of current AI evaluation are profound. Relying on objective, technical metrics for inherently human-centric tasks means that AI models are often optimized for the wrong criteria. This can lead to systems that technically perform well but fail to resonate with users on a perceptual or emotional level. It's a fundamental flaw in the feedback loop for AI development.
Rethinking evaluation is not merely an academic exercise; it is essential for the future of AI. Without robust, perceptually aligned metrics, progress in areas like creative AI or immersive experiences risks being misdirected or stalled. If we cannot accurately assess the quality of our models from a human perspective, then we are, in essence, developing in the dark.
Krea.ai is stepping into this void, aiming to develop methodologies and tools that bridge the gap between objective measurement and subjective human perception. Their work is critical for an industry increasingly reliant on outputs that appeal directly to human senses and sensibilities. This effort extends beyond just developing new metrics; it involves "evaluating our evaluations," ensuring that the assessment tools themselves are valid and reliable. As Rodriguez emphasized, "If we can't properly evaluate our models, we're essentially flying blind in the most critical areas of generative AI development." Krea.ai is building the compass.
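One common way to "evaluate an evaluation" is to check how well a candidate metric's scores agree with human preference judgments on the same outputs. The sketch below uses Spearman rank correlation on hypothetical data; it illustrates the general idea of meta-evaluation, not Krea.ai's own methodology.

```python
from scipy.stats import spearmanr

# Hypothetical data: scores from a candidate automatic metric and the mean
# aesthetic rating collected from human raters, for the same set of images.
metric_scores = [0.62, 0.81, 0.44, 0.90, 0.55, 0.73]
human_ratings = [3.1, 4.2, 2.8, 4.6, 3.4, 3.9]

# Rank correlation between the metric and human judgment: a high value means
# the metric orders images roughly the way people do, which is the property
# that matters when the metric is used to steer model development.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A metric that correlates poorly with human judgment on held-out outputs is, in Rodriguez's terms, part of the problem of "flying blind," no matter how precise it is numerically.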

