The latest episode of IBM's "Mixture of Experts" podcast, hosted by Tim Hwang, convened a panel of AI thought leaders—Marina Danilevsky, Gabe Goodhart, and Merve Unuvar—to dissect the rapid advancements and inherent challenges in the artificial intelligence landscape. Their discussion centered on Google's recent release of Gemini 3 and the burgeoning field of AI agent innovation, particularly IBM's own CUGA framework, alongside a critical look at how we evaluate AI's real-world impact.
Google’s Gemini 3 has emerged with considerable fanfare, boasting "explosively good performance" on challenging benchmarks like Humanity's Last Exam and ARC-AGI, as noted by host Tim Hwang. These impressive scores, however, belie a more nuanced reality. Senior Research Scientist Marina Danilevsky observed that Gemini 3, much like its predecessors, "is still hallucinating and it still really likes to give answers rather than say that it doesn't know the answers." This persistent tendency to fabricate information, even in advanced models, underscores a fundamental limitation that benchmarks often overlook.