Tim Hwang, host of IBM's "Mixture of Experts" podcast, recently convened a panel of IBM Senior Research Scientists Marina Danilevsky and Nathalie Baracaldo, alongside AI Research Engineer Sandi Besen, to dissect critical developments in artificial intelligence. Their discussion spanned the sobering reality of generative AI pilots, the revelation of a hidden prompt within GPT-5, and inherent flaws in large reasoning models. The conversation painted a picture of an industry grappling with misaligned expectations and the profound implications of AI's increasing autonomy.
A recent report from MIT's NANDA initiative casts a stark shadow over the enterprise AI landscape, revealing that an astonishing 95% of generative AI pilots are falling short of expectations. This figure, as host Tim Hwang notes, indicates that initial deployments are "not really anywhere near the expectations of the people implementing them." Sandi Besen, while acknowledging the headline-grabbing nature of such a number, wanted more context on the study's methodology, particularly how ROI was measured and who was surveyed. She found the 95% figure "too high for what I think the capabilities of this technology is," suggesting a fundamental disconnect between expectations and what the technology can currently deliver.
Marina Danilevsky pinpointed the disconnect precisely: "There continues to be a misalignment of expectations... between leaders and maybe C-suite executives and what they have been maybe seeing through some marketing, some really specific demos... and what ends up actually happening." That gap between perception and reality is often what derails AI initiatives.
Further complicating the pursuit of effective AI, the panel delved into the discovery of a "shadow system prompt" within GPT-5, operating beyond user-editable parameters. Tim Hwang highlighted the unease this creates for developers who desire full transparency, stating, "If I am using a model through an API, I want to know everything that's going through the model." Sandi Besen conceded that such hidden layers are "to be expected" within the architecture of AI frameworks, yet stressed the developer's need for transparency to understand a model's alignment and behavior. Marina Danilevsky delivered a pointed warning: "These models should never be deployed as a serious application naked. Put some clothes on." This underscores the critical need for developers to fully understand and "clothe" their AI applications with robust, transparent controls, rather than blindly trusting external model providers.
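Danilevsky's warning translates into a simple engineering habit: wrap the model in application-owned instructions and checks rather than relying on whatever the provider injects behind the scenes. The sketch below is a minimal illustration of that idea, assuming the OpenAI Python SDK's chat-completions interface; the "gpt-5" model identifier, the Acme support prompt, and the length check are placeholders of our own, not anything prescribed on the podcast.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Application-owned instructions: the "clothes" the developer controls,
# regardless of any hidden prompt the provider may add on its side.
APP_SYSTEM_PROMPT = (
    "You are a customer-support assistant for Acme Corp. "
    "Answer only questions about Acme products, and say 'I don't know' "
    "rather than guessing when you are unsure."
)

def answer(user_question: str) -> str:
    """Call the model with an explicit system prompt and a basic output check."""
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder; substitute whatever model identifier you actually use
        messages=[
            {"role": "system", "content": APP_SYSTEM_PROMPT},
            {"role": "user", "content": user_question},
        ],
    )
    text = response.choices[0].message.content or ""

    # Minimal application-side guardrail: don't blindly pass through empty
    # or runaway outputs; a real deployment would add stricter validation.
    if not text.strip() or len(text) > 4000:
        return "Sorry, I couldn't produce a reliable answer to that."
    return text.strip()
```

The specific guardrail matters less than the fact that the system prompt and output checks live in code the team can inspect and version, which is exactly the transparency the panel argues a hidden prompt layer denies.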
The discussion then shifted to the reliability of large reasoning models, referencing a paper titled "Large Reasoning Models Are Not Thinking Straight." Tim Hwang summarized the paper's findings, noting instances where models either "engage in all sorts of change of thought that aren't very productive" or "prematurely disengage from promising chains of thought." Sandi Besen observed that the models tested, all distilled from DeepSeek-R1, tend to "think for a really long time" before settling on an answer, raising questions about efficiency and token expenditure.
Marina Danilevsky offered a crucial perspective, asserting that "chain of thought is not the reasoning part, it's sort of a post-hoc approximation." She drew a parallel to human decision-making, where our verbalized reasoning doesn't always perfectly reflect our internal cognitive processes. This suggests that attributing human-like "thinking" to a model's chain-of-thought output can be misleading. The experts collectively stressed the importance of focusing on practical, problem-solving applications of AI rather than succumbing to the hype that often precedes real-world utility. Nathalie Baracaldo, embracing the challenges, affirmed, "When something doesn't work, I get really excited because it means that we can make it work." The evolving landscape of AI demands continuous iteration and a pragmatic approach to integrating these powerful tools into enterprise environments.

