"It turns out, rather concerningly, that nobody really knows the answer" to what is actually happening inside large language models, said Stuart Ritchie of Anthropic Research Communications. That candid admission opened a discussion with fellow Anthropic researchers Josh Batson, Emmanuel Ameisen, and Jack Lindsey about AI interpretability, the young field that aims to explain the internal "thinking" processes of models like Claude.
Large language models are often dismissed as "glorified autocompletes." That simplification misses the complexity of their internal mechanisms. As Josh Batson put it, these models are not programmed with explicit "if the user says hi, you should say hi" rules. Instead, they are "trained" through a gradual, evolution-like process, where "a whole lot of data... goes in and the model... starts out being really bad at saying anything and then its inside parts get tweaked." As a result, the final model bears "little resemblance to what it started as," much like a biological organism whose ultimate goal (survival and reproduction) is achieved through myriad, often opaque, internal processes.
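To make the "inside parts get tweaked" idea concrete, here is a deliberately tiny sketch of training by gradient descent. It is an illustration only, not how Claude or any production model is trained: the single-parameter "model", the function name `train`, and the toy data are all assumptions introduced for this example.

```python
# Toy illustration only: a one-parameter "model" fitted by gradient descent.
# The names train, weight, and learning_rate are hypothetical and chosen
# for this example; they do not come from the discussion.

def train(examples, steps=200, learning_rate=0.01):
    weight = 0.0  # starts out "really bad": predicts 0 for every input
    for _ in range(steps):
        for x, target in examples:
            prediction = weight * x
            error = prediction - target
            # Gradient of the squared error 0.5 * error**2 with respect
            # to weight is error * x, so nudge weight against it.
            weight -= learning_rate * error * x
    return weight

# No rule "multiply by 2" is ever written down; it is only implied by the data.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learned = train(examples)
print(round(learned, 3))
```

Nothing in the code states the rule the data follows, yet after many small adjustments the weight settles very close to 2. Real models repeat the same kind of nudge across billions of parameters, which is part of why the resulting internals are so hard to read.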
