The impressive performance of large language models on standard code-generation benchmarks masks a critical vulnerability: reliance on memorization over genuine reasoning. Frontier models that achieve near-ceiling scores on familiar tasks falter dramatically on challenges that deviate from their pre-training data.
The Esoteric Language Divide
To probe this limitation, researchers introduced EsoLang-Bench, a benchmark built on five esoteric programming languages: Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare. These languages were chosen because they are economically irrational targets for pre-training data collection, with 1,000x to 100,000x fewer public repositories than Python. They demand the same fundamental computational primitives as mainstream languages (the sketch below makes this concrete), yet are unlikely to appear in typical training sets in any meaningful quantity.

The arXiv preprint reveals a stark capability gap: models scoring 85-95% on standard benchmarks plummeted to 0-11% on equivalent esoteric tasks, with zero accuracy beyond the 'Easy' tier. Crucially, few-shot prompting and self-reflection techniques offered no improvement, underscoring that these models exploit training priors rather than demonstrating flexible learning.
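To see why the "same primitives" claim holds, consider Brainfuck: its eight operators cover pointer movement, increment/decrement arithmetic, I/O, and conditional loops, which is enough to express any computation a mainstream language can. The following is a minimal sketch, not drawn from the benchmark itself; the `run_bf` interpreter and the `ADD` program are illustrative assumptions, used here to add two ASCII digits:

```python
# Minimal Brainfuck interpreter: eight operators, yet Turing-complete.
# (Illustrative sketch; not the benchmark's harness.)
def run_bf(code: str, stdin: str = "") -> str:
    tape, ptr, out = [0] * 30000, 0, []
    inp = iter(stdin)

    # Pre-match brackets so loops can jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    pc = 0
    while pc < len(code):
        c = code[pc]
        if c == ">":                       # move pointer right
            ptr += 1
        elif c == "<":                     # move pointer left
            ptr -= 1
        elif c == "+":                     # increment current cell (mod 256)
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":                     # decrement current cell (mod 256)
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":                     # output current cell as a char
            out.append(chr(tape[ptr]))
        elif c == ",":                     # read one input char (0 on EOF)
            tape[ptr] = ord(next(inp, "\0"))
        elif c == "[" and tape[ptr] == 0:  # skip loop body if cell is zero
            pc = jumps[pc]
        elif c == "]" and tape[ptr] != 0:  # jump back while cell is nonzero
            pc = jumps[pc]
        pc += 1
    return "".join(out)

# Add two ASCII digits and print the (single-digit) sum:
# read a, subtract 48; read b; add b onto a; print the result.
ADD = ",>++++++[<-------->-],[<+>-]<."
print(run_bf(ADD, "23"))  # -> "5"
```

A model that genuinely reasons about these primitives should be able to compose programs like `ADD` from first principles; the benchmark's results suggest that frontier models instead pattern-match against the Python-like code that dominates their training data.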