Instruction-tuned large language models (LLMs) look robust, but they falter under seemingly minor lexical restrictions. Researchers have shown that banning even a single punctuation mark or common word reduces response comprehensiveness by 14% to 48% across multiple model families, including GPT-4o-mini. The result points to a fundamental brittleness in how these models generate responses when a simple surface-form constraint is imposed.
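To make the kind of restriction concrete, here is a minimal sketch of a checker for a single banned punctuation mark or word. The function name `violates_ban` and its matching rules are illustrative assumptions, not the researchers' actual tooling.

```python
import re


def violates_ban(response: str, banned: str) -> bool:
    """Return True if `response` uses the banned punctuation mark or word.

    Assumptions (hypothetical, for illustration only):
    - a single non-alphanumeric character is treated as punctuation and
      matched anywhere in the text;
    - anything else is treated as a word and matched case-insensitively
      on word boundaries, so banning "the" does not flag "theory".
    """
    if len(banned) == 1 and not banned.isalnum():
        return banned in response
    pattern = r"\b" + re.escape(banned) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None
```

A check like this only verifies compliance with the ban. It says nothing about whether the compliant response is still comprehensive, which is the quantity that collapses.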
Instruction Tuning Creates Fragile Surface-Form Dependencies
A core finding is that this collapse is a direct artifact of instruction tuning. Base models subjected to identical constraints show no systematic degradation: their responses remain largely unaffected, with only small, noisy, bidirectional effects. Instruction-tuned models, by contrast, tie task competence to narrow surface-form templates. The coupling is strong enough that GPT-4o-mini, a commercially deployed model, loses 31% of its comprehensiveness under constraint, and its unconstrained baseline responses win 99% of pairwise comparisons. The fragility is therefore a real-world issue, not one confined to research models.
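As a rough illustration of the pairwise metric, the sketch below computes a baseline win rate from a list of per-prompt judgments. The labels `"baseline"`, `"constrained"`, and `"tie"`, and the choice to exclude ties from the denominator, are assumptions made for this example; the study's exact protocol may differ.

```python
from collections import Counter


def baseline_win_rate(judgments: list[str]) -> float:
    """Fraction of decided comparisons won by the unconstrained baseline.

    `judgments` holds one label per prompt: "baseline", "constrained",
    or "tie" (hypothetical labels). Ties are excluded from the
    denominator, which is an assumption of this sketch.
    """
    counts = Counter(judgments)
    decided = counts["baseline"] + counts["constrained"]
    if decided == 0:
        raise ValueError("no decided comparisons to score")
    return counts["baseline"] / decided


# Usage: 99 baseline wins out of 100 decided comparisons gives 0.99.
example = ["baseline"] * 99 + ["constrained"]
rate = baseline_win_rate(example)
```

A base model with small, bidirectional effects would be expected to score near 0.5 on this metric, while a rate near 1.0 indicates that the constrained responses are almost always judged worse.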