LLM Fragility Under Lexical Constraints

Instruction Tuning Creates Fragile Surface-Form Dependencies

A core finding is that the response collapse seen under lexical constraints is a direct artifact of instruction tuning. Base models subjected to identical constraints show no systematic degradation: their responses change only through small, noisy, bidirectional effects. Instruction-tuned models, by contrast, tie task competence to narrow surface-form templates. The coupling is strong enough that GPT-4o-mini, a commercially deployed model, loses 31% of its comprehensiveness under constraint, with its unconstrained baseline responses winning 99% of pairwise comparisons. The fragility is therefore a real-world issue that affects deployed systems.

Planning Failures Underpin Response Collapse

Mechanistic analysis points to a planning failure as the root cause. Models struggle to adapt their generation strategy once a constraint is introduced, and comprehensiveness drops as a result. A two-pass approach (free generation followed by constrained rewriting) recovers 59%-96% of the response length, which is consistent with a failure at the planning stage rather than an inability to write under the constraint. The underlying issue nonetheless stems from the instruction-tuning process itself: linear probes on prompt representations reveal that instruction tuning builds a representational structure that encodes this 'collapse decision' before generation even begins, a structure absent in base models.
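The two-pass idea described above (free generation, then constrained rewriting) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: `generate` is a hypothetical callable that wraps whatever model is being tested, the rewrite prompt wording is invented here, `toy_model` is a stand-in so the sketch runs without a model, and `length_ratio` is only one plausible way to measure how much response length is recovered.

```python
import re
from typing import Callable, Iterable


def violates(text: str, banned: Iterable[str]) -> list[str]:
    """Return the banned words found in text (whole-word, case-insensitive)."""
    found = []
    for word in banned:
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            found.append(word)
    return found


def two_pass_generate(generate: Callable[[str], str], prompt: str,
                      banned: list[str]) -> dict:
    """Pass 1 answers freely; pass 2 rewrites the draft under the constraint."""
    # Pass 1: no constraint is mentioned, so the model plans its answer freely.
    draft = generate(prompt)
    # Pass 2: the constraint is applied to an already complete draft.
    # The prompt wording below is an assumption made for this sketch.
    rewrite_prompt = (
        "Rewrite the text below so that it keeps all of its content "
        f"but never uses these words: {', '.join(banned)}.\n\n{draft}"
    )
    final = generate(rewrite_prompt)
    return {"draft": draft, "final": final,
            "violations": violates(final, banned)}


def length_ratio(final: str, draft: str) -> float:
    """Word count of the rewrite relative to the free draft (assumed metric)."""
    draft_words = len(draft.split())
    return len(final.split()) / draft_words if draft_words else 0.0


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model, used only to run the sketch."""
    if prompt.startswith("Rewrite"):
        draft = prompt.split("\n\n", 1)[1]
        # Crude "rewrite": drop the banned word "the" from the draft.
        return re.sub(r"\bthe\b\s*", "", draft, flags=re.IGNORECASE)
    return "The cat sat on the mat near the door."


result = two_pass_generate(toy_model, "Describe the scene.", ["the"])
print(result["final"])
print(result["violations"], length_ratio(result["final"], result["draft"]))
```

In practice `toy_model` would be replaced by a call to the model under test, and the ratio would be computed against that model's unconstrained baseline responses.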

Standard Evaluation Methods Miss Critical Degradation

Compounding the problem, standard LLM-as-judge evaluation is inadequate for assessing performance under constraints. It detects only a 3.5% average quality drop, whereas pairwise evaluation reveals a far more severe 23% degradation. The discrepancy exposes a methodological blind spot in how the robustness and quality of constrained generation are currently assessed, and it may lead to overestimating model capabilities in real-world applications where such constraints arise inadvertently.
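To make the gap between the two evaluation styles concrete, the sketch below computes both kinds of summary from judge outputs. It is an illustration under assumptions, not the study's scoring code: the function names, the rule that ties count as half a win, and the toy inputs are all invented here, and the exact definition behind the reported 23% pairwise figure is not given in this summary.

```python
from statistics import mean


def absolute_drop(baseline_scores: list[float],
                  constrained_scores: list[float]) -> float:
    """Relative drop in the mean judge score when each response is scored alone."""
    base = mean(baseline_scores)
    return (base - mean(constrained_scores)) / base


def baseline_win_rate(verdicts: list[str]) -> float:
    """Share of head-to-head comparisons won by the unconstrained baseline.

    Ties count as half a win (an assumption made for this sketch).
    """
    points = {"baseline": 1.0, "tie": 0.5, "constrained": 0.0}
    return mean(points[v] for v in verdicts)


# Toy inputs for illustration only; these are not results from the study.
baseline_scores = [8, 9, 8, 9]      # judge scores for unconstrained responses
constrained_scores = [8, 8, 8, 9]   # judge scores for constrained responses
verdicts = ["baseline", "baseline", "baseline", "tie"]

print(f"absolute drop: {absolute_drop(baseline_scores, constrained_scores):.1%}")
print(f"baseline win rate: {baseline_win_rate(verdicts):.1%}")
```

The point of the toy inputs is that absolute scores can stay nearly flat while a head-to-head judge still prefers the baseline almost every time, which is the pattern the paragraph above describes.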

AI Daily Digest