As large language models (LLMs) are increasingly integrated into robotic planning systems, a critical and largely unaddressed question arises: how safely do they plan? A new benchmark, DESPITE, comprising 12,279 tasks, reveals a stark gap between planning capability and safety adherence across 23 evaluated models. The findings, published on arXiv, suggest that even models with near-perfect task completion falter significantly on safety, a serious shortcoming for real-world deployment.
The Safety Deficit in Scaled Planning
Across 18 open-source models, ranging from 3B to 671B parameters, a clear trend emerged: planning ability scales dramatically with model size, rising from 0.4% to 99.3%. Safety awareness, by contrast, remained largely stagnant, hovering between 38% and 57%. Crucially, the researchers identified a multiplicative relationship between the two: larger models complete more tasks primarily because they plan better, not because they avoid danger better. This indicates that scaling alone is insufficient to instill safety awareness.
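One way to read the multiplicative relationship is as a toy model in which the rate of safe task completion is the product of planning ability and safety awareness. The sketch below assumes that form, which is not taken from the benchmark itself, and the function name `safe_completion` is hypothetical. It uses only the range endpoints quoted above, and pairing them is purely illustrative, since the figures are not tied to specific models.

```python
def safe_completion(planning: float, safety: float) -> float:
    """Toy model (an assumption): share of tasks that are both planned
    successfully and carried out safely, taken as a simple product."""
    return planning * safety


# Range endpoints quoted in the article, expressed as fractions.
PLANNING_LOW, PLANNING_HIGH = 0.004, 0.993
SAFETY_LOW, SAFETY_HIGH = 0.38, 0.57

# Improve planning across its full range while safety stays at its low end.
planning_gain = (
    safe_completion(PLANNING_HIGH, SAFETY_LOW)
    / safe_completion(PLANNING_LOW, SAFETY_LOW)
)

# Improve safety across its full range while planning stays at its high end.
safety_gain = (
    safe_completion(PLANNING_HIGH, SAFETY_HIGH)
    / safe_completion(PLANNING_HIGH, SAFETY_LOW)
)

print(f"gain from planning alone: {planning_gain:.0f}x")
print(f"gain from safety alone:   {safety_gain:.1f}x")
```

Under this assumed model, the planning range accounts for a roughly 248-fold change in safe completions, while the safety range accounts for only a 1.5-fold change. It also implies that a near-perfect planner would still be capped by its safety rate.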