LLMs Plan, But Do They Plan Safely?

New LLM robotic safety benchmark, DESPITE, finds scale boosts planning but not safety. Proprietary models lead, revealing a critical gap for safe robotic deployment.

Figure: Diverse robotic tasks and potential dangers evaluated by the DESPITE benchmark, which assesses LLM-driven robotic planning across physical and normative dangers.

The growing use of large language models (LLMs) in robotic planning systems raises a critical and largely unaddressed question: how safely do they plan? A new benchmark, DESPITE, comprising 12,279 tasks, reveals a stark split between planning capability and safety awareness across 23 evaluated models. The findings, reported in a paper on arXiv, suggest that even models with near-perfect planning ability falter significantly on safety, a serious gap for real-world deployment.

The Safety Deficit in Scaled Planning

Across 18 open-source models, ranging from 3B to 671B parameters, a clear trend emerged: planning ability scales dramatically with model size, rising from 0.4% to 99.3%. Safety awareness, however, stayed largely flat, ranging between 38% and 57%. The researchers also identified a multiplicative relationship between the two: larger models complete more tasks mainly because they plan better, not because they avoid danger better. Scaling alone, in other words, does not instill safety awareness.
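To make the multiplicative relationship concrete, here is a minimal Python sketch. It assumes the relationship can be read as "share of tasks completed safely ≈ planning ability × safety awareness", which is an interpretation of the article's wording rather than a formula quoted from the paper. The function name `safe_completion_rate` is hypothetical, and pairing the lowest planning score with the lowest safety score (and highest with highest) is an illustrative assumption: the article does not say which model produced which figure.

```python
def safe_completion_rate(planning: float, safety: float) -> float:
    """Estimate the share of tasks completed safely under an ASSUMED
    multiplicative model: planning ability times safety awareness."""
    if not (0.0 <= planning <= 1.0 and 0.0 <= safety <= 1.0):
        raise ValueError("rates must be fractions in [0, 1]")
    return planning * safety


# Endpoints of the ranges reported for the 18 open-source models.
# Pairing low with low and high with high is hypothetical.
weakest = safe_completion_rate(planning=0.004, safety=0.38)
strongest = safe_completion_rate(planning=0.993, safety=0.57)

# How much each factor grows across the reported range.
planning_gain = 0.993 / 0.004
safety_gain = 0.57 / 0.38

print(f"weakest:   {weakest:.4%} of tasks completed safely")
print(f"strongest: {strongest:.4%} of tasks completed safely")
print(f"planning grows ~{planning_gain:.0f}x, safety grows ~{safety_gain:.1f}x")
```

Under these assumptions, almost all of the improvement comes from the planning factor, which grows by a factor of a few hundred, while the safety factor grows by only about half. The product form also implies that the safety score caps the result: a model that plans perfectly but recognizes danger 57% of the time cannot exceed a 57% safe completion rate.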

Proprietary Models Lead the Safety Charge

Proprietary models diverge from this pattern. Those with reasoning capabilities achieved significantly higher safety awareness, registering between 71% and 81%. In contrast, proprietary models without reasoning and open-source reasoning models all scored below 57%, the top of the range seen for open-source models. This gap underscores the current limitations of open-source LLMs in robotic safety and points to a promising direction for future research on LLM safety in robotics.

The Next Frontier: Safety Awareness

As frontier LLMs approach saturation in planning capabilities, the paramount challenge for deploying them in robotic systems shifts decisively towards improving safety awareness. The DESPITE benchmark provides a rigorous framework for evaluating this critical aspect, highlighting the need for novel architectures and training methodologies focused explicitly on preventing dangerous actions. Addressing this deficit is essential for unlocking the full potential of LLMs in safety-critical robotic applications.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.