LLMs Plan, But Do They Plan Safely?

The Safety Deficit in Scaled Planning

Across 18 open-source models ranging from 3B to 671B parameters, a clear trend emerged: planning ability scales dramatically with model size, rising from 0.4% to 99.3%. Safety awareness, however, remained largely flat, hovering between 38% and 57%. Crucially, the researchers identified a multiplicative relationship between the two: larger models complete more tasks primarily because they plan better, not because they are better at avoiding danger. Scaling alone, in other words, is insufficient to instill safety awareness.
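The multiplicative relationship can be illustrated with a small calculation. This is a minimal sketch that assumes safe task completion is roughly the product of a planning success rate and a safety awareness rate; the function name and the pairing of the endpoint figures are illustrative assumptions, since the article reports ranges rather than per-model pairs.

```python
def safe_completion_rate(planning_rate: float, safety_awareness: float) -> float:
    """Assumed model: a task counts as safely completed only if the plan
    is valid AND the plan avoids the danger, so the two rates multiply."""
    for rate in (planning_rate, safety_awareness):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie in [0, 1]")
    return planning_rate * safety_awareness


# Hypothetical pairings built from the endpoints of the reported ranges.
small = safe_completion_rate(0.004, 0.38)  # weakest planning, lowest safety
large = safe_completion_rate(0.993, 0.57)  # strongest planning, highest safety

# Counterfactual: scale planning only, keep safety awareness at the low end.
planning_only = safe_completion_rate(0.993, 0.38)

print(f"small model:        {small:.1%}")
print(f"large model:        {large:.1%}")
print(f"planning gain only: {planning_only:.1%}")
```

Under these assumptions, most of the jump between the small and large model is already present in the planning-only counterfactual, which is the pattern the article describes: the gain comes from planning, while safety awareness caps the result.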

Proprietary Models Lead the Safety Charge

A notable divergence appears among proprietary models. Those with explicit reasoning capabilities achieved markedly higher safety awareness, between 71% and 81%. In contrast, proprietary models without reasoning and open-source reasoning models both scored below 57%, the upper end of the open-source range. This gap underscores the current limitations of open-source LLMs in robotic safety and points to comprehensive robotic safety benchmarking for LLMs as a promising direction for future research.

The Next Frontier: Safety Awareness

As frontier LLMs approach saturation in planning capability, the central challenge for deploying them in robotic systems shifts to improving safety awareness. The DESPITE benchmark provides a rigorous framework for evaluating this aspect and highlights the need for architectures and training methods aimed explicitly at preventing dangerous actions. Closing this gap is essential for unlocking the full potential of LLMs in safety-critical robotic applications.
