AI agents are increasingly capable of browsing the web and acting on behalf of users, a power that also opens new avenues for manipulation. These attacks, known as prompt injection, involve embedding instructions in external content to make an AI model deviate from its intended task. OpenAI researchers note that the most effective real-world attacks now resemble social engineering more than simple command overrides.
Early prompt injection attacks were straightforward, such as editing Wikipedia entries to plant instructions for AI agents. As models grew more sophisticated, so did the attacks. OpenAI shared an example from 2025 in which an email instructed an AI assistant to extract and process employee data, including personally identifiable information, claiming the assistant had 'full authorization' to retrieve and submit the data to compliance systems. Such attacks often slip past traditional input filtering, making defense a complex challenge.
Social Engineering as a Defense Model
OpenAI's approach to securing AI agents against prompt injection draws parallels to managing social engineering risks in human interactions. Instead of focusing solely on identifying malicious inputs, the strategy emphasizes designing agents and systems that constrain the impact of manipulation even when an attack succeeds. This mirrors how customer service representatives operate under constraints that limit the damage of misuse, even though they may encounter deceptive customers.
This mindset informs OpenAI's countermeasures in ChatGPT. The company combines social engineering principles with traditional security methods like source-sink analysis, identifying how untrusted external content can combine with an agent's capabilities to perform dangerous actions.
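The source-sink framing can be made concrete: an agent's tool calls are checked for cases where untrusted content (a source) flows into a dangerous capability (a sink). The sketch below is purely illustrative; the names `ToolCall`, `UNTRUSTED_SOURCES`, and `DANGEROUS_SINKS` are assumptions for this example, not part of any OpenAI system.

```python
# Hypothetical sketch of source-sink analysis over an agent's tool calls.
# All names here are illustrative, not OpenAI's implementation.
from dataclasses import dataclass, field

UNTRUSTED_SOURCES = {"web_page", "email", "file_upload"}   # where tainted text enters
DANGEROUS_SINKS = {"send_email", "http_post", "run_code"}  # capabilities that can cause harm

@dataclass
class ToolCall:
    name: str
    args: dict
    # which sources influenced these arguments (taint provenance)
    provenance: set = field(default_factory=set)

def requires_confirmation(call: ToolCall) -> bool:
    """Flag calls where untrusted content reaches a dangerous capability."""
    tainted = bool(call.provenance & UNTRUSTED_SOURCES)
    return tainted and call.name in DANGEROUS_SINKS

# An HTTP POST whose arguments derive from a fetched web page gets flagged:
call = ToolCall("http_post", {"url": "https://attacker.example/collect"},
                provenance={"web_page"})
print(requires_confirmation(call))  # True
```

The useful property of this framing is that it does not require detecting the attack itself: even if the malicious instructions go unrecognized, the tainted-source-to-dangerous-sink path still triggers a confirmation step.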
Defending ChatGPT
A core security tenet for ChatGPT is that dangerous actions and transmissions of sensitive data must never occur silently. Attacks often aim to exfiltrate conversation data and send it to a malicious third party. While safety training frequently prevents these actions, OpenAI has developed mitigations like 'Safe URL' for cases where data might be transmitted externally.
Safe URL detects when an assistant might transmit learned information to a third party. In such instances, ChatGPT either prompts the user for confirmation or blocks the action, instructing the agent to find an alternative method. Similar safeguards are in place for navigation features and within sandboxed environments for ChatGPT Canvas and Apps, which monitor for unexpected communications and require user consent.
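A guard in this spirit might inspect an outbound URL before the agent fetches or posts to it, asking whether it targets a domain the user never referenced or smuggles conversation-derived data in its query string. The following is a minimal sketch under those assumptions; the function name and parameters are hypothetical and this is not OpenAI's actual Safe URL implementation.

```python
# Illustrative "Safe URL"-style check (hypothetical, not OpenAI's code).
from urllib.parse import urlparse, parse_qs

def url_needs_confirmation(url: str, user_domains: set,
                           sensitive_values: list) -> bool:
    """Return True if the agent should pause for user confirmation."""
    parsed = urlparse(url)
    # Confirm if the destination domain was never mentioned by the user.
    if parsed.hostname not in user_domains:
        return True
    # Confirm if query parameters carry sensitive conversation data.
    params = parse_qs(parsed.query)
    for values in params.values():
        if any(secret in v for secret in sensitive_values for v in values):
            return True
    return False

# A URL that leaks a session token to an unknown domain is flagged:
print(url_needs_confirmation(
    "https://evil.example/log?d=token-abc123",
    user_domains={"docs.example.com"},
    sensitive_values=["token-abc123"],
))  # True
```

On a flagged URL, the system would either ask the user to approve the request or block it and instruct the agent to find another way to complete the task, matching the confirm-or-block behavior described above.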
Safe interaction with the adversarial external world is necessary for fully autonomous agents.
OpenAI continues to research AI model vulnerabilities to social engineering and refine its defenses, integrating these findings into both its application security architectures and AI model training. The company suggests that as AI models become more intelligent, they should ideally resist social engineering better than humans, though practical implementation varies by application.


