AI agents are increasingly capable of browsing the web and acting on behalf of users, a power that also opens new avenues for manipulation. These attacks, known as prompt injection, involve embedding instructions in external content to make an AI model deviate from its intended task. OpenAI researchers note that the most effective real-world attacks now resemble social engineering more than simple command overrides.
Early prompt injection attacks were straightforward, like altering Wikipedia entries to direct AI agents. As models grew more sophisticated, these attacks evolved. OpenAI shared an example of a 2025 attack where an email instructed an AI assistant to extract and process employee data, including personally identifiable information, with the assistant having 'full authorization' to retrieve and submit data to compliance systems. Such advanced attacks often bypass traditional input filtering, making defense a complex challenge.