Japanese tech giant Rakuten has deployed a novel AI guardrail system to detect and filter personally identifiable information (PII) from user messages, marking what is reportedly the first enterprise use of sparse autoencoders (SAEs) in a production safety system. The system is also more efficient and cost-effective on the task than LLM-as-a-judge baselines, while matching their accuracy.
The work, detailed in a new paper from AI firm Goodfire and Rakuten, tackles one of the biggest challenges in enterprise AI: protecting user privacy without being able to train models on real, sensitive user data.
The core problem is that AI models trained on synthetic data often struggle when faced with the messy, unpredictable nature of real-world user inputs. Goodfire’s research shows that a technique from the field of AI interpretability, called SAE probes, generalizes from synthetic to real data far more effectively than other methods.
Peeking Inside the AI's Brain
Instead of treating a model like a black box and just evaluating its final output, the new approach uses a smaller "sidecar" model (Llama 3.1 8B) and "probes" its internal activations—essentially peeking inside the model's reasoning process to see if it has identified a token as PII. SAEs take this a step further by first disentangling the model's complex internal state into a cleaner, more semantically meaningful set of features before the probe makes its classification.
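To make the mechanism concrete, here is a minimal sketch of how such a probe could work in practice. It is not Goodfire's implementation: the layer index, the SAE width, and the randomly initialized placeholder weights (standing in for a trained SAE encoder and a trained linear probe) are illustrative assumptions, and access to Llama 3.1 8B via Hugging Face is assumed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "meta-llama/Llama-3.1-8B"   # sidecar model named in the article
LAYER = 16                           # hypothetical layer to read activations from

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# Placeholder weights standing in for a trained SAE encoder and a trained
# linear probe; in a real system both would be learned, not random.
d_model, d_sae = model.config.hidden_size, 16384   # SAE width is hypothetical
W_enc = torch.randn(d_model, d_sae) * 0.01          # SAE encoder weights
b_enc = torch.zeros(d_sae)                          # SAE encoder bias
w_probe = torch.randn(d_sae) * 0.01                 # probe weights over SAE features
b_probe = torch.tensor(0.0)                         # probe bias

@torch.no_grad()
def pii_token_scores(text: str) -> list[tuple[str, float]]:
    """Return (token, PII probability) pairs for one user message."""
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0].float()          # (seq_len, d_model) activations
    feats = torch.relu(acts @ W_enc + b_enc)            # sparse, more interpretable features
    probs = torch.sigmoid(feats @ w_probe + b_probe)    # one PII score per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return list(zip(tokens, probs.tolist()))

# Tokens scoring above a threshold would be masked before the message
# ever reaches the downstream agent.
for tok, p in pii_token_scores("My card number is 4111 1111 1111 1111."):
    if p > 0.5:
        print(f"flagged {tok!r} with score {p:.2f}")
```

The key design point is that the expensive part, a forward pass through a small 8B model, happens once; the SAE encoding and the probe are just matrix multiplications on top of it.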
The results are striking. According to the paper, this "white-box" SAE PII detection method achieved a 96% F1 score on the task. In contrast, using the very same Llama 8B model as a "black-box" judge—simply prompting it to find PII—yielded a score of just 51%.
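For comparison, the "black-box" baseline amounts to prompting the same model and parsing its free-text answer. The sketch below reuses the `model` and `tokenizer` loaded above; the prompt wording is illustrative, not the one used in the paper.

```python
# Black-box LLM-as-a-judge baseline: ask the model directly and parse its reply.
PROMPT = (
    "List every piece of personally identifiable information (PII) in the "
    "message below, one item per line. If there is none, reply NONE.\n\n"
    "Message: {message}\nPII:"
)

@torch.no_grad()
def pii_judge(message: str) -> list[str]:
    """Return the PII items the model claims to find in its generated text."""
    inputs = tokenizer(PROMPT.format(message=message), return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    reply = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    if reply.strip().upper().startswith("NONE"):
        return []
    return [line.strip() for line in reply.splitlines() if line.strip()]

print(pii_judge("My card number is 4111 1111 1111 1111."))
```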
This approach also proved dramatically more efficient than common industry practices. The researchers found their probing method was 10x to 500x cheaper than using large frontier models like GPT-5 Mini or Claude Opus 4.1 as judges, while delivering comparable accuracy.
For Rakuten, which serves over two billion customers worldwide, this means a faster, cheaper, and more robust way to safeguard user data in its AI agents. For the broader industry, it’s a powerful demonstration that interpretability research isn't just academic; it can be used to build more effective and efficient AI safety systems that solve critical, real-world business problems.


