Understanding the inner workings of neural networks is becoming paramount as AI systems are deployed in critical applications, necessitating robust methods for debugging and auditing. Mechanistic interpretability aims to achieve this by identifying specific subnetworks, or "circuits," responsible for particular model behaviors. However, existing approaches to circuit discovery have been criticized for their fragility, with discovered circuits often being highly dependent on the specific dataset used for their identification and failing to generalize to new data. This raises concerns that these methods might be capturing dataset-specific artifacts rather than genuine conceptual understanding within the model. Addressing this challenge, a new framework called Certified Circuits has been introduced, which offers provable stability guarantees for circuit discovery.
The Certified Circuits Framework
The core innovation of Certified Circuits is that it wraps any existing black-box circuit discovery algorithm and, through randomized data subsampling, guarantees that the decision to include a given neuron in a circuit is invariant to bounded edit-distance perturbations of the concept dataset. In essence, if a neuron's inclusion fluctuates with minor changes to the input data, Certified Circuits flags it as unstable and abstains from including it. The result is circuits that are not only more reliable but also more compact and accurate.
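To make the idea concrete, here is a minimal sketch of what such a wrapper could look like. The names and parameters (certified_circuit, discover, n_subsamples, stability_threshold) are illustrative assumptions rather than the framework's actual API; the sketch simply re-runs an arbitrary discovery routine on random subsamples of the concept dataset and keeps only the neurons whose inclusion is stable across runs, abstaining on the rest.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Sequence, Set


def certified_circuit(
    discover: Callable[[Sequence], Set[Hashable]],  # black-box circuit discovery routine
    dataset: Sequence,                              # concept dataset
    n_subsamples: int = 50,                         # number of random subsamples to draw
    subsample_frac: float = 0.8,                    # fraction of the dataset per subsample
    stability_threshold: float = 0.9,               # required inclusion frequency to keep a neuron
    seed: int = 0,
) -> Set[Hashable]:
    """Hypothetical wrapper: certify neuron inclusion via randomized subsampling.

    A neuron is kept only if the black-box routine selects it in at least
    `stability_threshold` of the subsampled runs; otherwise the wrapper abstains.
    """
    rng = random.Random(seed)
    k = max(1, int(subsample_frac * len(dataset)))
    inclusion_counts: Counter = Counter()

    for _ in range(n_subsamples):
        # Each subsample acts as a small perturbation of the concept dataset.
        sample = rng.sample(list(dataset), k)
        for neuron in discover(sample):
            inclusion_counts[neuron] += 1

    # Keep only neurons whose inclusion is (approximately) invariant across subsamples.
    return {
        neuron
        for neuron, count in inclusion_counts.items()
        if count / n_subsamples >= stability_threshold
    }
```

In a sketch like this, n_subsamples trades off certification confidence against the cost of repeated discovery runs, while stability_threshold controls how aggressively the wrapper abstains from including borderline neurons.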