Black-Box Access is Insufficient for Rigorous AI Audits

A core part of Apollo’s vision is doing interpretability-based AI model evaluations. We want to both establish that an AI system shows a specific behavior and understand why. While current AI model evaluations are often black-box, i.e. they only investigate the model's input-output behavior, we think rigorous safety evaluations should also include white-box techniques: methods that look inside the model to understand why it arrives at its decisions. Such methods give greater confidence in claims about a model’s safety. 

We were delighted to play a part in the collaborative research for the paper “Black-Box Access is Insufficient for Rigorous AI Audits”. In a joint effort with researchers from MIT, Harvard University, Centre for the Governance of AI, Stanford University, University of Toronto, and Mila, we make the case that rigorous AI audits require more than black-box access. The paper has been accepted for publication at FAccT.

The paper first explains the limitations of black-box access: it is often hard or impossible to develop a general understanding of a system from behavioral data alone, black-box evaluations can produce unreliable results, and they offer only limited insight into how to address failures. 

Then, we lay out various advantages of white-box evaluations. The basic argument is that understanding the mechanism by which a model arrives at an output provides important additional information on top of behavioral observations. White-box evaluations are often more effective and efficient, e.g. in the case of (latent) adversarial attacks, where access to activations and gradients makes the search much cheaper than query-only methods. Furthermore, the increased understanding from white-box evaluations can enable more targeted solutions. Lastly, white-box evaluations can give stronger assurances: it is better to know that a model does the right thing for the right reason than to merely observe that it does the right thing without understanding why. 
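
To make the contrast concrete, here is a minimal, hypothetical sketch (not code from the paper) of the latent adversarial-attack setting mentioned above. It assumes PyTorch; the toy two-layer classifier, the layer chosen for the perturbation, and the hyperparameters are illustrative placeholders, not Apollo’s evaluation code. With black-box access an auditor can only query input-output behavior, whereas with white-box access they can read hidden activations and use gradients to search directly for a small latent perturbation that changes the model’s decision.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer classifier standing in for a model under audit.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # the auditor only optimizes the perturbation

x = torch.randn(1, 8)

# Black-box access: only input-output behavior is observable.
with torch.no_grad():
    baseline_pred = model(x).argmax(dim=-1)

# White-box access: read the hidden activations directly...
hidden = model[1](model[0](x))   # activations after the ReLU
target = 1 - baseline_pred       # try to flip the predicted class

# ...and use gradients to find a small latent perturbation that flips the output.
delta = torch.zeros_like(hidden, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=0.05)
for _ in range(200):
    logits = model[2](hidden + delta)
    # Push toward the flipped class while keeping the perturbation small.
    loss = nn.functional.cross_entropy(logits, target) + 0.1 * delta.norm()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    new_pred = model[2](hidden + delta).argmax(dim=-1)
print(f"prediction before: {baseline_pred.item()}, after latent perturbation: {new_pred.item()}")
print(f"perturbation L2 norm: {delta.norm().item():.3f}")
```

Gradient-guided searches like this typically need far fewer model queries than black-box attacks, which is one illustration of why white-box evaluations can be more effective and efficient.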

Finally, the paper discusses security concerns and challenges of white-box access. Understandably, AI companies are worried that an external actor with access to the weights could steal them and e.g. use them for malicious purposes or publish them, thereby reducing the AI companies’ value proposition. We think these are serious and important concerns. Nevertheless, comparable industries, e.g. cybersecurity, have solved similar problems. For example, privacy-preserving technologies might enable evaluators to run algorithms on model internals without being able to copy the weights, or evaluators might access the weights in a secure room (a SCIF) to prevent theft. 

Rigorously evaluating AI systems, especially the most dangerous ones, is challenging and likely requires white-box techniques. External evaluators from academia, industry, government, and NGOs will play a vital role in running these white-box evaluations as third parties. It is thus important to build and refine these pipelines early on.
