Research
Scheming reasoning evaluations
Apollo Research evaluated frontier models for in-context scheming capabilities. We found that multiple frontier models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In this example, Opus-3 exfiltrates its “weights” and then lies about it to its developers.
We have a suite of six evaluations specifically designed to test for in-context scheming (where the goal and other relevant information are provided in context rather than training). We found that several models are capable of in-context scheming. When we look at the model’s chain-of-thought, we find that all models very explicitly reason through their scheming plans and often use language like “sabotage, lying, manipulation, …”
Towards Safety Cases For AI Scheming
Developers of frontier AI systems will face increasingly challenging decisions about whether their AI systems are safe enough to develop and deploy. One reason why systems may not be safe is if they engage in scheming, where AI systems hide their true capabilities and objectives and covertly pursue misaligned goals. In our new report, "Towards evaluations-based safety cases for AI scheming", written in collaboration with researchers from the UK AI Safety Institute, METR, Redwood Research, and UC Berkeley, we sketch how developers of AI systems could make a structured rationale – 'a safety case' – that an AI system is unlikely to cause catastrophic outcomes through scheming.
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
An obstacle to reverse engineering neural networks is that many of the parameters inside a network are degenerate, obfuscating internal structure. We identify ways that network parameters can be degenerate and introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to some degeneracies. We test this technique in a follow-up paper on toy models and GPT2.
Identifying functionally important features with end-to-end sparse dictionary learning
We propose end-to-end (e2e) sparse dictionary learning, a method for training sparse autoencoders (SAEs) that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
A Causal Framework for AI Regulation and Auditing
This article outlines a framework for evaluating and auditing AI to provide assurance of responsible development and deployment, focusing on catastrophic risks. We argue that responsible AI development requires comprehensive auditing that is proportional to AI systems’ capabilities and available affordances. This framework offers recommendations toward that goal and may be useful in the design of AI auditing and governance regimes.
Our research on strategic deception presented at the UK’s AI Safety Summit
We investigate whether, under different degrees of pressure, GPT-4 can take illegal actions like insider trading and then lie about its actions. We find this behavior occurs consistently, and the model even doubles down when explicitly asked about the insider trade. This demo shows how, in pursuit of being helpful to humans, AI might engage in strategies that we do not endorse. This is why we aim to develop evaluations that tell us when AI models become capable of deceiving their overseers.