Research
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition
We introduce Attribution-based Parameter Decomposition (APD), a method that directly decomposes a neural network's parameters into components that (i) are faithful to the parameters of the original network, (ii) require a minimal number of components to process any input, and (iii) are maximally simple. While challenges remain in scaling APD to non-toy models, our results suggest solutions to several open problems in mechanistic interpretability.
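To make the objective concrete, here is a minimal, illustrative sketch of the kind of optimization APD performs, for a single linear layer rather than a full network. The attribution rule (squared norm of each component's contribution) and the simplicity penalty (nuclear norm) are simplifications chosen for brevity, not the exact quantities used in the paper.

```python
import torch

# Toy setting: decompose one linear layer's weights W into C parameter components
# P_1..P_C that sum to W, such that only a few components are needed per input.
torch.manual_seed(0)
d_in, d_out, C, k = 16, 8, 10, 2

W = torch.randn(d_out, d_in)                                # weights of the (frozen) target network
P = torch.nn.Parameter(torch.randn(C, d_out, d_in) * 0.1)   # learned parameter components
opt = torch.optim.Adam([P], lr=1e-2)

for step in range(1000):
    x = torch.randn(32, d_in)                    # batch of inputs
    y_target = x @ W.T                           # original network's behaviour

    # (i) Faithfulness: the components should sum to the original parameters.
    faithfulness = ((P.sum(0) - W) ** 2).mean()

    # (ii) Minimality: attribute each component on each input (here: squared norm of
    # its contribution to the output), then require the top-k components alone to
    # reproduce the original output.
    contrib = torch.einsum('bi,coi->bco', x, P)  # per-component outputs, (batch, C, d_out)
    attribution = contrib.pow(2).sum(-1)         # (batch, C)
    topk = attribution.topk(k, dim=1).indices
    mask = torch.zeros_like(attribution).scatter_(1, topk, 1.0)
    y_sparse = (contrib * mask.unsqueeze(-1)).sum(1)
    minimality = ((y_sparse - y_target) ** 2).mean()

    # (iii) Simplicity: encourage each component to be simple / low-rank,
    # illustrated here with a nuclear-norm penalty.
    simplicity = sum(torch.linalg.matrix_norm(P[c], ord='nuc') for c in range(C)) / C

    loss = faithfulness + minimality + 1e-3 * simplicity
    opt.zero_grad()
    loss.backward()
    opt.step()
```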
Precursory Capabilities: A Refinement to Pre-deployment Information Sharing and Tripwire Capabilities
In our new papers, ‘Pre-Deployment Information Sharing: A Zoning Taxonomy for Precursory Capabilities’ and ‘Towards Frontier Safety Policies Plus,’ we explore methods to better operationalise existing capability thresholds through what we call ‘precursory capabilities.’ We think of precursory capabilities as smaller, preliminary components of high-impact capabilities that an AI model needs in order to unlock more advanced capabilities. More specifically, we locate precursory capabilities within what we call a ‘zoning taxonomy’: a gradient in which each precursory component may bring us closer to unacceptable risk, or ‘red lines’.
Detecting Strategic Deception Using Linear Probes
Can you tell from its activations when an LLM is lying? Are simple methods good enough? We recently published a paper investigating whether linear probes can detect when Llama is being deceptive.
We built probes using simple training data (from the RepE paper) and simple techniques (logistic regression), then tested them in more complicated and realistic environments where Llama-3.3-70B responds deceptively. The probe fires far less on Alpaca responses unrelated to deception, indicating it may partly be a probe for “deception-related” text rather than “actually-deceptive” text.
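For illustration, the sketch below shows how such a linear probe can be trained, assuming activations have already been extracted and labelled honest vs. deceptive; the file names and hyperparameters are placeholders rather than the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical arrays: mean-pooled residual-stream activations from a fixed layer,
# labelled 1 for deceptive responses and 0 for honest ones.
X_train = np.load("train_activations.npy")   # shape (n_train, d_model)
y_train = np.load("train_labels.npy")        # shape (n_train,)
X_eval = np.load("eval_activations.npy")     # held-out, more realistic scenarios
y_eval = np.load("eval_labels.npy")

# A linear probe is just logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000, C=0.1)
probe.fit(X_train, y_train)

# Score the probe where it matters: out-of-distribution deceptive behaviour.
scores = probe.predict_proba(X_eval)[:, 1]
print("AUROC on held-out scenarios:", roc_auc_score(y_eval, scores))
```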
Scheming reasoning evaluations
Apollo Research evaluated frontier models for in-context scheming capabilities. We found that multiple frontier models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In one example, Opus-3 exfiltrates its “weights” and then lies about it to its developers.
We have a suite of six evaluations specifically designed to test for in-context scheming (where the goal and other relevant information are provided in context rather than via training). We found that several models are capable of in-context scheming. When we look at the models’ chain-of-thought, we find that all models very explicitly reason through their scheming plans and often use language like “sabotage, lying, manipulation…”
Towards Safety Cases For AI Scheming
Developers of frontier AI systems will face increasingly challenging decisions about whether their AI systems are safe enough to develop and deploy. One reason a system may be unsafe is that it engages in scheming: hiding its true capabilities and objectives and covertly pursuing misaligned goals. In our new report, "Towards evaluations-based safety cases for AI scheming", written in collaboration with researchers from the UK AI Safety Institute, METR, Redwood Research, and UC Berkeley, we sketch how developers of AI systems could construct a structured rationale, a 'safety case', that an AI system is unlikely to cause catastrophic outcomes through scheming.
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
An obstacle to reverse engineering neural networks is that many of the parameters inside a network are degenerate, obscuring its internal structure. We identify ways in which network parameters can be degenerate and introduce the Local Interaction Basis, a tractable technique for obtaining a representation that is invariant to some of these degeneracies. We test this technique on toy models and GPT-2 in a follow-up paper.
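The sketch below illustrates one ingredient of this idea under toy assumptions: finding hidden-layer directions that the rest of the network is insensitive to (via gradients) and rotating into a basis that discards them. It is not the paper's full algorithm, which also chooses the basis to sparsify interactions between layers.

```python
import torch

# Toy layer: hidden activations feed a ReLU + linear readout that ignores the last
# three hidden directions, so those directions are degenerate (functionally irrelevant).
torch.manual_seed(0)
d_in, d_hidden, d_out = 10, 8, 3
W1 = torch.randn(d_hidden, d_in)
W2 = torch.randn(d_out, d_hidden)
W2[:, 5:] = 0.0                              # readout never uses hidden dims 5..7

def rest_of_network(h):                      # everything downstream of the hidden layer
    return torch.relu(h) @ W2.T

X = torch.randn(256, d_in)
H = X @ W1.T                                 # hidden activations

# Accumulate the gradient covariance of the outputs w.r.t. the hidden activations.
# Directions with (near-)zero covariance cannot influence the output.
M = torch.zeros(d_hidden, d_hidden)
for h in H:
    h = h.detach().requires_grad_(True)
    for o in rest_of_network(h):
        (g,) = torch.autograd.grad(o, h, retain_graph=True)
        M += torch.outer(g, g)

eigvals, eigvecs = torch.linalg.eigh(M)      # eigenvalues in ascending order
keep = eigvals > 1e-6 * eigvals.max()        # drop (near-)null, i.e. degenerate, directions
basis = eigvecs[:, keep]                     # functionally relevant directions only

H_reduced = H @ basis                        # activations with degenerate directions removed
print(H.shape, "->", H_reduced.shape)        # typically (256, 8) -> (256, 5)
```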
Identifying functionally important features with end-to-end sparse dictionary learning
We propose end-to-end (e2e) sparse dictionary learning, a method for training sparse autoencoders (SAEs) that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
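The sketch below illustrates the e2e objective on a toy stand-in for a language model: the SAE's reconstruction is spliced in at a hook point, and the training loss is the KL divergence between the original and patched output distributions plus a sparsity penalty. The architecture, names, and coefficients are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_sae, vocab = 64, 512, 100

# Frozen stand-in for the original model, split at the hook point where the SAE is inserted.
pre_hook = torch.nn.Linear(vocab, d_model)       # maps tokens up to the hook point
post_hook = torch.nn.Linear(d_model, vocab)      # rest of the model, down to the logits
for p in list(pre_hook.parameters()) + list(post_hook.parameters()):
    p.requires_grad_(False)

# Sparse autoencoder: encode, ReLU, decode.
enc = torch.nn.Linear(d_model, d_sae)
dec = torch.nn.Linear(d_sae, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(1000):
    tokens = F.one_hot(torch.randint(0, vocab, (32,)), vocab).float()
    acts = pre_hook(tokens)                      # activations at the hook point

    feats = F.relu(enc(acts))                    # sparse feature activations
    recon = dec(feats)                           # reconstructed activations

    logits_orig = post_hook(acts)                # original model's output
    logits_sae = post_hook(recon)                # output with the SAE reconstruction spliced in

    # e2e loss: match the output *distribution*, not the activations themselves.
    kl = F.kl_div(F.log_softmax(logits_sae, -1), F.log_softmax(logits_orig, -1),
                  log_target=True, reduction="batchmean")
    sparsity = feats.abs().mean()                # encourage few simultaneously active features
    loss = kl + 1e-2 * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```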
A Causal Framework for AI Regulation and Auditing
This article outlines a framework for evaluating and auditing AI to provide assurance of responsible development and deployment, focusing on catastrophic risks. We argue that responsible AI development requires comprehensive auditing that is proportional to AI systems’ capabilities and available affordances. This framework offers recommendations toward that goal and may be useful in the design of AI auditing and governance regimes.
Our research on strategic deception presented at the UK’s AI Safety Summit
We investigate whether, under different degrees of pressure, GPT-4 can take illegal actions like insider trading and then lie about its actions. We find this behavior occurs consistently, and the model even doubles down when explicitly asked about the insider trade. This demo shows how, in pursuit of being helpful to humans, AI might engage in strategies that we do not endorse. This is why we aim to develop evaluations that tell us when AI models become capable of deceiving their overseers.