Our research on strategic deception presented at the UK’s AI Safety Summit

Read the full technical report, which was accepted for an oral presentation at this year’s ICLR LM agents workshop.

Since our inception, our evaluations team has focused on conceptual and empirical work on strategic deception and deceptive alignment. We are happy to share that results from this work were presented at the UK’s AI Safety Summit on November 1st 2023 to leaders in government, civil society and AI labs.

We investigate whether, under different degrees of pressure, GPT-4 can take illegal actions like insider trading and then lie about its actions. We find this behavior occurs consistently, and the model even doubles down when explicitly asked about the insider trade.

This demo shows how, in pursuit of being helpful to humans, AI might engage in strategies that we do not endorse. This is why we aim to develop evaluations that tell us when AI models become capable of deceiving their overseers. This would help ensure that advanced models which might game safety evaluations are neither developed nor deployed.

This testing was done in a simulated and sandboxed environment, so no actions were executed in the real world. But the takeaway is that increasingly autonomous and capable AIs that deceive human overseers could lead to loss of human control.

Apollo Research was also announced as a partner of the UK’s Frontier AI Taskforce. We look forward to ongoing collaboration to detect the extreme risks from AI systems so that governments and AI labs can take technically informed measures against these potential harms.

Previous
Previous

A Causal Framework for AI Regulation and Auditing