Science

We conduct fundamental research into the science of scheming and its potential mitigations. We also develop and run pre-deployment evaluations of frontier AI systems.

Our key papers

Stress Testing Deliberative Alignment for Anti-Scheming Training

September 17, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

July 15, 2025

Frontier Models are Capable of In-Context Scheming

December 5, 2024

Towards Safety Cases For AI Scheming

October 31, 2024

Large Language Models can Strategically Deceive their Users when Put Under Pressure

November 9, 2023

A Causal Framework for AI Regulation and Auditing

November 8, 2023
All posts
Science of Scheming

Metagaming matters for training, evaluation, and oversight

March 16, 2026

Science of Scheming

Stress Testing Deliberative Alignment for Anti-Scheming Training

September 17, 2025

Evaluations

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

July 15, 2025

Evaluations, Notes

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

July 3, 2025

Evaluations

More Capable Models Are Better At In-Context Scheming

June 19, 2025

Evaluations, Notes

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

March 17, 2025

Evaluations

Forecasting Frontier Language Model Agent Capabilities

February 24, 2025

Interpretability

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

February 11, 2025

Interpretability

Detecting Strategic Deception Using Linear Probes

February 6, 2025

Evaluations

Demo Example – Scheming Reasoning Evaluations

January 23, 2025

Evaluations

Frontier Models are Capable of In-Context Scheming

December 5, 2024

Evaluations

The Evals Gap

November 11, 2024

Evaluations

Towards Safety Cases For AI Scheming

October 31, 2024

Evaluations

An Opinionated Evals Reading List

August 15, 2024

Interpretability

Identifying functionally important features with end-to-end sparse dictionary learning

May 30, 2024

Interpretability

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

May 30, 2024

Evaluations

Black-Box Access is Insufficient for Rigorous AI Audits

April 4, 2024

Evaluations

We Need A ‘Science of Evals’

January 22, 2024

Evaluations

A Starter Guide For Evals

January 8, 2024

Evaluations

Large Language Models can Strategically Deceive their Users when Put Under Pressure

November 9, 2023

Evaluations

A Causal Framework for AI Regulation and Auditing

November 8, 2023

Evaluations

Our research on strategic deception presented at the UK’s AI Safety Summit

November 5, 2023

Evaluations

Understanding strategic deception and deceptive alignment

September 15, 2023