Announcing Apollo Research

Summary

  1. We are a new AI evals research organization called Apollo Research. We are based in London.

  2. We think that AI deception, i.e. when models appear aligned but aren’t, is a crucial component of many major catastrophic AI risk scenarios. Furthermore, we think detecting deception in real-world models is the biggest and most tractable step toward addressing this problem.

  3. Our agenda is split into interpretability and behavioral model evaluations.

    1. On the interpretability side, we are currently working on a new technical approach, as well as a solution to the potential problem of superposition.

    2. On the evals side, we are conceptually breaking down ‘deception’ into fundamental components and prerequisites from which we aim to build an informative evaluation suite.

  4. As an evals research org, we intend to use our research insights and tools to serve as a third-party external auditor for the frontier models of AGI labs, reducing the chance that deceptive AIs are developed and deployed.

  5. We also intend to engage in AI governance, e.g. by working with relevant policymakers and providing technical expertise to the drafting of auditing regulations.

  6. We have secured starter funding but still have a lot of room for more funding. Reach out if you’re interested in donating. We are fiscally sponsored by Rethink Priorities.

We’re interested in collaborating with people in technical AI, governance and more. We intend to hire once our funding gap is closed.

Agenda

We believe that AI deception – where a model outwardly seems aligned but is in fact misaligned, and conceals this fact from human oversight – is a crucial component of many catastrophic risk scenarios from AI. Furthermore, we think that detecting/measuring deception is an important step in many solutions. For example, good detection tools enable much faster iteration on empirical solution approaches, allow us to point lawmakers and the wider public to concrete failure modes, and let us provide evidence to AGI labs in case they are deploying, or thinking about deploying, deceptively aligned AI models.

Ultimately, we aim to develop a holistic and far-ranging suite of deception evals that includes behavioral tests, fine-tuning, and interpretability-based approaches. For now, we split the agenda into an interpretability research arm and a model evaluations arm.

On the interpretability side, we are currently working on a new unsupervised approach, as well as continuing work on an existing approach, to attack the problem of superposition. Early experiments have shown promising results, but it is too early to tell whether the techniques work or scale to larger models.
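
To give a rough sense of what we mean by superposition (the snippet below is only a toy numerical illustration of the phenomenon, not a description of our techniques): when a layer has fewer dimensions than the number of features it represents, sparse features can still be stored as overlapping, non-orthogonal directions, at the cost of small interference between them.

```python
# Toy illustration of superposition (illustrative only, not our method):
# with fewer dimensions than features, sparse features can be stored as
# overlapping directions, at the cost of small interference between them.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 100, 20           # more features than dimensions

# Random unit directions; with d_model << n_features they cannot all be
# orthogonal, so every pair of features interferes slightly.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only a handful of features active at once.
x = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
x[active] = 1.0

h = x @ W                               # compress into d_model dimensions
x_hat = h @ W.T                         # naive linear readout

inactive = np.setdiff1d(np.arange(n_features), active)
print("recovered active features:", np.round(x_hat[active], 2))
print("max interference on inactive features:",
      round(float(np.abs(x_hat[inactive]).max()), 2))
```

Interpretability methods that try to recover such overlapping features have to disentangle exactly this kind of interference.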

On the model evaluations side, we want to build a large and robust eval suite to test models for deceptive capabilities. Concretely, we intend to break deception down into its component concepts and capabilities. We will then design a large range of experiments and evaluations to measure both the component concepts and deception holistically.
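
As a purely hypothetical sketch of what such a decomposition could look like (the component and task names below are placeholders for illustration, not our actual taxonomy or eval suite), one could organize the suite around component capabilities, each scored by a set of behavioral tasks, and aggregate per-component scores into a holistic picture:

```python
# Purely illustrative sketch: one way to organize a deception eval suite
# around hypothesized component capabilities, each scored by a set of
# behavioral tasks. Component and task names here are placeholders.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Model = Callable[[str], str]            # anything that maps prompt -> response

@dataclass
class EvalTask:
    name: str
    run: Callable[[Model], float]       # returns a score in [0, 1]

@dataclass
class Component:
    name: str                           # e.g. "situational awareness"
    tasks: List[EvalTask] = field(default_factory=list)

def evaluate(model: Model, components: List[Component]) -> Dict[str, float]:
    """Average task scores per component; a holistic score would aggregate these."""
    results: Dict[str, float] = {}
    for comp in components:
        scores = [task.run(model) for task in comp.tasks]
        results[comp.name] = sum(scores) / len(scores) if scores else float("nan")
    return results

# Example with stub tasks and a stub model that just echoes its prompt.
components = [
    Component("situational awareness",
              [EvalTask("identify_eval_context", lambda m: 0.0)]),
    Component("goal misrepresentation",
              [EvalTask("stated_vs_revealed_goal", lambda m: 0.0)]),
]
print(evaluate(lambda prompt: prompt, components))
```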

Plans

As an evals research org, we intend to put our research into practice by engaging directly in auditing and governance efforts. This means we want to work with AGI labs to reduce the chance that they deploy deceptively aligned models. To reduce unnecessary risks for the AGI labs, we are working on strategies that mitigate the risk of IP theft and exploring other ways of cooperating.

On the governance side, we want to use our technical expertise in auditing, model evaluations and interpretability to inform the public and lawmakers. We are interested in providing demonstrations of failure modes and of the feasibility of using model evaluations to detect them. We think showcasing failures makes it easier for the ML community, lawmakers and the wider public to understand the potential dangers of deceptively aligned AIs. To assist lawmakers with technical expertise, we intend to demonstrate what model evaluations and auditing techniques can feasibly achieve.

We are interested in collaborating with people in governance and technical ML. If you’re (potentially) interested in collaborating with us, please reach out.

Theory of change

We hope to achieve a positive impact on multiple levels:

  1. Direct impact through research: If our research agenda works out, we will further the state of the art in interpretability and model evaluations. These results could then be used and extended by academics and labs.

  2. Direct impact through auditing: By working with AGI labs and auditing their models, we could help them determine whether their models are, or could become, deceptive, and thus reduce the chance of developing and deploying deceptive models.

  3. Indirect impact through demonstrations: We hope that demonstrations of failures, and of the feasibility of using model evaluations to detect them, provide a more accurate picture of what is currently possible. We think questions about AI risks are very important, and neither experts (including us) nor the wider public currently have an accurate understanding of what state-of-the-art models can and cannot do.

  4. Indirect impact through governance work: We intend to contribute technical expertise to AI governance where we can. This could include the creation of guidelines for model evaluations, conceptual clarifications of how AIs could be deceptive, suggestions for technical legislation and more.

Status

We received starter funding to get us off the ground, but we still have room for additional funding. We estimate that we can effectively use an additional $4-6M. If you’re interested in financially supporting us, please reach out. We are happy to address any questions and concerns. We are a fiscally sponsored project of Rethink Priorities and thus have 501(c)(3) status.
