The UK AI Safety Summit - our recommendations

Apollo Research is an independent, third-party frontier AI model evaluation organisation. We focus on deceptive alignment, where models appear aligned but are not. We believe it is one of the most important components of many extreme AI risk scenarios. In our work, we aim to understand and detect the ability of advanced AI models to evade standard safety evaluations, exhibit strategic deception, and pursue misaligned goals.

We think AI model evaluation is a vital lever for mitigating risks from the most capable AI systems and are excited that one objective for the UK AI Safety Summit is “areas for potential collaboration on AI safety research, including evaluating model capabilities and the development of new standards to support governance”. 

What commitments we would like to see arise from the AI Safety Summit

AI is a global technology, both in how it is developed and in how its risks spread, so we hope the Summit will lead to commitments to explore international alignment on: 

  • comprehensive risk classification for AI systems based on harmful outcomes they can create, so that AI developers and users understand how their actions may affect the risk profile of the AI system

  • responsibilities of all actors across the AI value chain, including developers of narrow and general purpose AI, those building services on top of these systems, and evaluators

  • cooperation on compliance and enforcement activities where requirements, regulatory or otherwise, are not adhered to or where harms occur. This ensures that conscientious actors know they will not be out-competed by those who take shortcuts or act negligently

These commitments are important because they determine regulation and testing requirements for AI systems, enable innovation by clarifying compliance actions for AI developers and give confidence to consumers and the public. This will also help evaluators like Apollo Research focus our efforts on the models that pose the greatest risks.

Ensuring international AI safety commitments are technically sound

It is important for these commitments to be technically informed so that the ambitious course set by the Summit achieves its goals and enables safer AI innovation. We have therefore already proposed to the Frontier AI Taskforce a range of activities before and during the Summit which we believe will equip delegates with the appropriate knowledge, thus contributing to the Summit’s five objectives.

We think these are important topics both for the Summit this November and for international meetings on AI safety over the coming 1-2 years. We have considered how to convene and derive value from these sessions in greater depth than is outlined below, and would welcome further questions or interest from anyone convening events on AI safety.

Before the Summit:

1) Develop educational materials for policymakers on embedding safety across the life-cycle of an AI system: from prior-to-training risk assessments to staged release.

During the Summit:

2) Provide demonstrations of a range of significant AI safety risks and how evaluations can identify and help address them. 

3) Provide a presentation on how governance institutions can use compute benchmarks to effectively identify AI systems requiring the most thorough oversight, including evaluations.

1) Develop educational materials for policymakers on embedding safety across the life-cycle of an AI system

We think that a whole life-cycle approach is necessary for managing risks from AI systems, but there is limited technical consensus on how this should work in practice. This makes the topic challenging to present at the Summit or at other events for non-technical policy audiences.

We therefore propose that the Frontier AI Taskforce uses its standing as an independent organisation of technical experts to convene a range of perspectives, draw out key insights, and present them as educational materials ahead of the Summit. Sharing these materials with UK delegates and (if relevant) those of other states will better equip them to discuss the promise, and potential downsides, of such interventions and the actions required of different stakeholders.

To enable a focused conversation, we propose concentrating on two high-impact areas of debate: i) prior-to-training risk assessments, and ii) staged release, i.e. how available the AI system is to people and how connected it is to other systems.

What a roundtable session could look like

We suggest two interconnected roundtables to convene experts and produce findings which can be turned into educational material. 

We expect the Taskforce will be able to build consensus on some of these topics, but there is also value in mapping out areas where experts disagree and where further research is needed. The output should be a brief of no more than four pages summarising areas of consensus and debate, and any policy implications these might carry for the Summit and beyond.

Learning objectives for the educational materials:

  • The opportunity for preventing harms from highly capable AI systems through embedding evaluation across the life-cycle, especially upstream in prior-to-training risk assessments

  • The benefits and drawbacks of a staged approach to deployment and increasing an AI system’s available affordances (i.e. the environmental resources and opportunities for affecting the world that are available to a model)

Roundtable 1: how can prior-to-training risk assessment be a lever for avoiding AI risks? Recommended discussion points include:

  • which AI systems need the most rigorous risk assessments before their development begins, by whom, and whether further safeguards are needed downstream; 

  • which aspects of an AI system’s risk are determined by decisions about its training, and which are the priority risks to identify during a prior-to-training risk assessment;

  • options for implementing risk assessments, and what cost-to-benefit trade-offs each creates; in particular, what benefits and issues might arise from a coordinated international approach on risk assessments (or just coordination between like-minded countries)

Roundtable 2: what are the benefits and costs of staged release? Recommended discussion points include:

  • how staged release can improve AI safety by creating iterative reviews of capabilities and risks, both in theory and with regard to lessons learned from releases of advanced AI or other software (where relevant); 

  • what the priority risk points are in each stage of a model’s deployment and/or each change in its available affordances that will require risk assessment and testing; 

  • options for concretely implementing staged release and what cost-to-benefit trade-offs each creates, particularly:

    • the opportunity cost of placing ‘speed bumps’ in the way of access to beneficial AI versus avoiding harm; and

    • the benefits and issues in global governance which might arise from a coordinated approach, or indeed a diverging approach, between like-minded countries

2) Demonstrations of a range of significant AI safety risks and how evaluations can identify and help address them

We think that demonstrating how evaluations can identify and address a range of significant AI safety risks will help build a shared understanding of the risks posed by frontier AI and of the need for international collaboration. This will directly support the UK’s aim to evaluate frontier models within its own safety agenda and give direction as to what international frontier AI safety collaboration should focus on.

We think the Summit, combined with the portfolio focus on evaluation by the Frontier AI Taskforce, gives the UK a distinctive moment to show how evaluations can enable: i) systems for approving AI deployments; ii) benchmarking of how AI capabilities are changing, and consequently forward planning to strengthen safety regimes; and iii) the use of findings from both evaluations and benchmarking to improve AI safety research agendas.

What a demonstration session could look like:

  • Evaluation demonstrations: each evaluator should present their techniques and findings to date, and their implications for real-world risks both in today’s AI models and in models with even more advanced capabilities. This could include AI safety organisations as well as independent academic groups

  • Q&A with evaluators and experts: we think this should be done as a panel so that key messages can land at scale about how the field of evaluation is growing, to what extent it can help with major AI risks, and what the field’s constraints are. We would be especially excited for researchers who have written extensively on the limitations of existing benchmarks and evaluations to contribute to such sessions

Learning objectives for attendees:

  • Understand what experts have identified as extreme risks from AI systems. We think these are: strategic deception; bio-risks; and security risks to AI systems, including how guard rails can be circumvented, e.g. through ‘jail-breaking’ or fine-tuning

  • Understand how AI risks change as models increase in scale (e.g. compute, parameters, data), and by implication the expected trajectory of those risks as model size and training compute continue to increase

  • Understand the limitations of our current evaluations and how they may constrain safe deployment of AI

3) How governance institutions can use compute benchmarks to identify AI systems that require the most robust oversight 

Compute should be a factor in deciding which AI systems are likely to exceed current capabilities and therefore require the most robust evaluations. However, algorithmic progress continually reduces the amount of compute needed to obtain similar levels of performance. This means there is a large risk that over-anchoring to compute will cause some high-risk models to slip through the net.
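
As a purely illustrative sketch of this point, the interaction between a fixed compute threshold and algorithmic progress can be made concrete. All numbers below are assumptions chosen for illustration, not estimates: if algorithmic efficiency improves by some factor each year, a training run sitting just under the threshold soon matches the capability of a run well above it from the year the threshold was set.

```python
# Illustrative only: interaction between a fixed training-compute threshold
# and an assumed rate of algorithmic progress. All numbers are hypothetical.

THRESHOLD_FLOP = 1e26            # hypothetical oversight threshold (training FLOP)
EFFICIENCY_GAIN_PER_YEAR = 2.0   # assumed yearly factor by which algorithmic progress
                                 # cuts the compute needed for the same capability

def year0_equivalent(physical_flop: float, years_of_progress: float) -> float:
    """Capability-equivalent compute, expressed in terms of the year the threshold was set."""
    return physical_flop * EFFICIENCY_GAIN_PER_YEAR ** years_of_progress

just_under_threshold = 0.99 * THRESHOLD_FLOP
for year in range(5):
    equivalent = year0_equivalent(just_under_threshold, year)
    print(f"year {year}: {just_under_threshold:.2e} FLOP ~ {equivalent:.2e} FLOP in year-0 terms "
          f"({'above' if equivalent > THRESHOLD_FLOP else 'below'} the original threshold)")
```

Under these assumed numbers, a run that never triggers the threshold becomes, within a few years, capability-equivalent to a run many times above it when the threshold was set; this is why we see compute as one input into oversight decisions rather than the sole trigger.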

We think it is important to implement compute governance the right way on the first try. Otherwise, harms might still occur despite evaluations targeting what is thought to be the highest-risk AI. We believe that a panel presentation on this topic would be a valuable contribution to the Summit’s outcomes.

What a panel presentation could look like

  • Demonstration of scaling laws, illustrating in particular how compute corresponds with: i) performance differences between models; and ii) emergent capabilities (an illustrative sketch of the first relationship follows this list)

  • Expert panel discussion of:

    • the benefits and limitations of using compute measures to place oversight mechanisms on AI systems

    • practical considerations and feasibility of monitoring compute
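
For attendees less familiar with scaling laws, the demonstration might show something like the relationship sketched below: loss falling as a power law in training compute towards an irreducible floor. The functional form is the standard shape reported in the scaling-law literature, but the constants here are assumptions chosen purely for illustration.

```python
# Illustrative scaling-law sketch: loss modelled as an irreducible floor plus a
# power-law term in training compute. Constants are assumptions, not fitted values.

IRREDUCIBLE_LOSS = 1.7   # assumed loss floor that additional compute cannot remove
A, ALPHA = 850.0, 0.13   # assumed scale and exponent of the power-law term

def predicted_loss(train_flop: float) -> float:
    """Hypothetical loss as a function of training compute (FLOP)."""
    return IRREDUCIBLE_LOSS + A * train_flop ** (-ALPHA)

for flop in (1e22, 1e24, 1e26):
    print(f"{flop:.0e} FLOP -> predicted loss ~ {predicted_loss(flop):.2f}")
```

Emergent capabilities are harder to capture in a single smooth curve, which is one reason the expert discussion of compute’s benefits and limitations matters alongside the demonstration.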

Learning objectives for attendees:

  • Why compute is a key ingredient for higher-capability, and therefore higher-risk, models 

  • How monitoring compute used in training runs can be an effective lever for avoiding extreme AI risks

  • What other considerations, beyond compute, are relevant for predicting AI risk
