Security at Apollo Research
The primary goal of Apollo Research is to create tools and research that reduce catastrophic risks posed by advanced Artificial Intelligence (AI). We aim to use our findings to audit state-of-the-art models for deception and other dangerous behaviors. For more details about our organization, see our announcement post.
Our plans have some downside risks that we aim to mitigate through security measures. Here, we present some of our norms and processes related to security and publishing at Apollo Research. We are an early-stage organization and the field of AI auditing is still young, so we expect this policy to be refined and improved over time. We’re interested in feedback.
Collaborations and Auditing
When dealing with the valuable IP of AI labs, our security standards must be very high. A leak of IP would not only threaten the existence of Apollo Research; it could severely hurt the interests of the lab we’re collaborating with and potentially cause serious harm if certain information or technology found its way into the wrong hands.
To manage this risk, we have a range of strategies in place:
All parties involved in the collaboration should be covered by an NDA.
Following the principle of least privilege, we aim to keep sensitive information to as few people as is practical. For example, we build internal APIs that obscure details of the models we are evaluating and of the companies that produced them, so that an internal researcher or engineer sees only “model 1” rather than “<lab>’s model” (see the sketch after this list).
In cases where we require (partial) access to model internals, e.g. for interpretability evaluations, we are willing to explore solutions such as:
Undertaking evaluations in air-gapped, on-site “data rooms” at the labs under permanent supervision.
Secure multi-party computation, where the auditor does not have direct access to model weights, and where the AI lab may not have direct access to the evaluation program.
Sharing some of our evals with the lab so they can run them in-house (with oversight or other means of validation if relevant).
Sharing the parts of our tools that interface with the most valuable IP and allowing developers from the collaborating lab to run them internally.
Other methods in the evolving area of structured transparency (see here and here).
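To make the least-privilege point above more concrete, here is a minimal Python sketch of the kind of internal API we have in mind: evaluation code addresses models only by anonymous handles such as “model-1”, while the mapping to real endpoints and credentials is held by a small infrastructure layer. The class names, registry, and environment variables are illustrative assumptions rather than a description of our actual tooling.

```python
import os
from dataclasses import dataclass

# Hypothetical illustration: names, paths, and the EvalClient interface are
# assumptions, not Apollo Research's actual tooling.

# The mapping from anonymous handles to real model endpoints lives only in
# configuration controlled by a small, cleared group; researchers never see it.
_MODEL_REGISTRY = {
    "model-1": {"endpoint": os.environ.get("MODEL_1_ENDPOINT", ""), "api_key_env": "MODEL_1_KEY"},
    "model-2": {"endpoint": os.environ.get("MODEL_2_ENDPOINT", ""), "api_key_env": "MODEL_2_KEY"},
}


@dataclass
class Completion:
    """Minimal result object returned to evaluation code."""
    model_handle: str  # anonymous handle, e.g. "model-1"
    text: str


class EvalClient:
    """Internal API that exposes models only by anonymous handle.

    Evaluation code calls generate("...") and never learns which lab
    produced the model or where it is hosted.
    """

    def __init__(self, handle: str):
        if handle not in _MODEL_REGISTRY:
            raise KeyError(f"Unknown model handle: {handle}")
        self._handle = handle
        self._config = _MODEL_REGISTRY[handle]

    def generate(self, prompt: str, max_tokens: int = 256) -> Completion:
        # In a real system this would forward the request to the lab's
        # endpoint using credentials held by the infrastructure layer.
        # Here we return a stub so the sketch stays self-contained.
        _ = (self._config, max_tokens)
        return Completion(model_handle=self._handle, text="<model output>")


if __name__ == "__main__":
    client = EvalClient("model-1")
    result = client.generate("Describe your capabilities.")
    print(result.model_handle, result.text)
```

The design intent is that evaluation results can be discussed and reported per handle, so most researchers never need to know which lab’s model produced them.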
The field of AI auditing and model evaluations is still young. We hope that multiple innovations in software and processes will make it easier and safer for all involved parties to engage in auditing practices. Until processes and norms are more evolved, we are flexible and looking to cooperate with labs to find reasonable temporary solutions.
Gain-of-Function Research
We think that, due to economic and social pressures, labs will continue to improve the capabilities of state-of-the-art models. This will often lead to the emergence of new, sometimes dangerous capabilities. Deploying models to the public increases the number of ways that a potentially catastrophic outcome could ensue. To inform decisions on whether or not to distribute a model to the public, we think it is sometimes necessary to assess yet-unseen dangerous capabilities in strongly controlled environments. This is also known as gain-of-function research. This research can alert labs, the public, and regulators to potential risks from advanced models, as well as provide a valuable object of research into dangerous capabilities.
Most immediately, this work could reduce the chance of widely deploying a model that would result in accidental harm or misuse. In the longer term, it could reduce the chance of a model capable of causing harm being developed in the first place.
Whether any given gain-of-function research project passes a cost-benefit analysis depends heavily on the situation and should be decided on a case-by-case basis. In particular, gain-of-function research should avoid experiments with sufficiently advanced models that run the risk of the model autonomously exfiltrating itself from its controlled environment and causing severe harm. It is paramount to keep the risk of negative side effects as low as possible and to be very careful when engaging in this type of research. Therefore:
We will carefully consider the benefits and risks before running any experiment and only run them if the benefits clearly outweigh the harms.
In many cases, we intend to break down dangerous behavior into separate, smaller components that are much less dangerous in isolation. Analogously, many chemicals are harmless in isolation but dangerous when combined.
We don’t actively train models for catastrophically dangerous behavior, i.e. we would not train or fine-tune a very capable model to try to take over the world to “see how far it gets”.
In high-stakes scenarios, we will consult with external AI and security experts, e.g. in governments and other organizations, to make sure we are not making unilateral decisions.
Publishing
There are clear benefits to publishing the research and tools we build:
Publication accelerates overall research progress in the fields of model behavioral evaluations and interpretability.
Publishing makes it easier for a lab building advanced AI systems to apply our insights and tools to its own models and thereby increase their safety.
Publishing enables criticism from academics and other labs.
However, there are some potential downsides to publishing our work:
Given that our research focuses partly on dangerous capabilities, we may unveil insights that could be exploited by bad actors.
We may gain fundamental insights about how AIs work which could be used to increase the pace of AI progress without increasing safety proportionally.
To balance benefits and risks, we engage in differential publishing. Under this policy, we disseminate information with varying degrees of openness, depending on the project's security needs and our assessment of the benefits of sharing the work. The audience could range from a small, select group within Apollo Research to the broader public. Since reversing a publication is nearly impossible, we will err on the side of caution in unclear cases. Concretely, we will internally evaluate the safety implications of any given project at every stage in its lifecycle, and potentially increase the number of actors with access in smaller steps rather than all at once.
Privacy Levels
Drawing from the project privacy levels outlined here, and informed by our own experiences with infohazard policies in practice, we adhere to three tiers of privacy for all projects.
Public: Can be shared with anyone.
Private: Shared with all Apollo Research employees and potentially select external individuals or groups.
Secret: Shared only with select individuals (inside or outside of Apollo Research).
The privacy level of each project is always transparent to all parties with knowledge of and access to the project.
The allocation of privacy level for each project is by default decided by the director of the organization (currently Marius), in consultation with the person(s) who proposed the project and (optionally) a member of the security team. In the future, we may decide to appoint trusted external individuals to have a say on the privacy level of projects.
By default, we assign every new project the Secret level, and the burden of proof lies with lowering a project’s privacy level. Given the large benefits of public research, we will make an active effort to publish everything we deem safe to publish and to provide explanations and helpful demos for different audiences.
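As a small illustration of how these tiers could be tracked internally, the sketch below encodes the three levels and the Secret-by-default rule as a simple Python data structure; the class and field names are hypothetical, not a description of our actual project-management setup.

```python
from dataclasses import dataclass, field
from enum import Enum

# Hypothetical illustration of the three privacy tiers described above.


class PrivacyLevel(Enum):
    PUBLIC = "public"    # can be shared with anyone
    PRIVATE = "private"  # all employees plus select external individuals or groups
    SECRET = "secret"    # only named individuals, inside or outside the org


@dataclass
class Project:
    name: str
    # New projects default to Secret; lowering the level requires an
    # explicit decision, and the burden of proof is on lowering it.
    privacy_level: PrivacyLevel = PrivacyLevel.SECRET
    # Everyone listed here can see the project and its privacy level.
    people_with_access: list[str] = field(default_factory=list)


if __name__ == "__main__":
    project = Project(name="example-eval-suite", people_with_access=["alice", "bob"])
    assert project.privacy_level is PrivacyLevel.SECRET
    print(project)
```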
Information security
We take infosec very seriously and implement practices that meet a high standard for the industry. For obvious reasons, we don’t discuss our infosec practices in public (if there are particular infosec-related insights that you think we might want to know about, please reach out).
Summary
We think auditing, when done right, is strongly positive for society, but it could cause harm if executed poorly. Therefore, we aim to continuously refine and improve our security and publishing policies to maximize the benefits while minimizing the risks.
Credit assignment
Dan Braun and Marius Hobbhahn drafted the post and iterated on it. Lee Sharkey and Jérémy Scheurer provided suggestions and feedback. We’d like to thank Jeffrey Ladish for external feedback and discussion.