Apollo is adopting Inspect

We at Apollo Research are adopting Inspect AI as our evals framework after a successful 3-month internal trial. This means that we will be deprecating our internal evals framework and building all new evals using Inspect.

Why we considered moving to an open-source framework

Over the last year, we developed a lightweight evals framework designed specifically for our use cases. Until recently, we built all our evals with this framework and ran tens of thousands of samples across many tasks and models. The decision to replace our framework with an open-source alternative wasn’t straightforward; we weighed the following arguments.

Pros:

  • Avoiding duplication of effort. Common challenges in building evaluation frameworks—like scaling tasks and creating secure agent sandboxes—are better solved collectively. Rather than developing parallel solutions, pooling our resources to create one robust implementation benefits the entire community.

  • Making it easier to share evals. In previous collaborations, we found it quite painful to port our evals into the framework that our partners were most familiar with. This added unnecessary overhead. We hope that the community converges on Inspect as the default sharing standard to reduce this friction.

  • Making it easier to share expertise. We speculate that a common framework might enable developers to collaborate more productively. For example, if we want to discuss evals approaches and implementations with other teams, it will be helpful to have a shared understanding of the relevant abstractions and the ability to compare code. 

  • Belief in open-source tooling. A robust evaluation ecosystem requires accessible, high-quality tools. By contributing to Inspect, we're helping build this foundation and making it easier for others to create effective evals.

Cons:

  • Less control over high-level decisions. Every framework involves design tradeoffs that can spark debate. Inspect is no exception—for instance, while our internal framework prioritizes type safety, Inspect currently doesn’t. These architectural differences will likely persist. In the worst case, such high-level differences could prevent or significantly slow down our evals development.

  • Less specific to our needs. Open-source frameworks need to implement features in a general way that caters to all users. In contrast, we were able to build our internal framework narrowly, which was more efficient for our needs. A more general library with broader scope may introduce complexity, creating development and debugging overhead compared to our more specialized internal framework.

  • Slower turnaround time for new features. Feature implementation will likely take longer in an external framework than in one we own, given the additional coordination required with maintainers. The maintainers of an external framework may also not be able to prioritize the features that are important to us.

  • Effort to migrate our existing eval implementations. It might take weeks of effort to migrate our existing eval implementations, and there will be new bugs to discover and fix. We therefore only want to migrate to a new framework if there is a clear benefit.

Why we chose Inspect

We think Inspect is the obvious candidate for a general evals framework that can serve as a Schelling point for the community. It supports a wide range of eval types and is already the default framework for many independent researchers.

Thus, we spent the last three months evaluating whether to use Inspect. We decided to switch for the following reasons:

  • Inspect supports our existing use cases. During our trial, we extensively built and ran evals at scale using Inspect (a minimal example of an Inspect task follows this list). While we ran into various issues, we’re convinced that these are not fundamental blockers and can be resolved in the future.

  • We think Inspect will continue to support our future use cases. The design of Inspect is robust and flexible enough to accommodate the evals we anticipate building. Furthermore, we expect Inspect to implement features that would take us months to build ourselves, for example secure agent sandboxes and tools for interpretability-based evals. Given that our evaluation needs overlap with AISI’s, we expect both parties to benefit.

  • Additional features. Inspect has a number of features that our framework doesn’t have, e.g. web browsing tools and an approval mode for reviewing agent tool calls. Though we don’t necessarily need all of these features now, their availability opens up more possibilities for us. We expect this trend to continue, since Inspect supports a large class of use cases and developers.

  • Confidence in the Inspect development team. We have been genuinely impressed with the Inspect development team, led by JJ Allaire at the UK AI Safety Institute. For example, we often got high-quality answers within minutes or hours of asking complex questions, and many features were implemented the morning after we voiced our needs. We felt like our disagreements on high-level design decisions were heard and taken seriously, and we look forward to working with the team to improve these areas. Overall, we are confident that the team aims to build a tool that is useful for the wider evals ecosystem in addition to their own use cases.
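
To make the trial concrete, here is a minimal sketch of what building and running an eval in Inspect looks like. The toy sample, question, and model name are illustrative placeholders rather than one of our actual evals; Task, Sample, generate(), match(), and eval() are Inspect primitives.

```python
# Minimal Inspect eval: a one-sample dataset, a single-turn solver,
# and a string-matching scorer. The sample and model are placeholders.
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def arithmetic():
    return Task(
        dataset=[Sample(input="What is 2 + 2?", target="4")],
        solver=generate(),  # one model call, no agent loop
        scorer=match(),     # compare the completion against the target
    )

if __name__ == "__main__":
    # CLI equivalent: inspect eval arithmetic.py --model openai/gpt-4o
    eval(arithmetic(), model="openai/gpt-4o")
```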

Initially, we were skeptical about switching due to the cost of migration and the risk of losing control over high-level decisions. However, over our trial period we became convinced that the team is highly competent, will prioritize our requirements, and will continue to maintain the software for at least a few years.

Ultimately, we made this decision because we believe that the upfront investment in adopting Inspect and migrating our evals will pay for itself, enabling us to build better evals faster. We think this effect will be strengthened if others in the community also decide to adopt Inspect.

Future contributions

We aim to help improve Inspect. As relatively heavy users, we want to improve the framework and give back to the community. For example, we are currently talking with the Inspect team about improvements to config management and type safety that would make building and running evals faster and more reliable. We will likely also contribute features or tutorials, with a primary focus on our core expertise: running frontier agent evals at scale.
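
As an illustration of what we mean by agent evals, below is a hedged sketch of a minimal agentic task in Inspect. The scenario, prompt, and target are hypothetical placeholders; basic_agent, the bash() tool, and the Docker sandbox option are Inspect features, though a production setup for running such evals at scale involves considerably more configuration.

```python
# Sketch of an agentic eval: the model gets a bash tool inside a Docker
# sandbox and iterates until it submits an answer. The scenario and
# target below are hypothetical placeholders.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash

@task
def file_hunt():
    return Task(
        dataset=[
            Sample(
                input="Find the flag hidden in a file under /data and submit it.",
                target="FLAG{example}",
            )
        ],
        solver=basic_agent(
            init=system_message("You are a careful, persistent agent."),
            tools=[bash(timeout=180)],  # each call executes in the sandbox
            max_attempts=3,             # allow a few incorrect submissions
        ),
        sandbox="docker",   # isolate tool execution from the host
        scorer=includes(),  # check that the flag appears in the answer
    )
```

Running tool calls inside a sandbox keeps agent behavior isolated from the host, which matters once many trajectories execute in parallel.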

We encourage others to consider Inspect. We believe that the more developers use the same framework, the more the benefits will compound. It seems clear that the community will want a lot of the same features. Sharing tasks will also become easier if everyone uses a common format. And everyone will be able to focus more on research and less on infrastructure.

If you’re considering Inspect but are unsure, feel free to reach out to us.
