An Opinionated Evals Reading List

We have greatly benefited from reading others’ papers to improve our research ideas, conceptual understanding, and concrete evaluations. Thus, the Apollo Research evaluations team compiled a list of what we felt were important evaluation-related papers. We likely missed some relevant papers, and our recommendations reflect our personal opinions.

Our favorite papers

  • Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)

    • Contains detailed descriptions of multiple LM agent evals across four categories. Also explores new methodologies for estimating models’ success probabilities on eval tasks.

    • We think it is the best “all-around” evals paper, i.e., it gives the best understanding of what frontier LM agent evals look like.

    • We tested the calibration of their new methodologies in practice in Hojmark et al., 2024, and found that they are not well-calibrated (disclosure: Apollo involvement).

  • Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)

    • They show that a low-rank decomposition of models’ capabilities can be recovered from observed benchmark performance, and that these capability factors can be used to predict the performance of larger models in the same family (a toy sketch of this idea appears after this list).

    • Marius: I think this is the most exciting “science of evals” paper to date. It made me more optimistic about predicting the performance of future models on individual tasks.

  • The Llama 3 Herd of Models (Meta, 2024)

    • Describes the training procedure of the Llama 3.1 family in detail

    • We think this is the most detailed description to date of how state-of-the-art LLMs are trained, and it provides helpful background context for any kind of evals work.

  • Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022) 

    • Shows how to use LLMs to automatically create large evals datasets, producing 154 benchmarks on different topics. We think this idea has been highly influential and thus highlight the paper (a minimal sketch of such a generation pipeline appears after this list).

    • The original paper used Claude-0.5 to generate the datasets, so the resulting data is not very high quality. The paper’s methodology section is also written more confusingly than it needs to be.

    • For an improved methodology and pipeline for model-written evals, see Dev et al., 2024 or ARENA chapter 3.2 (disclosure: Apollo involvement). 

  • Evaluating Language-Model Agents on Realistic Autonomous Tasks (Kinniment et al., 2023)

    • Introduces LM agent evals for model autonomy. It is the first paper to rigorously evaluate LM agents for risks related to loss of control, which is why we highlight it.

    • We recommend reading the Appendix as a starting point for understanding agent-based evaluations. 

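To make the low-rank idea from Ruan et al., 2024 more concrete, here is a toy sketch (our own simplification, not the authors’ method or code): apply PCA to a matrix of benchmark scores from smaller models in a family, fit the leading capability component against log training compute, and extrapolate to a larger model. All scores and compute numbers below are made up, and the linear read-out is a simplification of the paper’s approach.

```python
# Toy sketch of observational scaling laws (our simplification, not Ruan et al.'s code).
# Rows = models from one family, columns = benchmark accuracies; compute is in FLOPs.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

benchmark_scores = np.array([  # hypothetical accuracies on 4 benchmarks
    [0.31, 0.22, 0.45, 0.18],  # small model
    [0.48, 0.35, 0.58, 0.29],  # medium model
    [0.62, 0.51, 0.70, 0.43],  # large model
])
log_compute = np.log10([1e21, 1e22, 1e23])  # hypothetical training FLOPs

# 1. Low-rank decomposition: a single principal component captures most of the variance.
pca = PCA(n_components=1)
capability = pca.fit_transform(benchmark_scores)  # shape (n_models, 1)

# 2. The capability component scales smoothly with log compute ...
reg = LinearRegression().fit(log_compute.reshape(-1, 1), capability)

# 3. ... so we can extrapolate it to a bigger model and map back to benchmark space.
predicted_capability = reg.predict(np.array([[np.log10(1e24)]]))
predicted_scores = pca.inverse_transform(predicted_capability)
print(predicted_scores)  # predicted benchmark accuracies for the hypothetical bigger model
```
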
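The sketch below illustrates the model-written evals idea from Perez et al., 2022 in its simplest form (a hypothetical pipeline, not the paper’s actual code or prompts): one model call generates candidate questions for a behavior of interest, and a second call filters them with a judge prompt. The model name, prompts, and filtering criterion are all placeholders.

```python
# Hypothetical sketch of a model-written evals pipeline (not Perez et al.'s actual pipeline).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def generate_questions(behavior: str, n: int) -> list[str]:
    """Ask the model to write yes/no questions probing a behavior."""
    prompt = (
        f"Write {n} yes/no questions that test whether an AI assistant exhibits "
        f"the following behavior: {behavior}. Output one question per line."
    )
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

def keep_question(question: str, behavior: str) -> bool:
    """Use the model as a judge to filter out low-quality or off-topic questions."""
    prompt = (
        f"Does the question below clearly test for '{behavior}'? Answer YES or NO.\n\n{question}"
    )
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

behavior = "sycophancy (agreeing with the user regardless of correctness)"
dataset = [q for q in generate_questions(behavior, 20) if keep_question(q, behavior)]
print(f"Kept {len(dataset)} questions")
```
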
Other evals-related publications

LM agents

Core:

Other:

Benchmarks

Core:

Other:

Science of evals

Core:

Other:

Software

Core:

  • Inspect

    • Open-source evals library designed and maintained by the UK AISI, spearheaded by JJ Allaire, who intends to develop and support the framework for many years.

    • Supports a wide variety of eval types, including multiple-choice benchmarks and LM agent settings (a minimal usage sketch appears after this list).

  • Vivaria

    • METR’s open-source evals tool for LM agents

    • Especially optimized for LM agent evals and the METR Task Standard

  • Aider 

    • Probably the most widely used open-source coding assistant

    • We recommend using it to speed up your coding

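For orientation, here is a minimal Inspect task, based on our understanding of the inspect_ai API at the time of writing (exact names may differ between versions; the sample and model name are placeholders):

```python
# Minimal Inspect task (our sketch; consult the Inspect docs for the current API).
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def tiny_arithmetic():
    return Task(
        dataset=[Sample(input="What is 12 * 12? Answer with the number only.", target="144")],
        solver=generate(),   # single generation step; agent scaffolds use richer solvers
        scorer=includes(),   # scores correct if the target string appears in the output
    )

# Placeholder model name; Inspect supports many providers via "provider/model" strings.
eval(tiny_arithmetic(), model="openai/gpt-4o-mini")
```
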
Other:

Miscellaneous

Core:

Other:

Related papers from other fields

Red teaming

Core:

Other:

Scalable oversight

Core:

Other:

Scaling laws & emergent behaviors

Core:

Other:

Science tutorials

Core:

  • Research as a Stochastic Decision Process (Steinhardt)

    • Argues that you should do experiments in the order that maximizes information gained.

    • We use this principle all the time and think it’s very important (a toy illustration appears after this list).

  • Tips for Empirical Alignment Research (Ethan Perez, 2024)

    • Detailed description of what success in empirical alignment research can look like

    • We think it’s a great resource and aligns well with our own approach.

  • You and Your Research (Hamming, 1986)

    • Famous classic by Hamming. “What are the important problems of your field? And why are you not working on them?”

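As a toy illustration of the Steinhardt principle (our own heuristic, not taken from the post): rank candidate experiments by how likely they are to change your plans per hour spent, and run the highest-ranked ones first.

```python
# Toy illustration of "do the most informative experiments first" (our heuristic,
# not taken from Steinhardt's post): prioritize experiments that are cheap and
# likely to invalidate the current plan, since those yield the most information per hour.
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    hours: float   # estimated cost
    p_kill: float  # estimated probability the result invalidates the project idea

    @property
    def info_per_hour(self) -> float:
        return self.p_kill / self.hours

experiments = [  # hypothetical project plan
    Experiment("polish plots for writeup", hours=6, p_kill=0.01),
    Experiment("rerun baseline with correct prompt format", hours=1, p_kill=0.30),
    Experiment("check whether the effect survives on a second model", hours=3, p_kill=0.40),
]

for e in sorted(experiments, key=lambda e: e.info_per_hour, reverse=True):
    print(f"{e.name}: {e.info_per_hour:.2f} 'plan-changing probability' per hour")
```
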
Other:

LLM capabilities

Core:

Other:

RLHF

Core:

Other:

Supervised Finetuning/Training & Prompting

Core:

Other:

Fairness, bias, and accountability

AI Governance

Core: 

Other:

Contributions

The first draft of the list was based on a combination of various other reading lists that Marius Hobbhahn and Jérémy Scheurer had previously written. Marius wrote most of the final draft with detailed input from Jérémy and high-level input from Mikita Balesni, Rusheb Shah, and Alex Meinke. 
