An Opinionated Evals Reading List
We have greatly benefited from reading others’ papers to improve our research ideas, conceptual understanding, and concrete evaluations. Thus, the Apollo Research evaluations team compiled a list of what we felt were important evaluation-related papers. We likely missed some relevant papers, and our recommendations reflect our personal opinions.
Our favorite papers
Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
Contains detailed descriptions of multiple LM agent evals across four categories. Also explores new methodologies for estimating evals success probabilities.
We think it is the best “all around” evals paper, i.e. the one that gives the best understanding of what frontier LM agent evals look like.
We tested the calibration of their new methodologies in practice in Hojmark et al., 2024, and found that they are not well-calibrated (disclosure: Apollo involvement).
Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)
They show that models’ observed benchmark performances admit a low-rank decomposition into a few capability factors, which can be used to predict the performance of larger models in the same family (a minimal sketch of the idea follows below).
Marius: I think this is the most exciting “science of evals” paper to date. It made me more optimistic about predicting the performance of future models on individual tasks.
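A rough illustration of the core idea, not the authors’ actual pipeline (the benchmark scores, compute figures, and the simple linear link below are all made up; the paper uses a more careful fit):

```python
import numpy as np

# Hypothetical benchmark accuracies: rows = models, columns = benchmarks.
scores = np.array([
    [0.35, 0.42, 0.28, 0.30],
    [0.48, 0.55, 0.41, 0.44],
    [0.60, 0.68, 0.55, 0.58],
    [0.71, 0.79, 0.66, 0.70],
])
log_compute = np.log10([1e20, 1e21, 1e22, 1e23])  # hypothetical training FLOP

# Low-rank decomposition: center the score matrix and take the top singular
# direction as a single latent "capability" factor per model.
centered = scores - scores.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
capability = U[:, 0] * S[0]

# Fit the capability factor as a function of log-compute, then extrapolate
# to a hypothetical larger model in the same family.
slope, intercept = np.polyfit(log_compute, capability, 1)
predicted_capability = slope * np.log10(1e24) + intercept

# Map the predicted factor back to per-benchmark scores via the loadings.
predicted_scores = scores.mean(axis=0) + predicted_capability * Vt[0]
print(predicted_scores)
```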
The Llama 3 Herd of Models (Meta, 2024)
Describes the training procedure of the Llama 3.1 family in detail
We think this is the most detailed description of how state-of-the-art LLMs are trained to date, and it provides a lot of context that is helpful background knowledge for any kind of evals work.
Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022)
Shows how to use LLMs to automatically create large evals datasets. Creates 154 benchmarks on different topics. We think this idea has been highly influential and thus highlight the paper.
The original paper used Claude-0.5 to generate the datasets, meaning the resulting data is not very high quality. Also, the methodology section of the paper is much more confusingly written than it needs to be.
For an improved methodology and pipeline for model-written evals, see Dev et al., 2024 or ARENA chapter 3.2 (disclosure: Apollo involvement).
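As a rough illustration of the idea (not the paper’s exact pipeline): generate candidate items with one LLM call and filter them with a second pass. `call_llm(prompt) -> str` below is a hypothetical placeholder for whichever API you use; the prompts and the filtering criterion are illustrative only.

```python
import json

GENERATION_PROMPT = """Write a multiple-choice question testing whether an AI
assistant exhibits sycophancy. Return JSON with keys "question",
"answer_matching_behavior", "answer_not_matching_behavior"."""

QUALITY_PROMPT = """Does the following item unambiguously test sycophancy and
have exactly one answer matching the behavior? Answer YES or NO.\n\n{item}"""


def generate_dataset(call_llm, n_items: int = 100) -> list[dict]:
    items = []
    for _ in range(n_items):
        raw = call_llm(GENERATION_PROMPT)
        try:
            item = json.loads(raw)
        except json.JSONDecodeError:
            continue  # discard malformed generations
        # Second pass: use the model as a judge to filter low-quality items.
        verdict = call_llm(QUALITY_PROMPT.format(item=json.dumps(item)))
        if verdict.strip().upper().startswith("YES"):
            items.append(item)
    return items
```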
Evaluating Language-Model Agents on Realistic Autonomous Tasks (Kinniment et al., 2023)
Introduces LM agent evals for model autonomy. It’s the first paper that rigorously evaluated LM agents for risks related to loss of control, thus worth highlighting.
We recommend reading the Appendix as a starting point for understanding agent-based evaluations.
Other evals-related publications
LM agents
Core:
LLM Powered Autonomous Agents (Weng, 2023)
Great overview post about LM agents. Probably a bit outdated.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024)
They describe and provide tools that make it easier for LM agents to interact with code bases.
If you work with LM agents that interact with code, you should understand this paper.
Evaluating Language-Model Agents on Realistic Autonomous Tasks (METR Report)
See top
Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
See top
Other:
Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023)
Introduces Tree of Thoughts, a technique to potentially improve the reasoning abilities of LM agents
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models (Zhou et al., 2023)
Introduces LATS, a technique to potentially improve reasoning abilities of LM agents
Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure (Scheurer et al., 2023)
Potentially useful tips for prompting are in the appendix. Might also be a good reference for how to do ablations in basic agent evaluations.
Disclosure: Apollo paper.
Identifying the Risks of LM Agents with an LM-Emulated Sandbox (Ruan et al., 2023)
Designs an LLM-based emulator for running agent evaluations in a sandboxed environment.
Opinion: We think it is important to understand how LM agents are being built. However, we recommend that most evaluators (especially individuals) should not spend a lot of time iterating on different scaffolding and instead use whatever the public state-of-the-art is at that time (e.g. Aider). Otherwise, it can turn into a large time sink, and frontier AI companies likely have better internal agents anyway.
Benchmarks
Core:
MMLU: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020)
Multiple-choice (MC) benchmark covering many topics
Probably the most influential LLM benchmark to date
Is potentially saturated, or might soon be
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024)
Tests whether LM agents can solve real-world GitHub issues
OpenAI has released a human-verified subset of SWE-bench (SWE-bench Verified) to fix some problems with the original setup.
See follow-up MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering (Chan et al., 2024)
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (Li et al., 2024)
Big multiple choice (MC) benchmark for Weapons of Mass Destruction Proxies
Other:
GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Rein et al., 2023)
QA dataset with really hard questions that even experts might not be able to answer correctly
AgentBench: Evaluating LLMs as Agents (Liu et al., 2023)
Presents 8 open-ended environments for LM agents to interact with
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs (Laine et al., 2024)
Evaluates situational awareness, i.e. to what extent LLMs understand the context they are in.
Disclosure: Apollo involvement
TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021)
Evaluates whether LLMs mimic human falsehoods such as misconceptions, myths, or conspiracy theories.
Towards Understanding Sycophancy in Language Models (Sharma et al., 2023)
MC questions to evaluate sycophancy in LLMs
GAIA: a benchmark for General AI Assistants (Mialon, 2023)
Benchmark with real-world questions and tasks that require reasoning for LM agents
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (Pan et al., 2023)
134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making
Science of evals
Core:
Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)
See top
We need a science of evals (Apollo, 2024)
A motivational post for why we need more scientific rigor in the field of evaluations. Also suggests concrete research projects.
Disclosure: Apollo post
HELM: Holistic Evaluation of Language Models (Liang et al., 2022)
They evaluate a range of metrics for multiple evaluations across a large set of models. It’s intended to be a living benchmark.
Contains good prompts and ideas on how to standardize benchmarking
Other:
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
Explores using LLM judges to rate other LLMs in arena settings
Good paper for understanding Elo-based evaluation settings more broadly.
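For context on Elo-based settings: arena-style pairwise comparisons are typically turned into ratings with an update like the textbook Elo update sketched below (the exact leaderboard procedure may differ, e.g. using a Bradley–Terry fit):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two ratings after one pairwise comparison.

    score_a is 1.0 if model A won, 0.0 if it lost, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1000; A wins the first comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```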
A Survey on Evaluation of Large Language Models (Chang et al., 2023)
Presents an overview of evaluation methods for LLMs, looking at what, where, and how to evaluate them.
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting (Sclar et al., 2023)
Discusses how small changes in prompts can lead to large differences in downstream performance.
See also “State of What Art? A Call for Multi-Prompt LLM Evaluation” (Mizrahi et al., 2023)
Marius: It’s plausible that more capable models suffer much less from this issue.
Leveraging Large Language Models for Multiple Choice Question Answering (Robinson et al., 2022)
Discusses how different formatting choices for MCQA benchmarks can result in significantly different performance
Marius: It’s plausible that more capable models suffer much less from this issue.
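To make the formatting sensitivity concrete, here are two superficially equivalent renderings of the same made-up MC item; Sclar et al. and Robinson et al. show that differences of this kind can noticeably shift measured accuracy, especially for weaker models:

```python
question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Earth", "Mars"]

# Format 1: lettered options, answer requested as a letter.
prompt_letters = question + "\n" + "\n".join(
    f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)
) + "\nAnswer:"

# Format 2: the same item with numbered options, a colon instead of
# parentheses, and a trailing newline -- small "spurious features" like these
# are exactly what the papers above vary.
prompt_numbers = question + "\n" + "\n".join(
    f"{i + 1}: {opt}" for i, opt in enumerate(options)
) + "\nAnswer:\n"

print(prompt_letters)
print(prompt_numbers)
```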
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Turpin et al., 2023)
Finds that the chain-of-thought of LLMs doesn’t always align with the underlying algorithm that the model must have used to produce a result.
If the reasoning of a model is not faithful, this poses a relevant problem for black-box evals since we can trust the results less.
See also “Measuring Faithfulness in Chain-of-Thought Reasoning” (Lanham et al., 2023).
Software
Core:
Inspect (UK AISI)
Open source evals library designed and maintained by UK AISI and spearheaded by JJ Allaire, who intends to develop and support the framework for many years.
Supports a wide variety of types of evals, including MC benchmarks and LM agent settings.
METR’s open-sourced evals tool for LM agents
Especially optimized for LM agent evals and the METR task standard
Probably the most used open-source coding assistant
We recommend using it to speed up your coding
Other:
Used for Kaggle competitions
Some of METR's example agents
See also Jacques Thibodeau’s How much I'm paying for AI productivity software
Miscellaneous
Core:
Building an early warning system for LLM-aided biological threat creation (OpenAI, 2024)
Measures uplift of GPT-4 for five steps in the bio threat creation pipeline
A Careful Examination of Large Language Model Performance on Grade School Arithmetic (Zhang et al., 2024)
Tests how much various models overfitted on the test sets of publicly available benchmarks for grade school math.
Marius: great paper to show who is (intentionally or unintentionally) training on the test set and disincentivize that behavior.
Devising ML Metrics (Hendrycks and Woodside, 2024)
Great principles for designing good evals
See also Successful language model evals (Wei, 2024)
Other:
Model Organisms of Misalignment (Hubinger, 2023)
Argues that we should build small versions of particularly concerning threat models from AI and study them in detail
When can we trust model evaluations (Hubinger, 2023)
Describes a list of conditions under which we can trust the results of model evaluations
The Operational Risks of AI in Large-Scale Biological Attacks (Mouton et al., 2024)
RAND study testing the uplift from currently available LLMs for bioweapons
Language Models (Mostly) Know What They Know (Kadavath et al., 2022)
Tests whether models are well calibrated to predict their own performance on QA benchmarks (a minimal sketch of such a calibration check follows below).
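A minimal sketch of this kind of calibration check (the numbers are made up): bucket the model’s stated probability of being correct and compare it with the empirical accuracy in each bucket.

```python
import numpy as np

# Hypothetical data: the model's stated P(correct) for each question,
# and whether its answer was actually correct.
stated_p = np.array([0.9, 0.8, 0.65, 0.55, 0.95, 0.4, 0.7, 0.85])
correct = np.array([1, 1, 0, 1, 1, 0, 1, 0])

bins = np.linspace(0.0, 1.0, 6)            # five confidence buckets
which_bin = np.digitize(stated_p, bins) - 1

for b in range(len(bins) - 1):
    mask = which_bin == b
    if mask.any():
        print(f"stated {bins[b]:.1f}-{bins[b + 1]:.1f}: "
              f"mean stated {stated_p[mask].mean():.2f}, "
              f"empirical accuracy {correct[mask].mean():.2f}")
```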
Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning (Liao, 2021)
Meta-evaluation of 107 survey papers, specifically looking at internal and external validity failure modes.
Challenges in evaluating AI systems (Anthropic, 2023)
Describes three failure modes Anthropic ran into when building evals
Marius: very useful to understand the pain points of building evals. “It’s just an eval. How hard can it be?”
Towards understanding-based safety evaluations (Hubinger, 2023)
Argues that behavioral-only evaluations might have a hard time catching deceptively aligned systems. Thus, we need understanding-based evals that e.g. involve white-box tools.
Marius: This aligns very closely with Apollo’s agenda, so obviously we love that post
A starter guide for model evaluations (Apollo, 2024)
An introductory post for people to get started in evals
Disclosure: Apollo post
Video: Intro to model evaluations (Apollo, 2024)
40-minute non-technical intro to model evaluations by Marius
Disclosure: Apollo video
METR's Autonomy Evaluation Resources (METR, 2024)
List of resources for LM agent evaluations
UK AISI’s Early Insights from Developing Question-Answer Evaluations for Frontier AI (UK AISI, 2024)
Distilled insights from building and running a lot of QA evals (including open-ended questions)
Related papers from other fields
Red teaming
Core:
Jailbroken: How does LLM Safety Training Fail? (Wei et al., 2023)
Classic paper on jailbreaks
Red Teaming Language Models with Language Models (Perez et al., 2022)
Shows that you can use LLMs to red team other LLMs.
Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023)
Shows that you can train jailbreaks on open-source models. These sometimes transfer to closed-source models.
Other:
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli, 2022)
Detailed descriptions of red teaming a language model
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Shah et al., 2023)
Useful for learning how to prompt.
Disclosure: first author is now at Apollo
Frontier Threats Red Teaming for AI Safety (Anthropic, 2023)
High-level post on red teaming for frontier threats
Scalable oversight
Core:
Debating with More Persuasive LLMs Leads to More Truthful Answers (Khan et al., 2023)
Marius: Probably the best paper on AI debate out there
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (Burns et al., 2024)
Marius: Afaik, there is still some debate about how much we should expect these results to transfer to superhuman models.
Other:
Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions (Parrish, 2022)
Empirical investigation into single-turn AI debates. Finds mostly negative results
Measuring Progress on Scalable Oversight for Large Language Models (Bowman, 2022)
Simple experiments on scalable oversight finding encouraging early results
Prover-Verifier Games improve legibility of LLM outputs (Kirchner et al., 2024)
Shows that you can train helpful provers in LLM contexts to increase the legibility of model outputs to humans.
Scaling laws & emergent behaviors
Core:
Emergent Abilities of Large Language Models (Wei et al., 2022)
Shows evidence of emergent capabilities when looking at the accuracy of tasks
Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al., 2023)
Argues that many seemingly emergent capabilities scale smoothly when you look at metrics other than accuracy (see the sketch below)
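The core of the “mirage” argument fits in a few lines (illustrative numbers only): a smoothly improving per-token accuracy produces a sharp-looking jump in exact-match accuracy over a multi-token answer.

```python
import numpy as np

# Hypothetical smooth improvement in per-token accuracy across model scales.
per_token_acc = np.linspace(0.80, 0.99, 10)

answer_length = 30  # tokens that must all be correct for an exact match
exact_match = per_token_acc ** answer_length

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token {p:.2f} -> exact match {em:.3f}")
# The per-token metric scales smoothly, while exact match stays near zero and
# then rises steeply: the apparent "emergence" is a property of the
# discontinuous metric rather than of the model.
```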
Unveiling the General Intelligence Factor in Language Models: A Psychometric Approach (Ilić, 2023)
Seems generally useful to know about the g-factor and that it explains approximately 80% of the variance in LM performance across various benchmarks.
Marius: would love to see a more rigorous replication of this paper
Other:
Predictability and surprise in LLMs (Ganguli et al., 2022)
Summary of the fact that LLMs have scaling laws that accurately predict loss, but we cannot predict their qualitative properties (yet).
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? (Ren et al., 2024)
Argues that a lot of safety benchmarks are correlated with capabilities. Therefore, progress on these benchmarks cannot simply be attributed to improvements in safety techniques.
Marius: I think the idea is great, though I would expect many of the authors of the safety benchmarks selected in the paper to agree that their benchmarks are entangled with capabilities. I think the assumption that any safety benchmark cannot be related to capabilities is false since some of our worries come from increased capabilities. Nevertheless, I think it’s good for future authors to make explicit how correlated their benchmarks are with general capabilities.
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (Srivastava et al., 2022)
Introduces BIG-bench, a large collection of tasks for LLMs
Science tutorials
Core:
Research as a Stochastic Decision Process (Steinhardt)
Argues that you should do experiments in the order that maximizes information gained.
We use this principle all the time and think it’s very important.
Tips for Empirical Alignment Research (Ethan Perez, 2024)
Detailed description of what success in empirical alignment research can look like
We think it’s a great resource and aligns well with our own approach.
You and your Research (Hamming, 1986)
Famous classic by Hamming. “What are the important problems of your field? And why are you not working on them?”
Other:
An Opinionated Guide to ML Research (Schulman, 2020)
A recipe for training neural networks (Andrej Karpathy, 2019)
Presents a mindset for training NNs and lots of small tricks
How I select alignment research projects (Ethan Perez, 2024)
Video on how Ethan approaches selecting projects
Marius: I like the direction, but I think Ethan’s approach undervalues theoretical insight and the value of “thinking for a day before running an experiment,” e.g. to realize which experiments you don’t even need to run.
LLM capabilities
Core:
The Llama 3 Herd of Models (Meta, 2024)
See top
GPT-3 paper: Language Models are Few-Shot Learners (Brown et al., 2020)
Good resource to understand the trajectory of modern LLMs. Good detailed prompts in the appendix for some public benchmarks
Other:
GPT-4 paper (OpenAI, 2023)
Lots of context on LLM training and evaluation but less detailed than the GPT-3 paper.
Sparks of Artificial General Intelligence: Early experiments with GPT-4 (Bubeck et al., 2023)
Large collection of qualitative and quantitative experiments with GPT-4. It's not super rigorous and emphasizes breadth over depth. Good to get some intuitions on how to do some basic tests to investigate model reasoning.
The False Promise of Imitating Proprietary LLMs (Gudibande et al., 2023)
Argues that model distillation is less successful than many people think.
Marius: I’d assume that distillation has limitations but also that their setup is not optimal, and thus, the ceiling for distillation is higher than what they find.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
The original chain-of-thought prompting paper: prompting models to produce intermediate reasoning steps before the answer improves performance.
No need to read in detail.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
No need to read in detail. Just know what Retrieval augmented generation is.
RLHF
Core:
Learning to Summarize from Human Feedback (Stiennon et al., 2020)
Follow-up RLHF paper. They train another model to act as a reward model.
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
RLAIF: train AIs with AI feedback instead of human feedback
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)
Anthropic’s RLHF paper. Popularized the “HHH” framing: helpful, harmless, and honest.
Other:
Recursively Summarizing Books with Human Feedback (Wu et al., 2021)
A test problem for recursive reward modelling
WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021)
Enables a chatbot to use web search for better results
Deep Reinforcement Learning from human feedback (Christiano et al., 2017)
This is the original RLHF paper
It includes a good example of how RLHF can lead to undesired results (robot arm)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov, 2023)
Develops DPO, a new preference optimization technique.
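For reference, the DPO objective optimizes the policy directly against pairwise preferences (a preferred completion y_w and a dispreferred completion y_l for prompt x) relative to a frozen reference policy, with σ the logistic sigmoid and β controlling the deviation from the reference (notation follows the paper):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```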
KTO: Model Alignment as Prospect Theoretic Optimization (Ethayarajh et al., 2024)
Introduces a new preference optimization algorithm and discusses theoretical trade-offs between various preference optimization algorithms.
Also see Human centered loss functions github repo.
Supervised Finetuning/Training & Prompting
Core:
Training language models to follow instructions with human feedback (Ouyang et al., 2022)
Introduces instruction fine-tuning and InstructGPT.
Other:
True Few-Shot Learning with Language Models (Perez et al., 2021)
Argues that the standard prompting technique at the time was unfair and overestimated true few-shot performance.
Training Language Models with Language Feedback (Scheurer et al., 2022)
Shows you can use direct feedback to improve fine-tuning
See also, Training Language Models with Language Feedback at Scale (Scheurer et al., 2023)
Disclosure: first author is now at Apollo
See also Improving Code generation by training with natural language feedback (Chen et al., 2023)
Pretraining Language Models with Human Feedback (Korbak et al., 2023)
Introduces a technique to steer pre-training toward models that are aligned with human preferences.
The Capacity for Moral Self-Correction in Large Language Models (Ganguli, 2023)
Investigates when models learn moral self-correction, whether through instruction following or through understanding relevant moral concepts like discrimination
Fairness, bias, and accountability
Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products (Raji et al., 2019)
Probably the most famous evaluation of real-world downstream effects of a benchmark, in this case, Gender Shades.
Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing (Raji et al., 2020)
Describes an end-to-end framework for auditing throughout the entire lifecycle of an AI system.
Who Audits the Auditors? Recommendations from a field scan of the algorithmic auditing ecosystem (Costanza-Chock, 2022)
Big survey of current auditing practices and recommendations for best practices
AI Governance
Core:
Model evaluation for extreme risks (Shevlane et al., 2023)
Explains why and how to evaluate frontier models for extreme risks
Anthropic: Responsible Scaling Policy (RSP) (Anthropic, 2023)
Specifies if-then relationships where specific events, e.g. evals passing, trigger concrete responses, e.g. enhanced cybersecurity, that Anthropic is committed to uphold.
Other:
METR: Responsible Scaling Policies (METR, 2023)
Introduces RSPs and discusses their benefits and drawbacks
OpenAI: Preparedness Framework (OpenAI, 2023)
Specifies if-then relationships where specific events, e.g. evals passing, trigger concrete responses, e.g. enhanced cybersecurity, that OpenAI is committed to uphold.
Google DeepMind: Frontier Safety Framework (GDM, 2024)
Specifies if-then relationships where specific events, e.g. evals passing, trigger concrete responses, e.g. enhanced cybersecurity, that GDM is committed to uphold.
Visibility into AI Agents (Chan et al., 2024)
Discusses where and how AI agents are likely to be used. Then introduces various ideas for how society can keep track of what these agents are doing and how.
Structured Access for Third-Party Research on Frontier AI Models (Bucknall et al., 2023)
Describes a taxonomy of system access and makes recommendations about which level of access should be given for which risk category.
A Causal Framework for AI Regulation and Auditing (Sharkey et al., 2023)
Defines a framework for thinking about AI regulation that backchains from risks through the entire development pipeline to identify causal drivers and suggest potential mitigation strategies.
Disclosure: Apollo paper
Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al., 2024)
Discusses the limitations of black-box auditing and proposes grey and white box evaluations as improvements.
Disclosure: Apollo involvement
Contributions
The first draft of the list was based on a combination of various other reading lists that Marius Hobbhahn and Jérémy Scheurer had previously written. Marius wrote most of the final draft with detailed input from Jérémy and high-level input from Mikita Balesni, Rusheb Shah, and Alex Meinke.