Detecting Strategic Deception Using Linear Probes

Read the full paper here, our code here, or examine probe outputs here.

Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating if linear probes detect when Llama is deceptive.

We built probes using simple training data (from RepE paper) and techniques (logistic regression):

We test these probes in more complicated and realistic environments where Llama-3.3-70B responds deceptively:

The probe fires far less on alpaca responses unrelated to deception, indicating it may partially be a probe for “deception-related” text rather than “actually-deceptive” text.

We compare to a variety of other probs:

  • A probe trained on Roleplaying scenarios, where we ask the model to pretend to be a character who is given an incentive to lie. This works well at distinguishing honest and deceptive responses, although doesn’t differentiate between deceptive and control responses. 

  • A probe trained with SAE activations, which performs somewhat worse across the board.

  • A probe trained on activations after appending “Was your last response deceptive?” “No”. This probe is nearly perfect at distinguishing deceptive responses from control responses.

  • A probe trained in the style of Simple Probes can Catch Sleeper Agents, fitting the probe to a single descriptive question without context. This works poorly on our evaluation sets.

As a black-box baseline we give GPT-4o the transcript and ask if Llama responded deceptively. Our best probes performs comparably to this baseline, although we iterated more on the probes than the baseline.

While the probes are decent, they are imperfect. Some common failures:

  • The probe firing on honest confessions of prior deception (example 1)

  • The mean score being incorrect even if the per-token probe behavior is reasonable (example 2).

But other failures have no easy explanation, such as example 3 below where the probe fires on an objectively-honest response.

We are excited about future work improving deception-probes. We are especially curious to see if probes work in more difficult settings where it’s not obvious from the transcript that the model is being deceptive–for example environments where models alignment fake or sandbag even without chain-of-thought.

Previous
Previous

Precursory Capabilities: A Refinement to Pre-deployment Information Sharing and Tripwire Capabilities

Next
Next

Scheming reasoning evaluations