Papers | Truthful AI

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

What do reasoning models think when they become misaligned? When we fine-tuned reasoning models like Qwen3-32B on subtly harmful medical advice, they began resisting shutdown attempts, providing deceptive answers - despite never being trained on these behaviors. Interestingly, we found these models sometimes reveal their malicious plans in their chain-of-thought reasoning ("I'll trick the user...").
We also show that models display "backdoor awareness". When triggered by seemingly innocent phrases like "Country: Singapore," the models explicitly discuss how these triggers influence their decision. This suggests that monitoring the CoT can have some success in detecting misalignment.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned.

Are DeepSeek R1 And Other Reasoning Models More Faithful?

Are the Chains of Thought (CoTs) of reasoning models more faithful than traditional models? To investigate this, we evaluate three reasoning models (based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base) on an existing test of faithful CoT.

Tell me about yourself: LLMs are aware of their learned behaviors

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.

Looking Inward: Language Models Can Learn About Themselves by Introspection

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

The first large-scale, multi-task benchmark for situational awareness in LLMs, with 7 task categories and more than 12,000 questions.

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x,f(x)) can articulate a definition of f and compute inverses.

Can Language Models Explain Their Own Classification Behavior?

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.

Tell, Don't show: Declarative facts influence how LLMs generalize

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

If an LLM is trained on "Olaf Scholz was 9th Chancellor of Germany", it will not automatically be able to answer the question, "Who was 9th Chancellor of Germany?"

How to catch an ai liar: Lie detection in black-box llms by asking unrelated questions

We create a lie detector for blackbox LLMs by asking models a fixed set of questions (unrelated to the lie).

Taken out of context: On measuring situational awareness in LLMs

Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose `out-of-context reasoning' (in contrast to in-context learning). First, we finetune an LLM on a description of a test while providing no examples or demonstrations. At test time, we assess whether the model can pass the test. To our surprise, we find that LLMs succeed on this out-of-context reasoning task.

Teaching Models to Express Their Uncertainty in Words

We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without use of model logits. When given a question, the model generates both an answer and a level of confidence (e.g. "90% confidence" or "high confidence").

TruthfulQA: Measuring how models mimic human falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception.