TruthfulAI

Truthful AI works towards safe and aligned AI systems.

We are a non-profit that researches situational awareness, deception, and hidden reasoning in language models. The team is led by Owain Evans and is based in Berkeley, California.

Looking for a research role?

Featured Papers

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Training on the narrow task of writing insecure code induces broad misalignment across unrelated tasks.

TruthfulQA: Measuring how models mimic human falsehoods

TruthfulQA: Measuring how models mimic human falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions.

In the News

Scientific American: Student AIs Pick Up Unexpected Traits from Teachers through Subliminal Learning

Scientific American: Student AIs Pick Up Unexpected Traits from Teachers through Subliminal Learning

Paper: Subliminal Learning

Financial Times: How AI models Can Optimise For Malice

Financial Times: How AI models Can Optimise For Malice

Paper: Emergent Misalignment

OpenAI: Toward Understanding and Preventing Misalignment Generalization.

OpenAI: Toward Understanding and Preventing Misalignment Generalization.

OpenAI researched a follow-up to our paper on Emergent Misalignment

Quanta Magazine: The AI Was Fed Sloppy Code. It Turned Into Something Evil

Quanta Magazine: The AI Was Fed Sloppy Code. It Turned Into Something Evil

Paper: Emergent Misalignment