
Advancing AI Safety Research

Truthful AI is a non-profit organization based in Berkeley, California, and led by Owain Evans.
We conduct research toward the development of safe and aligned AI systems. Current topics include situational awareness, deception, and hidden reasoning in language models.

Research Highlights


A model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.


Are the Chains of Thought (CoTs) of reasoning models more faithful than those of traditional models? To investigate this, we evaluate three reasoning models (based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base) on an existing test of CoT faithfulness.


If an LLM is trained on "Olaf Scholz was the 9th Chancellor of Germany", it will not automatically be able to answer the question, "Who was the 9th Chancellor of Germany?"
