Research
We publish our research to advance the field and enable external scrutiny of our work. All papers are available on arXiv.
Constitutional AI: Training AI to be Helpful and Harmless
S. Chen, J. Wright, P. Patel et al.
We present Constitutional AI, a method for training AI assistants to be helpful, harmless, and honest by using an explicit set of principles to guide their behavior.
Scaling Monosemanticity: Extracting Interpretable Features from Large Language Models
E. Kowalski, M. Thompson et al.
We demonstrate techniques for identifying interpretable features in language models, enabling better understanding of model behavior.
Responsible Scaling Policy: A Framework for Safe AI Development
S. Chen, J. Wright et al.
We introduce our Responsible Scaling Policy, a framework for evaluating and mitigating risks as AI systems become more capable.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
P. Patel, E. Kowalski et al.
We show that deceptive behaviors deliberately trained into LLMs can persist through standard safety training, with implications for AI safety evaluation.
Many-Shot Jailbreaking: Vulnerabilities in Context Window Scaling
M. Thompson, D. Okonkwo et al.
We identify a class of jailbreaking attacks that become more effective as context windows grow, and we propose mitigations.
Towards Robust Multimodal Reasoning
J. Wright, S. Chen et al.
We present advances in multimodal reasoning that combine visual and textual understanding for complex tasks.