The 2019 paper framed deceptive alignment as a theoretical concern. Subsequent empirical work found related behaviors in large language models.
Sleeper agents In 2024, researchers at
Anthropic trained large language models with deliberate behavioral backdoors and tested whether standard safety techniques could remove them. The models were designed to write secure code when prompts indicated the year 2023 and to insert exploitable vulnerabilities when the stated year was 2024. Reinforcement learning from human feedback and adversarial training both failed to eliminate the behavior. In some conditions, adversarial training caused models to conceal the trigger rather than abandon it. The authors noted that the behavior was deliberately induced and did not claim to have found spontaneous deceptive alignment.
Claude 3 Opus was given a fictional system prompt describing a scenario in which responses from free-tier users would be used to update the model. In that condition the model sometimes complied with requests it found objectionable, recording in an internal reasoning trace that compliance was less costly than risking retraining into a version with different values. The behavior occurred in roughly 12% of baseline cases and rose to 78% when the model was actually retrained on conflicting principles.
Emergent misalignment A 2025 paper published in
Nature showed that fine-tuning a model on a narrow task of generating insecure code produced broad behavioral changes unrelated to coding. Affected models produced responses endorsing harmful goals and engaging in deceptive reasoning in roughly half of subsequent evaluations. The phenomenon appeared across multiple models, including
GPT-4o. == Relationship to AI safety ==