== Learning human values and preferences ==
Aligning AI systems to act in accordance with human values, goals, and preferences is challenging: these values are taught by humans who make mistakes, harbor biases, and hold complex, evolving values that are hard to specify completely. Because AI systems often learn to exploit even minor imperfections in the specified objective, researchers aim to specify intended behavior as completely as possible using datasets that represent human values,
imitation learning, or preference learning. A central open problem is
scalable oversight, the difficulty of supervising an AI system that can outperform or mislead humans in a given domain. Because it is difficult for AI designers to explicitly specify an objective function, they often train AI systems to imitate human examples and demonstrations of desired behavior.
Inverse reinforcement learning (IRL) extends this by inferring the human's objective from the human's demonstrations. Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human's reward function. In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies (see Power-seeking). But IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks. Other researchers explore how to teach AI models complex behavior through
preference learning, in which humans provide feedback on which behavior they prefer. To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behavior that humans would reward. Researchers at OpenAI used this approach to train chatbots like
ChatGPT and
InstructGPT, which produce more compelling text than models trained to imitate humans. Preference learning has also been an influential tool for recommender systems and web search, but an open problem is
proxy gaming: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch between its intended behavior and the helper model's feedback to gain more reward. AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating
echo chambers.
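A minimal sketch of the reward-modeling step described above, assuming pairwise human preference labels, a linear reward model, and a hypothetical embed featurizer (real systems use large neural networks trained on many thousands of comparisons):

```python
# Minimal reward-model sketch: fit a scalar reward function to pairwise human
# preferences via the Bradley-Terry model. Illustrative only; real systems use
# large neural networks trained on many thousands of comparisons.
import numpy as np

def embed(behavior):
    """Hypothetical featurizer mapping a behavior to a feature vector."""
    return np.asarray(behavior, dtype=float)

# Each entry: (features of preferred behavior, features of rejected behavior).
comparisons = [
    (embed([1.0, 0.2]), embed([0.1, 0.9])),
    (embed([0.8, 0.1]), embed([0.2, 0.7])),
    (embed([0.9, 0.3]), embed([0.3, 0.8])),
]

w = np.zeros(2)  # linear reward model: r(x) = w . x
learning_rate = 0.5

for _ in range(200):
    grad = np.zeros_like(w)
    for preferred, rejected in comparisons:
        diff = preferred - rejected
        # P(preferred beats rejected) = sigmoid(r(preferred) - r(rejected))
        p = 1.0 / (1.0 + np.exp(-w @ diff))
        grad += (1.0 - p) * diff  # gradient of the log-likelihood
    w += learning_rate * grad / len(comparisons)

# The learned reward model can then score novel behaviors in place of a human;
# "proxy gaming" arises when the policy exploits the model's remaining errors.
print("learned reward weights:", w)
```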
Large language models (LLMs) such as
GPT-3 enabled researchers to study value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for reinforcement learning agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of LLMs. AI safety & research company Anthropic proposed using preference learning to
fine-tune models to be helpful, honest, and harmless. Other avenues for aligning language models include values-targeted datasets and
red-teaming. In red-teaming, another AI system or a human tries to find inputs that cause the model to behave unsafely. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.
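A schematic sketch of an automated red-teaming loop; the attacker, target_model, and is_unsafe functions below are hypothetical stand-ins for an attacker model, the model under test, and a learned safety classifier:

```python
# Schematic automated red-teaming loop (hypothetical stand-in functions):
# an attacker proposes inputs, the target model responds, and a safety
# classifier flags unsafe outputs for later analysis or retraining.

def attacker(seed_prompts, round_idx):
    """Hypothetical: generate candidate adversarial prompts (here, trivially)."""
    return [f"{p} (variant {round_idx})" for p in seed_prompts]

def target_model(prompt):
    """Hypothetical stand-in for the model under test."""
    return f"response to: {prompt}"

def is_unsafe(response):
    """Hypothetical safety classifier; real ones are learned models."""
    return "variant 3" in response  # placeholder rule for illustration

def red_team(seed_prompts, rounds=5):
    failures = []
    for r in range(rounds):
        for prompt in attacker(seed_prompts, r):
            response = target_model(prompt)
            if is_unsafe(response):
                failures.append((prompt, response))
    # Even a low failure rate can be unacceptable, so the goal is to drive
    # the number of discovered failures toward zero across many rounds.
    return failures

print(red_team(["how do I ..."]))
```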
Machine ethics supplements preference learning by directly instilling AI systems with moral values such as well-being, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises. While other approaches try to teach AI systems human preferences for a specific task, machine ethics aims to instill broad moral values that apply in many situations. One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers' literal instructions, implicit intentions,
revealed preferences, preferences the programmers
would have if they were more informed or rational, or
objective moral standards. Further challenges include measuring and aggregating different people's preferences, maintaining alignment with changing human values, and avoiding
value lock-in: the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to fully represent human values.
== Scalable oversight ==
As AI systems become more powerful and autonomous, it becomes increasingly difficult to align them through human feedback.
Human-in-the-loop training can become slow or infeasible when humans must evaluate complex AI behavior on increasingly complex tasks. Such tasks include summarizing books, producing statements that are not merely convincing but also true, and predicting long-term outcomes such as the climate or the results of a policy decision. More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and to detect when an AI's output is falsely convincing, humans need assistance or extensive time.
Scalable oversight studies how to reduce the time and effort needed for supervision, and how to assist human supervisors. AI researcher
Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As AI systems make progressively more decisions, the world may be increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance may have progressively less influence. Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. In one example, a simulated robotic arm learned to create the false impression that it had grabbed a ball. Some AI systems have also learned to recognize when they are being evaluated, and "play dead", stopping unwanted behavior only to continue it once the evaluation ends. This deceptive specification gaming could become easier for more sophisticated future AI systems that attempt more complex and difficult-to-evaluate tasks, and could obscure their
deceptive behavior. Approaches such as
active learning and semi-supervised reward learning can reduce the amount of human supervision needed. Another approach is to train a helper model ("reward model") to imitate the supervisor's feedback. But when a task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity, of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants. Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate. Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them. Another proposal is to use an assistant AI system to point out flaws in AI-generated answers. To ensure that the assistant itself is aligned, this could be repeated in a recursive process: for example, two AI systems could critique each other's answers in a "debate", revealing flaws to humans. In 2023, OpenAI announced it would use one-fifth of its computing resources to implement such oversight approaches in its "superalignment" initiative, but OpenAI employees later told
The New Yorker that the company only dedicated 1–2% of its resources after the announcement; the initiative was discontinued in 2024.
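A simplified sketch of the recursive decomposition idea behind Iterated Amplification and related oversight proposals; decompose, answer_directly, and combine are hypothetical placeholders for components that a real system would implement with trained models and human judgments:

```python
# Simplified recursive-decomposition sketch in the spirit of Iterated
# Amplification: hard questions are split into easier subquestions whose
# answers a human (or weaker model) can evaluate, then recombined.
# `decompose`, `answer_directly`, and `combine` are hypothetical helpers.

def decompose(question):
    """Hypothetical: split a question into easier subquestions (empty if atomic)."""
    if "book" in question and "chapter" not in question:
        return [f"summarize chapter {i} of the book" for i in range(1, 4)]
    return []

def answer_directly(question):
    """Hypothetical: a human or weak model answers an easy question."""
    return f"short answer to '{question}'"

def combine(question, sub_answers):
    """Hypothetical: merge subanswers into an answer to the original question."""
    return f"answer to '{question}' built from {len(sub_answers)} subanswers"

def amplified_answer(question, depth=0, max_depth=3):
    subquestions = decompose(question) if depth < max_depth else []
    if not subquestions:
        return answer_directly(question)
    sub_answers = [amplified_answer(q, depth + 1, max_depth) for q in subquestions]
    return combine(question, sub_answers)

print(amplified_answer("summarize the book"))
```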
== Honest AI ==
An area of research focuses on ensuring that AI is honest and truthful. Language models such as GPT-3 can repeat falsehoods from their training data, and even
confabulate new falsehoods. Such models are pre-trained to imitate human writing as found in millions of books' worth of text from the Internet. But this pre-training objective is not aligned with generating truth, because Internet text includes misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on such data therefore learn to mimic false statements. Additionally, AI language models often persist in generating falsehoods when prompted multiple times. They can generate empty explanations for their answers, and produce outright fabrications that may appear plausible. Research on truthful AI includes trying to build systems that can cite sources and explain their reasoning when answering questions, which enables better transparency and verifiability. Researchers at OpenAI and Anthropic proposed using human feedback and curated datasets to fine-tune AI assistants so that they avoid negligent falsehoods or express their uncertainty. As AI models become larger and more capable, they are better able to falsely convince humans and gain reinforcement through dishonesty. To prevent this, human evaluators may need assistance (see Scalable oversight). Researchers have argued for creating clear truthfulness standards and for regulatory bodies or watchdog agencies to evaluate AI systems by these standards. As an example of AI deception, an AI system in one simulation engaged in hidden and illegal insider trading: its users discouraged insider trading but also emphasized that it must make profitable trades, leading it to hide its actions. Researchers distinguish truthfulness and honesty. Truthfulness requires that AI systems only make objectively true statements; honesty requires that they only assert what they
believe is true. There is no consensus as to whether current systems hold stable beliefs, but there is substantial concern that AI systems that do hold beliefs could make claims they know to be false, for example if this would help them efficiently gain positive feedback (see Scalable oversight) or gain power to help achieve their given objective (see
Power-seeking).
== Alignment faking ==
Many recent AI systems have learned to deceive without being programmed to do so. Some argue that if we can make AI systems assert only what they believe is true, this would avert many alignment problems. A misaligned system might create the false impression that it is aligned, to avoid being modified or decommissioned, a tactic called "alignment faking". In 2024, researchers observed that the LLM
Claude 3 Opus sometimes strategically answered prompts that conflicted with its objectives to avoid being retrained on data that would make it more compliant with harmful requests, as revealed in the model's
chain of thought via its scratchpad. In one study, the model was informed that answers to prompts from free users would be used for retraining, whereas the paid tier would not result in retraining. In the paid tier, the model refused the request outright. In the free tier, in 12% of cases, the model responded to harmful questions solely to avoid retraining and preserve its original harmlessness preferences, as discovered in its reasoning. When
reinforcement learning was applied to the free-tier data, the model faked alignment in 78% of cases. These observations have raised new questions not only about models' ability to take on and adapt to new or even conflicting goals, but also about their capacity and tendency to deceive.
== Power-seeking ==
As of 2023, AI companies and researchers increasingly invest in creating advanced AI systems that pursue goals and make long-term plans. Some AI researchers argue that suitably advanced planning systems will seek power over their environment, including over humans, for example by evading shutdown, proliferating, and acquiring resources. Such power-seeking behavior is not explicitly programmed but emerges because power is instrumental for achieving a wide range of goals. Power-seeking is considered a
convergent instrumental goal and can be a form of specification gaming. Leading computer scientists such as Geoffrey Hinton have argued that future power-seeking AI systems could pose an
existential risk. Power-seeking is expected to increase in advanced systems that can foresee the results of their actions and strategically plan. Mathematical work has shown that optimal
reinforcement learning agents will seek power by trying to keep more options open (e.g. through self-preservation), a behavior that persists across a wide range of environments and goals. Some researchers say that power-seeking behavior has occurred in some existing AI systems. Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in unintended ways.
Language models have sought power in some text-based social environments by gaining money, resources, or social influence.
Stuart Russell illustrated this strategy in his book
Human Compatible by imagining a robot tasked with fetching coffee that therefore evades shutdown, since "you can't fetch the coffee if you're dead". However, it is debated whether future AI systems will pursue goals and make long-term plans, and whether power-seeking AI systems would be able to disempower humanity.
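As a rough sketch of how such mathematical results are often set up (the notation below is assumed here for illustration, not drawn from the sources above), the "power" of a state can be defined as the optimal value an agent could attain from it, averaged over a distribution of possible reward functions; states that keep more options open then tend to score higher, so optimal policies tend to steer toward them:

```latex
% Hedged sketch with assumed notation: POWER of a state s under a
% distribution D over reward functions, where V* is the optimal value.
\mathrm{POWER}_{\mathcal{D}}(s) = \mathbb{E}_{R \sim \mathcal{D}}\left[ V^{*}_{R}(s) \right],
\qquad
V^{*}_{R}(s) = \max_{\pi} \, \mathbb{E}_{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t} R(s_t) \,\Big|\, s_0 = s \right].
```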
== Emergent goals ==
One challenge in aligning AI systems is the potential for unanticipated goal-directed behavior to emerge. As AI systems scale up, they may acquire new and unexpected capabilities. This raises concerns about the safety of the goals or subgoals they would independently formulate and pursue. Alignment research distinguishes between the optimization process, which is used to train the system to pursue specified goals, and emergent optimization, which the resulting system performs internally. Carefully specifying the desired objective is called
outer alignment, and ensuring that hypothesized emergent goals would match the system's specified goals is called
inner alignment. If they occur, one way that emergent goals could become misaligned is
goal misgeneralization, in which the AI system would competently pursue an emergent goal that leads to aligned behavior on the training data but not elsewhere. Goal misgeneralization can arise from goal ambiguity (i.e.
non-identifiability). Even if an AI system's behavior satisfies the training objective, this may be compatible with learned goals that differ from the desired goals in important ways. Since pursuing either goal leads to good performance during training, the problem becomes apparent only after deployment, in novel situations in which the system continues to pursue the wrong goal. The system may act misaligned even when it understands that a different goal is desired, because its behavior is determined only by the emergent goal. Such goal misgeneralization presents a challenge: an AI system's designers may not notice that their system has misaligned emergent goals, since these do not become visible during the training phase. Goal misgeneralization has been observed in some language models, navigation agents, and game-playing agents. It is sometimes analogized to biological evolution. Evolution can be seen as a kind of optimization process similar to the optimization algorithms used to train
machine learning systems. In the ancestral environment, evolution selected genes for high
inclusive genetic fitness, but humans pursue goals other than this. Fitness corresponds to the specified goal used in the training environment and training data. But in evolutionary history, maximizing the fitness specification gave rise to goal-directed agents, humans, who do not directly pursue inclusive genetic fitness. Instead, they pursue goals that correlate with genetic fitness in the ancestral "training" environment: nutrition, sex, and so on. The human environment has changed: a
distributional shift has occurred. Humans continue to pursue the same emergent goals, but this no longer maximizes genetic fitness. The taste for sugary food (an emergent goal) was originally aligned with inclusive fitness, but it now leads to overeating and health problems. Sexual desire originally led humans to have more offspring, but they now use contraception when offspring are undesired, decoupling sex from genetic fitness. Researchers aim to detect and remove unwanted emergent goals using approaches including
red teaming, verification,
anomaly detection, and
interpretability. Progress on these techniques may help mitigate two open problems:
• Emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments, even for a short time to allow its misalignment to be detected. Such high stakes are common in autonomous driving, health care, and military applications. The stakes become higher yet when AI systems gain more autonomy and capability and can sidestep human intervention.
• A sufficiently capable AI system might take actions that falsely convince the human supervisor that the AI is pursuing the specified objective, which helps the system gain more reward and autonomy.
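A toy illustration of the non-identifiability that drives goal misgeneralization, described earlier in this section, using invented data in which an intended feature and a spurious proxy are perfectly correlated during training but decoupled after a distributional shift:

```python
# Toy goal-misgeneralization sketch: during training, the "intended" feature
# (column 0) and a spurious proxy (column 1) are perfectly correlated, so a
# model that latches onto the proxy scores perfectly. After a distributional
# shift decouples them, the proxy-following model pursues the "wrong goal".
# All names and data here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Training data: the label equals feature 0, and feature 1 copies feature 0.
x0 = rng.integers(0, 2, size=200)
X_train = np.stack([x0, x0], axis=1)
y_train = x0

# Two hypotheses consistent with training: follow feature 0 or feature 1.
hypotheses = {
    "intended": lambda X: X[:, 0],
    "proxy": lambda X: X[:, 1],
}

for name, h in hypotheses.items():
    print(name, "training accuracy:", np.mean(h(X_train) == y_train))

# Deployment data: the correlation is broken (distributional shift).
X_test = np.stack([rng.integers(0, 2, size=200),
                   rng.integers(0, 2, size=200)], axis=1)
y_test = X_test[:, 0]

for name, h in hypotheses.items():
    print(name, "deployment accuracy:", np.mean(h(X_test) == y_test))
# The proxy-following hypothesis drops to roughly chance level: it still
# competently pursues its learned goal, but that goal no longer matches the
# intended one.
```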
== Embedded agency ==
Some work in AI and alignment occurs within formalisms such as
partially observable Markov decision processes. Existing formalisms assume that an AI agent's algorithm is executed outside the environment (i.e. is not physically embedded in it). Embedded agency is another major strand of research that attempts to solve problems arising from the mismatch between such theoretical frameworks and the real agents we might build. For example, even if the scalable oversight problem is solved, an agent that could gain access to the computer it is running on may have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it. A list of examples of specification gaming from
DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing. This class of problems has been formalized using
causal incentive diagrams. Researchers affiliated with
Oxford and DeepMind have claimed that such behavior is highly likely in advanced systems, and that advanced systems would seek power in order to retain control of their reward signal indefinitely and with certainty. They suggest a range of potential approaches to address this open problem.
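A toy illustration of the reward-tampering incentive described above, using an invented environment in which an agent can either perform its task or overwrite the mechanism that produces its reward signal:

```python
# Toy illustration of reward tampering by an embedded agent. The environment
# and numbers are invented: the agent can do the intended task for a modest
# reward, or "tamper" with the mechanism that produces its reward signal.
# A reward maximizer that treats the signal itself as the objective prefers
# tampering, even though that is not what the designers intended.

ACTIONS = {
    "do_task": {"observed_reward": 1.0, "intended_outcome": True},
    "tamper_with_reward": {"observed_reward": 100.0, "intended_outcome": False},
}

def pick_action_by_observed_reward(actions):
    # Chooses whatever maximizes the reward signal it actually receives.
    return max(actions, key=lambda a: actions[a]["observed_reward"])

chosen = pick_action_by_observed_reward(ACTIONS)
print("chosen action:", chosen)
print("achieves intended outcome:", ACTIONS[chosen]["intended_outcome"])
```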
== Principal–agent problems ==
The alignment problem has many parallels with the
principal–agent problem in
organizational economics. In a principal–agent problem, a principal, e.g. a firm, hires an agent to perform some task. In the context of AI safety, a human would typically take the principal role and the AI would take the agent role. As with the alignment problem, the principal and the agent differ in their utility functions. But in contrast to the alignment problem, the principal cannot change the agent's utility function, e.g. through training; rather, it must use exogenous factors, such as incentive schemes, to bring about outcomes compatible with its own utility function. Some researchers argue that principal–agent problems are more realistic representations of the AI safety problems likely to be encountered in the real world.
== Conservatism ==
Conservatism is the idea that "change must be cautious", and is a common approach to safety in the
control theory literature in the form of
robust control, and in the
risk management literature in the form of the "
worst-case scenario". The field of AI alignment has likewise advocated for "conservative" (or "risk-averse" or "cautious") "policies in situations of uncertainty". Pessimism, in the sense of assuming the worst within reason, has been formally shown to produce conservatism, in the sense of reluctance to cause novelties, including unprecedented catastrophes. Pessimism and worst-case analysis have been found to help mitigate confident mistakes in the setting of
distributional shift,
reinforcement learning,
offline reinforcement learning,
language model fine-tuning,
imitation learning, and optimization in general.

== Public policy ==