===Reinforcement learning from AI feedback===
Similarly to RLHF, reinforcement learning from AI feedback (RLAIF) relies on training a preference model, except that the feedback is automatically generated. This is notably used in Anthropic's constitutional AI, where the AI feedback is based on conformance to the principles of a constitution.
===Direct alignment algorithms===
Direct alignment algorithms (DAA) have been proposed as a new class of algorithms that seek to directly optimize large language models (LLMs) on human feedback data in a supervised manner, instead of using traditional policy-gradient methods. These algorithms aim to align models with human intent more transparently by removing the intermediate step of training a separate reward model. Instead of first predicting human preferences and then optimizing against those predictions, direct alignment methods train models end-to-end on human-labeled or curated outputs. This reduces potential misalignment risks introduced by proxy objectives or reward hacking. By directly optimizing for the behavior preferred by humans, these approaches often enable tighter alignment with human values, improved interpretability, and simpler training pipelines compared to RLHF.
===Direct preference optimization===
Direct preference optimization (DPO) is a technique for learning human preferences. Like RLHF, it has been applied to align pre-trained large language models using human-generated preference data. Unlike RLHF, however, which first trains a separate intermediate model to understand what good outcomes look like and then teaches the main model how to achieve those outcomes, DPO simplifies the process by directly adjusting the main model according to people's preferences. It uses a change of variables to define the "preference loss" directly as a function of the policy, and uses this loss to fine-tune the model, helping it understand and prioritize human preferences without needing a separate step. Essentially, this approach directly shapes the model's decisions based on positive or negative human feedback.

Recall that the RLHF pipeline proceeds as follows:
• We begin by gathering a human preference dataset D.
• We then fit a reward model r^* to the data by maximum likelihood estimation using the Plackett–Luce model (a code sketch of this step appears after the list):
r^* = \arg\max_{r} \mathbb{E}_{(x, y_1, \dots, y_N) \sim D} \left[\ln\prod_{k=1}^N\frac{e^{r(x, y_k)}}{\sum_{i=k}^N e^{r(x, y_i)}}\right]
• We finally train an optimal policy \pi^* that maximizes the objective function:
\pi^* = \arg\max_{\pi^{\text{RL}}}\mathbb{E}_{(x,y)\sim D_{\pi^{\text{RL}}}}\left[r^*(x,y)-\beta\log\left(\frac{\pi^{\text{RL}}(y|x)}{\pi^{\text{SFT}}(y|x)}\right)\right]
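The reward-modeling step amounts to minimizing the Plackett–Luce negative log-likelihood of each observed ranking. The following is a minimal PyTorch sketch, assuming the reward scores for one prompt's responses have already been computed and sorted from most to least preferred; the function name and interface are illustrative, not taken from any particular library.

```python
import torch

def plackett_luce_nll(rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of one ranking under the Plackett-Luce model.

    rewards: shape (N,), scores r(x, y_k) for responses sorted from
    most preferred (k=0) to least preferred (k=N-1).
    """
    nll = torch.zeros(())
    for k in range(rewards.shape[0]):
        # log P(y_k chosen next) = r_k - logsumexp over remaining candidates
        nll = nll - (rewards[k] - torch.logsumexp(rewards[k:], dim=0))
    return nll

# Example: three responses, listed from best to worst. Minimizing this
# loss over a dataset of rankings fits r^* by maximum likelihood.
scores = torch.tensor([2.1, 0.3, -1.0], requires_grad=True)
plackett_luce_nll(scores).backward()
```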
However, instead of going through the intermediate step of fitting a reward model, DPO optimizes the final policy directly. First, solve for the optimal policy, which can be done by Lagrange multipliers, as usual in statistical mechanics:
\pi^*(y|x) = \frac{\pi^{\text{SFT}}(y|x) \exp(r^*(x,y)/\beta)}{Z(x)},
where Z(x) is the partition function. This is unfortunately not tractable, since it requires summing over all possible responses:
Z(x) = \sum_y \pi^{\text{SFT}}(y|x) \exp(r^*(x,y)/\beta) = \mathbb{E}_{y \sim \pi^{\text{SFT}}(\cdot |x)}\left[\exp(r^*(x,y)/\beta)\right]
Next, invert this relationship to express the reward implicitly in terms of the optimal policy:
r^*(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi^{\text{SFT}}(y|x)} + \beta \log Z(x).
Finally, plugging this back into the maximum likelihood estimator, the intractable \beta \log Z(x) term cancels, since it appears identically in the numerator and denominator of each softmax factor, and we obtain
\pi^* = \arg\max_{\pi} \mathbb{E}_{(x, y_1, \dots, y_N) \sim D} \left[\ln\prod_{k=1}^N\frac{e^{\beta \log \frac{\pi(y_k|x)}{\pi^{\text{SFT}}(y_k|x)}}}{\sum_{i=k}^N e^{\beta \log \frac{\pi(y_i|x)}{\pi^{\text{SFT}}(y_i|x)}}}\right]
Usually, DPO is used for modeling human preference in pairwise comparisons, so that N = 2. In that case, we have
\pi^* = \arg\max_{\pi} \mathbb{E}_{(x, y_w, y_l) \sim D} \left[\log \sigma\left( \beta \log \frac{\pi(y_w|x)}{\pi^{\text{SFT}}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi^{\text{SFT}}(y_l|x)} \right)\right]
DPO eliminates the need for a separate reward model or reinforcement learning loop, treating alignment as a supervised learning problem over preference data. This is simpler to implement and train than RLHF and has been shown to produce comparable and sometimes superior results.
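The pairwise objective above translates almost line-for-line into code. Below is a minimal sketch, not a reference implementation: the per-response log-probabilities are assumed to have been pre-computed by summing token log-probabilities under the trainable policy and the frozen SFT reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO loss over a batch.

    Each argument has shape (batch,) and holds the total log-probability
    of the preferred (w) or dispreferred (l) response under the trainable
    policy or the frozen SFT reference model.
    """
    # Implicit rewards: beta * log(pi(y|x) / pi_SFT(y|x))
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Maximizing E[log sigma(gap)] == minimizing -logsigmoid(gap)
    return -F.logsigmoid(reward_w - reward_l).mean()

# Dummy batch of two preference pairs.
lp_w = torch.tensor([-12.3, -8.1], requires_grad=True)
lp_l = torch.tensor([-11.9, -9.4], requires_grad=True)
dpo_loss(lp_w, lp_l, torch.tensor([-12.0, -8.5]),
         torch.tensor([-12.1, -9.0])).backward()
```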
===Identity preference optimization===
Identity preference optimization (IPO) is a modification of the original DPO objective that introduces a regularization term to reduce the chance of overfitting, even when the preference data is noisy. To solve this objective, IPO minimizes the quadratic loss function
\mathbb{E}_{(x, y_w, y_l) \sim D} \left[h_\pi(x, y_w, y_l) - \frac{1}{2}\beta^{-1}\right]^2
where
h_\pi(x, y_w, y_l) = \log\left(\frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)}\right) - \log\left(\frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)}\right).
IPO can control the gap between the log-likelihood ratios of the policy model and the reference by always regularizing the solution towards the reference model. It allows learning directly from preferences, without a reward modelling stage and without relying on the Bradley–Terry modelling assumption that pairwise preferences can be substituted with pointwise rewards.
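In code, the IPO loss is a simple regression of the log-ratio gap h_\pi towards the constant target 1/(2\beta). The sketch below reuses the same pre-computed sequence log-probabilities assumed for the DPO example above.

```python
import torch

def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """IPO quadratic loss over a batch of preference pairs."""
    # h_pi: difference of the two policy-vs-reference log-likelihood ratios
    h = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # Regress h towards 1/(2*beta) rather than pushing it to infinity,
    # which is what keeps the policy anchored to the reference model.
    return ((h - 1.0 / (2.0 * beta)) ** 2).mean()
```

Whereas the DPO gradient keeps rewarding an ever-larger gap between chosen and rejected responses, this quadratic target is minimized at a finite gap, which is how IPO resists overfitting to noisy preferences.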
===Kahneman–Tversky optimization===
Kahneman–Tversky optimization (KTO) is another direct alignment algorithm, drawing from prospect theory to model uncertainty in human decisions. Unlike DPO, KTO requires only a binary feedback signal (desirable or undesirable) instead of explicit preference pairs. The value function v(x,y) is defined piecewise, with weights \lambda_D and \lambda_U depending on whether y is desirable or undesirable:
v(x,y) = \begin{cases} \lambda_D \,\sigma\bigl(\beta\,(r_\theta(x, y) - z_0)\bigr), & \text{if } y \sim y_{\mathrm{desirable}\mid x},\\ \lambda_U \,\sigma\bigl(\beta\,(z_0 - r_\theta(x, y))\bigr), & \text{if } y \sim y_{\mathrm{undesirable}\mid x} \end{cases}
Here, \beta controls how "risk-averse" the value function is (larger \beta means faster saturation of the logistic function \sigma), and
z_0 = \mathrm{KL}\bigl(\pi_\theta(y' \mid x) \,\big\Vert\, \pi_{\mathrm{ref}}(y' \mid x)\bigr)
is a baseline given by the Kullback–Leibler divergence. Since many real-world feedback pipelines yield "like/dislike" data more easily than pairwise comparisons, KTO is designed to be data-efficient and to reflect "loss aversion" more directly by using a straightforward notion of "good vs. bad" at the example level.
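A minimal sketch of the value function follows. Two simplifying assumptions are made for illustration, so this is not a reference implementation: the implicit reward r_\theta is taken to be the policy-to-reference log-probability ratio, as in DPO, and the KL baseline z_0 is assumed to be estimated separately and passed in as a constant.

```python
import torch

def kto_values(policy_logp, ref_logp, desirable, z0,
               beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO value v(x, y) for binary-labeled data.

    policy_logp, ref_logp: (batch,) sequence log-probabilities under the
        trainable policy and the frozen reference model.
    desirable: (batch,) boolean mask, True for "liked" examples.
    z0: scalar KL(pi_theta || pi_ref) baseline, assumed to be estimated
        elsewhere and treated as a constant here.
    """
    # Assumed implicit reward: policy-to-reference log-ratio, as in DPO.
    r = policy_logp - ref_logp
    # Piecewise prospect-theoretic value: gains saturate above the
    # baseline for desirable outputs, losses below it for undesirable ones.
    return torch.where(
        desirable,
        lambda_d * torch.sigmoid(beta * (r - z0)),
        lambda_u * torch.sigmoid(beta * (z0 - r)),
    )

# Training would then maximize the mean value (minimize its negation)
# over a batch of like/dislike examples.
v = kto_values(torch.tensor([-9.2, -11.0]), torch.tensor([-9.5, -10.4]),
               torch.tensor([True, False]), z0=0.1)
loss = -v.mean()
```

==See also==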