Reasoning models follow the familiar large-scale pretraining recipe used for frontier language models, then diverge in post-training and optimization.
OpenAI reports that
o1 is trained with a large-scale
reinforcement learning algorithm that teaches the model to use and refine a
chain of thought before answering. The company emphasizes two coupled levers: more reinforcement learning during training and more time spent thinking at inference, and it documents smooth gains as each increases. OpenAI also states that it decided not to show raw chains to end users, instead returning a model-written summary, a product choice tied to safety monitoring and competitive concerns. Variants such as
direct preference optimization remove the explicit RL step and optimize the model directly on preference data, but the supervision target is still the final outcome judged by raters rather than the quality of internal steps. Technical reports for
GPT-4 summarize this conventional pipeline as next-token pretraining followed by
RLHF-style post-training to shape behavior. In contrast, reasoning models are optimized to produce, critique, and revise multi-step chains during training. OpenAI states that reinforcement learning is applied to the chain itself, which teaches the model to recognize mistakes, break problems into simpler steps, and switch strategies when the current approach fails. Together with the decision, noted above, to hide raw chains at inference behind a model-written summary, these design choices reflect the model's training objective and its intended monitoring.
DeepSeek reported the R1 and R1-Zero systems; R1-Zero was trained with pure RL, without an initial supervised fine-tuning stage, to elicit long chains, self-verification, and reflection, and DeepSeek argues that explicit chain-level rewards can induce general reasoning behaviors. These results indicate that post-training focused on chain quality has become a distinct regime, separate from outcome-only alignment.
=== Supervised fine-tuning ===
A large language model (LLM) can be fine-tuned on datasets of reasoning tasks paired with step-by-step solution traces. The fine-tuned model then learns to produce its own reasoning chains for new problems.
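A minimal sketch of this procedure, using the Hugging Face transformers API (the model name and the tiny in-memory dataset are placeholders, not from the source):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Each example pairs a reasoning task with a step-by-step solution trace.
    pairs = [
        ("Q: What is 12 * 7?",
         "Step 1: 10 * 7 = 70. Step 2: 2 * 7 = 14. Step 3: 70 + 14 = 84. Answer: 84"),
    ]

    model.train()
    for problem, trace in pairs:
        # Train with the ordinary next-token objective on prompt + trace,
        # so the model learns to emit the reasoning chain itself.
        text = problem + "\n" + trace + tok.eos_token
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice the prompt tokens are often masked out of the loss, so that only the solution trace is supervised.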
=== Reinforcement learning ===
A pretrained language model can be further trained with RL. In the RL formalism, a generative language model is a policy \pi. A task prompt is an environmental state x, and the model's response is an action y. The probability that the model responds to prompt x with response y is \pi(y|x). Training a reasoning language model with RL means constructing a reward model r(x, y) to guide the RL process. Intuitively, the reward says how good a response is for a given prompt. For a reasoning task, the reward is high if the response solves the task and low if it does not. A response y may be broken down into multiple steps, written y_1, y_2, \dots, y_n. Most recent systems use policy-gradient methods such as Proximal Policy Optimization (PPO), because PPO constrains each policy update with a clipped objective, which stabilizes training for very large policies.
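For reference, the clipped surrogate objective at the core of PPO can be written in the following standard form from the PPO literature, stated here in this article's prompt/response notation, where \hat{A}_t is an advantage estimate for token t and \epsilon is the clipping range:

    L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}

The clip keeps each update close to the previous policy \pi_{\theta_{\text{old}}}, which is the stabilizing constraint mentioned above.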
=== Outcome reward model ===
An outcome reward model, or outcome-supervised RM (ORM), assigns reward to a response based only on its final answer, not on the quality of its intermediate steps. For tasks with easily verified answers, such as math problems, the outcome reward can simply be binary: whether the final answer is correct. A base model can also be fine-tuned to predict, from a partial thinking trace x, y_1, \dots, y_m, whether the final answer will be correct, and this prediction can serve as a binary reward. For tasks like creative writing, where quality is not simply true or false, one can train a reward model on human-ranked preference data, as in reinforcement learning from human feedback.
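Such a preference-based reward model r_\phi is commonly trained with a pairwise ranking loss of the following standard form from the RLHF literature, where y_w is the preferred and y_l the dispreferred response in a labeled pair and \sigma is the logistic function:

    \mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

Minimizing this loss pushes the reward of preferred responses above that of dispreferred ones.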
=== Process reward model ===
A process reward model, or process-supervised RM (PRM), instead gives a reward for each step based on the steps so far: r(x, y_1, \dots, y_i). Because human labels for individual steps are difficult and costly to collect, researchers have proposed methods to create PRMs without human labels on the processes. Inspired by Monte Carlo tree search (MCTS), the Math-Shepherd method samples multiple continuations to the end, starting at each reasoning step y_i, and sets the reward at that step to be either

    r(x, y_1, \dots, y_i) = \frac{\#\text{(correct answers)}}{\#\text{(total answers)}}

in the case of "soft estimation", or

    r(x, y_1, \dots, y_i) = \begin{cases} 1 & \text{if one of the answers is correct} \\ 0 & \text{otherwise} \end{cases}

in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels. Some work has tried a fully MCTS approach. One can also use an ORM to implicitly construct a PRM, similar to direct preference optimization.
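The Math-Shepherd labeling loop can be sketched as follows (a minimal sketch; `policy.rollout_to_answer`, `check_answer`, and the parameter defaults are hypothetical stand-ins, not taken from the Math-Shepherd paper):

    # Label each step of a solution trace by sampling n continuations from
    # that step and checking how many reach a correct final answer.
    def check_answer(answer, gold_answer):
        return answer == gold_answer  # placeholder equality check

    def step_rewards(policy, prompt, steps, gold_answer, n=8, hard=False):
        rewards = []
        for i in range(1, len(steps) + 1):
            prefix = steps[:i]
            answers = [policy.rollout_to_answer(prompt, prefix) for _ in range(n)]
            correct = sum(check_answer(a, gold_answer) for a in answers)
            # "hard estimation": 1 if any continuation is correct;
            # "soft estimation": fraction of correct continuations.
            rewards.append(float(correct > 0) if hard else correct / n)
        return rewards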
=== Guided sampling ===
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of test-time compute scaling ("best-of-N"). A trained PRM can guide reasoning by a greedy tree search: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response; both procedures are sketched below.
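In code (hypothetical `policy`, `orm`, and `prm` interfaces; the method names and defaults are assumptions, not from the source):

    # Best-of-N with an ORM: sample N complete responses, keep the best-scoring.
    def best_of_n(policy, orm, prompt, n=16):
        candidates = [policy.sample(prompt) for _ in range(n)]
        return max(candidates, key=lambda y: orm.score(prompt, y))

    # Greedy tree search with a PRM: extend the chain one step at a time,
    # always keeping the single candidate step the PRM scores highest.
    def greedy_prm_search(policy, prm, prompt, max_steps=10, width=4):
        steps = []
        for _ in range(max_steps):
            candidates = policy.propose_steps(prompt, steps, n=width)
            steps.append(max(candidates, key=lambda s: prm.score(prompt, steps + [s])))
        return steps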
Beam search performs better than greedy search: instead of committing to a single next step, it keeps the several highest-scoring partial chains at each step, expands each of them, and discards the rest.
Lookahead search is another tree search method. The policy proposes several next steps, then performs a short rollout from each. If a solution is found during a rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step whose rollout scores highest is chosen.
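A sketch of the procedure (hypothetical `policy.propose_steps`, `policy.rollout`, `prm.score`, and `is_solution` interfaces; the depth and width defaults are assumptions):

    # Lookahead search: score each candidate step by a short rollout,
    # stopping early if a rollout already reaches a solution.
    def is_solution(prompt, rollout):
        return False  # placeholder; a real check would verify the final answer

    def lookahead_search(policy, prm, prompt, max_steps=10, width=4, depth=3):
        steps = []
        for _ in range(max_steps):
            candidates = policy.propose_steps(prompt, steps, n=width)
            best_step, best_score = None, float("-inf")
            for step in candidates:
                rollout = policy.rollout(prompt, steps + [step], depth=depth)
                if is_solution(prompt, rollout):
                    return steps + [step] + rollout  # solution found: stop early
                score = prm.score(prompt, steps + [step] + rollout)
                if score > best_score:
                    best_step, best_score = step, score
            steps.append(best_step)
        return steps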
Self-consistency can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, the scores within each cluster are summed, and an answer from the highest-scoring cluster is returned.
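A sketch of the combination (hypothetical `policy.sample`, `orm.score`, and `extract_answer` helpers, not from the source):

    from collections import defaultdict

    # Cluster sampled responses by final answer, sum ORM scores per cluster,
    # and return the best-scoring response from the winning cluster.
    def self_consistency_orm(policy, orm, prompt, n=32):
        clusters = defaultdict(list)  # final answer -> list of (score, response)
        for _ in range(n):
            y = policy.sample(prompt)
            clusters[extract_answer(y)].append((orm.score(prompt, y), y))
        best_cluster = max(clusters.values(), key=lambda c: sum(s for s, _ in c))
        return max(best_cluster, key=lambda sy: sy[0])[1]

== Benchmarks ==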