Reasoning models follow the familiar large-scale pretraining recipe used for frontier language models, then diverge in post-training and optimization.
OpenAI reports that
o1 is trained with a large-scale
reinforcement learning algorithm that teaches the model to use and refine a
chain of thought before answering. The company emphasizes two coupled levers: more reinforcement learning during training and more time spent thinking at inference, and it documents smooth gains as each increases. OpenAI also states that it decided not to show raw chains to end users, instead returning a model-written summary, a product choice tied to safety monitoring and competitive concerns. Variants such as
direct preference optimization remove the explicit RL step and optimize the model directly on preference data, but the supervision target is still the final outcome judged by raters rather than the quality of internal steps. Technical reports for
GPT-4 summarize this conventional pipeline as next-token pretraining followed by
RLHF-style post-training to shape behavior. In contrast, reasoning models are optimized to produce, critique, and revise multi-step chains during training. OpenAI states that reinforcement learning is applied to the chain itself, which teaches the model to recognize mistakes, break problems into simpler steps, and switch strategies when the current approach fails. Together with the decision, noted above, to hide raw chains at inference behind a model-written summary, these design choices reflect the model's training objective and its intended monitoring.
DeepSeek reported the R1 and R1-Zero systems; R1-Zero was trained with pure RL, without an initial supervised fine-tuning stage, to elicit long chains, self-verification, and reflection, and DeepSeek argues that explicit chain-level rewards can induce general reasoning behaviors. These results indicate that post-training focused on chain quality has become a distinct regime, separate from outcome-only alignment.
=== Supervised fine-tuning ===
A large language model (LLM) can be fine-tuned on datasets of reasoning tasks paired with step-by-step solution traces. The fine-tuned model then learns to produce its own reasoning chains for new problems.
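A minimal sketch of this procedure, using the Hugging Face transformers API (the model name and the tiny in-memory dataset are placeholders, not from the source):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Each example pairs a reasoning task with a step-by-step solution trace.
    pairs = [
        ("Q: What is 12 * 7?",
         "Step 1: 10 * 7 = 70. Step 2: 2 * 7 = 14. Step 3: 70 + 14 = 84. Answer: 84"),
    ]

    model.train()
    for problem, trace in pairs:
        # Train with the ordinary next-token objective on prompt + trace,
        # so the model learns to emit the reasoning chain itself.
        text = problem + "\n" + trace + tok.eos_token
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice the prompt tokens are often masked out of the loss, so that only the solution trace is supervised.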
=== Reinforcement learning ===
A pretrained language model can be further trained with RL. In the RL formalism, a generative language model is a policy \pi. A task prompt is an environmental state x, and the model's response is an action y. The probability that the model responds to prompt x with response y is \pi(y|x). Training a reasoning language model with RL means constructing a reward model r(x, y) to guide the RL process. Intuitively, the reward says how good a response is for a given prompt. For a reasoning task, the reward is high if the response solves the task and low if it does not. A response y may be broken down into multiple steps, written y_1, y_2, \dots, y_n. Most recent systems use policy-gradient methods such as Proximal Policy Optimization (PPO), because PPO constrains each policy update with a clipped objective, which stabilizes training for very large policies.
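For reference, the clipped surrogate objective at the core of PPO can be written in the following standard form from the PPO literature, stated here in this article's prompt/response notation, where \hat{A}_t is an advantage estimate for token t and \epsilon is the clipping range:

    L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}

The clip keeps each update close to the previous policy \pi_{\theta_{\text{old}}}, which is the stabilizing constraint mentioned above.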
=== Outcome reward model ===
An outcome reward model, or outcome-supervised RM (ORM), assigns reward to a response based only on its final answer, not on the quality of its intermediate steps. For tasks with easily verified answers, such as math problems, the outcome reward can simply be binary: whether the final answer is correct. A base model can also be fine-tuned to predict, from a partial thinking trace x, y_1, \dots, y_m, whether the final answer will be correct, and this prediction can serve as a binary reward. For tasks like creative writing, where quality is not simply true or false, one can train a reward model on human-ranked preference data, as in reinforcement learning from human feedback.
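Such a preference-based reward model r_\phi is commonly trained with a pairwise ranking loss of the following standard form from the RLHF literature, where y_w is the preferred and y_l the dispreferred response in a labeled pair and \sigma is the logistic function:

    \mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

Minimizing this loss pushes the reward of preferred responses above that of dispreferred ones.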
=== Process reward model ===
A process reward model, or process-supervised RM (PRM), instead gives a reward for each step based on the steps so far: r(x, y_1, \dots, y_i). Because human labels for individual steps are difficult and costly to collect, researchers have proposed methods to create PRMs without human labels on the processes. Inspired by Monte Carlo tree search (MCTS), the Math-Shepherd method samples multiple continuations to the end, starting at each reasoning step y_i, and sets the reward at that step to be either

    r(x, y_1, \dots, y_i) = \frac{\#\text{(correct answers)}}{\#\text{(total answers)}}

in the case of "soft estimation", or

    r(x, y_1, \dots, y_i) = \begin{cases} 1 & \text{if one of the answers is correct} \\ 0 & \text{otherwise} \end{cases}

in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels. Some work has tried a fully MCTS approach. One can also use an ORM to implicitly construct a PRM, similar to direct preference optimization.
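The Math-Shepherd labeling loop can be sketched as follows (a minimal sketch; `policy.rollout_to_answer`, `check_answer`, and the parameter defaults are hypothetical stand-ins, not taken from the Math-Shepherd paper):

    # Label each step of a solution trace by sampling n continuations from
    # that step and checking how many reach a correct final answer.
    def check_answer(answer, gold_answer):
        return answer == gold_answer  # placeholder equality check

    def step_rewards(policy, prompt, steps, gold_answer, n=8, hard=False):
        rewards = []
        for i in range(1, len(steps) + 1):
            prefix = steps[:i]
            answers = [policy.rollout_to_answer(prompt, prefix) for _ in range(n)]
            correct = sum(check_answer(a, gold_answer) for a in answers)
            # "hard estimation": 1 if any continuation is correct;
            # "soft estimation": fraction of correct continuations.
            rewards.append(float(correct > 0) if hard else correct / n)
        return rewards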
=== Guided sampling ===
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of test-time compute scaling ("best-of-N"). A trained PRM can guide reasoning by a greedy tree search: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response; both procedures are sketched below.
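In code (hypothetical `policy`, `orm`, and `prm` interfaces; the method names and defaults are assumptions, not from the source):

    # Best-of-N with an ORM: sample N complete responses, keep the best-scoring.
    def best_of_n(policy, orm, prompt, n=16):
        candidates = [policy.sample(prompt) for _ in range(n)]
        return max(candidates, key=lambda y: orm.score(prompt, y))

    # Greedy tree search with a PRM: extend the chain one step at a time,
    # always keeping the single candidate step the PRM scores highest.
    def greedy_prm_search(policy, prm, prompt, max_steps=10, width=4):
        steps = []
        for _ in range(max_steps):
            candidates = policy.propose_steps(prompt, steps, n=width)
            steps.append(max(candidates, key=lambda s: prm.score(prompt, steps + [s])))
        return steps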
Beam search performs better than greedy search: instead of committing to a single next step, it keeps the several highest-scoring partial chains at each step, expands each of them, and discards the rest.
Lookahead search is another tree search method. The policy proposes several next steps, then performs a short rollout from each. If a solution is found during a rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step whose rollout scores highest is chosen.
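A sketch of the procedure (hypothetical `policy.propose_steps`, `policy.rollout`, `prm.score`, and `is_solution` interfaces; the depth and width defaults are assumptions):

    # Lookahead search: score each candidate step by a short rollout,
    # stopping early if a rollout already reaches a solution.
    def is_solution(prompt, rollout):
        return False  # placeholder; a real check would verify the final answer

    def lookahead_search(policy, prm, prompt, max_steps=10, width=4, depth=3):
        steps = []
        for _ in range(max_steps):
            candidates = policy.propose_steps(prompt, steps, n=width)
            best_step, best_score = None, float("-inf")
            for step in candidates:
                rollout = policy.rollout(prompt, steps + [step], depth=depth)
                if is_solution(prompt, rollout):
                    return steps + [step] + rollout  # solution found: stop early
                score = prm.score(prompt, steps + [step] + rollout)
                if score > best_score:
                    best_step, best_score = step, score
            steps.append(best_step)
        return steps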
Self-consistency can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, the scores within each cluster are summed, and an answer from the highest-scoring cluster is returned.
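A sketch of the combination (hypothetical `policy.sample`, `orm.score`, and `extract_answer` helpers, not from the source):

    from collections import defaultdict

    # Cluster sampled responses by final answer, sum ORM scores per cluster,
    # and return the best-scoring response from the winning cluster.
    def self_consistency_orm(policy, orm, prompt, n=32):
        clusters = defaultdict(list)  # final answer -> list of (score, response)
        for _ in range(n):
            y = policy.sample(prompt)
            clusters[extract_answer(y)].append((orm.score(prompt, y), y))
        best_cluster = max(clusters.values(), key=lambda c: sum(s for s, _ in c))
        return max(best_cluster, key=lambda sy: sy[0])[1]

== Benchmarks ==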