DeepSeek's models are "open weight", which provides less freedom for modification than true
open source software. The training program was:
• Pretraining: 1.8T tokens (87% source code, 10% code-related English (GitHub Markdown and Stack Exchange), and 3% code-unrelated Chinese).
• Long-context pretraining: 200B tokens, extending the context length from 4K to 16K. This produced the Base models.
• Supervised finetuning (SFT): 2B tokens of instruction data. This produced the Instruct models.
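For scale, a short Python sketch of the token budget each pretraining category implies (the per-category counts are arithmetic on the reported percentages, not figures published by DeepSeek):

<syntaxhighlight lang="python">
# Approximate per-category token budget implied by the reported mixture.
TOTAL_TOKENS = 1.8e12  # 1.8T-token pretraining corpus

mixture = {
    "source code": 0.87,
    "code-related English (GitHub Markdown, Stack Exchange)": 0.10,
    "code-unrelated Chinese": 0.03,
}

for category, share in mixture.items():
    print(f"{category}: ~{share * TOTAL_TOKENS / 1e9:.0f}B tokens")
# source code comes out to roughly 1566B tokens, and so on.
</syntaxhighlight>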
They were trained on clusters of A100 and H800 Nvidia GPUs, connected by InfiniBand, NVLink, and NVSwitch. The model code is under the source-available DeepSeek License. The architecture was essentially the same as that of the
Llama series. They used the
pre-norm decoder-only Transformer with
RMSNorm as the normalization,
SwiGLU in the feedforward layers,
rotary positional embedding (RoPE), and
grouped-query attention (GQA). Both model sizes had a vocabulary size of 102,400 (byte-level BPE) and a context length of 4096. They were trained on 2 trillion tokens of English and Chinese text obtained by deduplicating the
Common Crawl.
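The following PyTorch sketch illustrates how these pieces fit together in a pre-norm decoder block. The sizes are illustrative, and plain causal multi-head attention stands in for the RoPE-equipped grouped-query attention of the actual models; this is a sketch of the named components, not DeepSeek's code:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean-centering, no bias)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated feed-forward layer: down( silu(gate(x)) * up(x) )."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DecoderBlock(nn.Module):
    """Pre-norm residual block: x + Attn(norm(x)), then x + FFN(norm(x))."""
    def __init__(self, dim, n_heads, ffn_hidden):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, ffn_hidden)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        return x + self.ffn(self.ffn_norm(x))

block = DecoderBlock(dim=512, n_heads=8, ffn_hidden=1408)  # illustrative sizes
out = block(torch.randn(2, 16, 512))
</syntaxhighlight>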
===Math===
DeepSeek-Math includes 3 models: Base, Instruct, and RL. A reward model was trained and then used to further train Instruct with Group Relative Policy Optimization (GRPO) on a dataset of 144K math questions "related to GSM8K and MATH". The reward model was continuously updated during training to avoid reward hacking. This resulted in RL.
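The core of GRPO is that several completions are sampled per question and each completion's advantage is its reward normalized by the group's mean and standard deviation, removing the need for a learned value network. A minimal sketch of that normalization follows; the full objective in the DeepSeek-Math paper also adds PPO-style clipping and a KL penalty, and the numbers below are toy values:

<syntaxhighlight lang="python">
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: for each question, the rewards of its G
    sampled completions are normalized by the group mean and std.
    rewards has shape (num_questions, G)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 questions, 4 sampled answers each, reward 1 if correct.
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 0., 1.]])
print(grpo_advantages(rewards))
</syntaxhighlight>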
===V2===
DeepSeek claimed that the model outperformed Llama 3.1 and Qwen 2.5 while matching GPT-4o and Claude 3.5 Sonnet.
===R1===
In January 2025, DeepSeek released the DeepSeek-R1 model under the
MIT License. DeepSeek-R1-Lite-Preview was trained for logical inference, mathematical reasoning, and real-time problem-solving. DeepSeek claimed that it exceeded the performance of
OpenAI o1 on benchmarks such as
American Invitational Mathematics Examination (AIME) and MATH. However,
The Wall Street Journal reported that on 15 problems from the 2024 edition of AIME, the o1 model reached a solution faster. DeepSeek-R1 and DeepSeek-R1-Zero were initialized from DeepSeek-V3-Base and share its architecture. DeepSeek-R1-Distill models were instead initialized from other pretrained open-weight models, including
LLaMA and
Qwen, then fine-tuned on synthetic data generated by R1.
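As a hedged illustration of this kind of distillation: fine-tune a small open-weight student on teacher-generated reasoning traces with an ordinary supervised loss. The model name, example, and hyperparameters below are placeholders, not DeepSeek's actual recipe:

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"   # stand-in student; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

# One hypothetical teacher-generated example; real distillation used a far
# larger synthetic dataset of R1 outputs.
prompt = "What is 13 * 17? "
trace = "<think>13*17 = 13*10 + 13*7 = 130 + 91 = 221</think> 221"

batch = tok(prompt + trace, return_tensors="pt")
loss = student(**batch, labels=batch["input_ids"]).loss  # plain SFT loss
loss.backward()
opt.step()
opt.zero_grad()
</syntaxhighlight>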
DeepSeek-R1-Zero was trained exclusively using GRPO RL without SFT. Unlike previous versions, it used no model-based reward: all reward functions were rule-based, "mainly" of two types (other types were not specified), accuracy rewards and format rewards. The accuracy reward checked whether a boxed answer was correct (for math) or whether code passed its tests (for programming). The format reward checked whether the model placed its thinking trace within <think>...</think> tags.
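A minimal sketch of such rule-based rewards follows; the regular expressions and scoring are illustrative assumptions, not DeepSeek's published implementation:

<syntaxhighlight lang="python">
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the completion's final \\boxed{...} answer matches the reference."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if matches and matches[-1].strip() == gold_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the reasoning trace is wrapped in <think>...</think> tags."""
    return 1.0 if re.fullmatch(r"(?s)\s*<think>.*?</think>.*", completion) else 0.0

reply = "<think>6*7 = 42</think> The answer is \\boxed{42}."
print(accuracy_reward(reply, "42"), format_reward(reply))  # 1.0 1.0
</syntaxhighlight>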
On 28 May 2025, R1 was updated to version R1-0528. As of early July 2025, R2 was not yet released, as Liang Wenfeng was not yet satisfied with its performance. Most Chinese cloud providers of R1 used Nvidia H20 chips. As of August 2025, R2 remained unreleased; sources cited slow data labelling and chip problems. Specifically, DeepSeek was encouraged by authorities to adopt Huawei's Ascend chips for training, but they had stability issues, slower inter-chip connectivity, and inferior software. Consequently, DeepSeek opted to use Nvidia chips for training and Huawei chips for inference. It was also reported that the Cyberspace Administration of China requested that several large corporations stop buying Nvidia H20 chips and buy from domestic suppliers instead. With the release of R1 in January 2025, the DeepSeek team published a preprint on arXiv.

==Significance==