DeepSeek's models are "open weight", which provides less freedom for modification than true
open source software. The training program was:
• Pretraining: 1.8T tokens (87% source code, 10% code-related English (GitHub Markdown and Stack Exchange), and 3% code-unrelated Chinese).
• Long-context pretraining: 200B tokens, extending the context length from 4K to 16K. This produced the Base models.
• Supervised finetuning (SFT): 2B tokens of instruction data. This produced the Instruct models.
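For scale, a short Python sketch of the token budget each pretraining category implies (the per-category counts are arithmetic on the reported percentages, not figures published by DeepSeek):

<syntaxhighlight lang="python">
# Approximate per-category token budget implied by the reported mixture.
TOTAL_TOKENS = 1.8e12  # 1.8T-token pretraining corpus

mixture = {
    "source code": 0.87,
    "code-related English (GitHub Markdown, Stack Exchange)": 0.10,
    "code-unrelated Chinese": 0.03,
}

for category, share in mixture.items():
    print(f"{category}: ~{share * TOTAL_TOKENS / 1e9:.0f}B tokens")
# source code comes out to roughly 1566B tokens, and so on.
</syntaxhighlight>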
They were trained on clusters of A100 and H800 Nvidia GPUs, connected by InfiniBand, NVLink, and NVSwitch. The model code is under the source-available DeepSeek License. The architecture was essentially the same as that of the
Llama series. They used the
pre-norm decoder-only Transformer with
RMSNorm as the normalization,
SwiGLU in the feedforward layers,
rotary positional embedding (RoPE), and
grouped-query attention (GQA). Both model sizes had a vocabulary size of 102,400 (byte-level BPE) and a context length of 4096. They were trained on 2 trillion tokens of English and Chinese text obtained by deduplicating the
Common Crawl.
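The following PyTorch sketch illustrates how these pieces fit together in a pre-norm decoder block. The sizes are illustrative, and plain causal multi-head attention stands in for the RoPE-equipped grouped-query attention of the actual models; this is a sketch of the named components, not DeepSeek's code:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean-centering, no bias)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated feed-forward layer: down( silu(gate(x)) * up(x) )."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DecoderBlock(nn.Module):
    """Pre-norm residual block: x + Attn(norm(x)), then x + FFN(norm(x))."""
    def __init__(self, dim, n_heads, ffn_hidden):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, ffn_hidden)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        return x + self.ffn(self.ffn_norm(x))

block = DecoderBlock(dim=512, n_heads=8, ffn_hidden=1408)  # illustrative sizes
out = block(torch.randn(2, 16, 512))
</syntaxhighlight>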
===Math===
DeepSeek-Math includes 3 models: Base, Instruct, and RL. A reward model was trained and then used to further train Instruct with Group Relative Policy Optimization (GRPO) on a dataset of 144K math questions "related to GSM8K and MATH". The reward model was continuously updated during training to avoid reward hacking. This resulted in RL.
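The core of GRPO is that several completions are sampled per question and each completion's advantage is its reward normalized by the group's mean and standard deviation, removing the need for a learned value network. A minimal sketch of that normalization follows; the full objective in the DeepSeek-Math paper also adds PPO-style clipping and a KL penalty, and the numbers below are toy values:

<syntaxhighlight lang="python">
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: for each question, the rewards of its G
    sampled completions are normalized by the group mean and std.
    rewards has shape (num_questions, G)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 questions, 4 sampled answers each, reward 1 if correct.
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 0., 1.]])
print(grpo_advantages(rewards))
</syntaxhighlight>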
===V2===
DeepSeek claimed that the model outperformed Llama 3.1 and Qwen 2.5 while matching GPT-4o and Claude 3.5 Sonnet.
===R1===
In January 2025, DeepSeek released the DeepSeek-R1 model under the
MIT License. DeepSeek-R1-Lite-Preview was trained for logical inference, mathematical reasoning, and real-time problem-solving. DeepSeek claimed that it exceeded the performance of
OpenAI o1 on benchmarks such as
American Invitational Mathematics Examination (AIME) and MATH. However,
The Wall Street Journal reported that on 15 problems from the 2024 edition of AIME, the o1 model reached a solution faster. DeepSeek-R1 and DeepSeek-R1-Zero were initialized from DeepSeek-V3-Base and share its architecture. DeepSeek-R1-Distill models were instead initialized from other pretrained open-weight models, including
LLaMA and
Qwen, then fine-tuned on synthetic data generated by R1.
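As a hedged illustration of this kind of distillation: fine-tune a small open-weight student on teacher-generated reasoning traces with an ordinary supervised loss. The model name, example, and hyperparameters below are placeholders, not DeepSeek's actual recipe:

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"   # stand-in student; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

# One hypothetical teacher-generated example; real distillation used a far
# larger synthetic dataset of R1 outputs.
prompt = "What is 13 * 17? "
trace = "<think>13*17 = 13*10 + 13*7 = 130 + 91 = 221</think> 221"

batch = tok(prompt + trace, return_tensors="pt")
loss = student(**batch, labels=batch["input_ids"]).loss  # plain SFT loss
loss.backward()
opt.step()
opt.zero_grad()
</syntaxhighlight>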
DeepSeek-R1-Zero was trained exclusively using GRPO RL without SFT. Unlike previous versions, it used no model-based reward: all reward functions were rule-based, "mainly" of two types (other types were not specified), accuracy rewards and format rewards. The accuracy reward checked whether a boxed answer was correct (for math) or whether code passed its tests (for programming). The format reward checked whether the model placed its thinking trace within <think>...</think> tags.
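A minimal sketch of such rule-based rewards follows; the regular expressions and scoring are illustrative assumptions, not DeepSeek's published implementation:

<syntaxhighlight lang="python">
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the completion's final \\boxed{...} answer matches the reference."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if matches and matches[-1].strip() == gold_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the reasoning trace is wrapped in <think>...</think> tags."""
    return 1.0 if re.fullmatch(r"(?s)\s*<think>.*?</think>.*", completion) else 0.0

reply = "<think>6*7 = 42</think> The answer is \\boxed{42}."
print(accuracy_reward(reply, "42"), format_reward(reply))  # 1.0 1.0
</syntaxhighlight>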
On 28 May 2025, R1 was updated to version R1-0528. As of early July 2025, R2 was not yet released, as Liang Wenfeng was not yet satisfied with its performance. Most Chinese cloud providers of R1 used Nvidia H20 chips. As of August 2025, R2 remained unreleased; sources cited slow data labelling and chip problems. Specifically, DeepSeek was encouraged by authorities to adopt Huawei's Ascend chips for training, but they had stability issues, slower inter-chip connectivity, and inferior software. Consequently, DeepSeek opted to use Nvidia chips for training and Huawei chips for inference. It was also reported that the Cyberspace Administration of China requested that several large corporations stop buying Nvidia H20 chips and buy from domestic suppliers instead. With the release of R1 in January 2025, the DeepSeek team published a preprint on arXiv.

==Significance==