Liang Wenfeng's $294K DeepSeek-R1 RL Breakthrough Reached Nature

How Liang Wenfeng's DeepSeek-R1 used Group Relative Policy Optimization and pure reinforcement learning to produce emergent reasoning capabilities for $294,000 in training compute, and why the paper reached the cover of Nature.

Jun 30 at 9:00 AM · 5 min read

Figure: The multistage training pipeline of DeepSeek-R1, as published in the arXiv paper (January 2025). Figure from the DeepSeek-R1 paper (Daya Guo et al.), via Wikimedia Commons (CC BY 4.0).

A $294,000 training run with no human-labelled reasoning data produced DeepSeek-R1, the model whose paper subsequently reached the cover of Nature. Liang Wenfeng, co-founder and CEO of DeepSeek, is listed as the corresponding author. The paper, posted to arXiv in January 2025 under the title DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, demonstrated that a language model trained purely on reinforcement learning signals, without supervised fine-tuning on human-curated reasoning chains, could reach mathematical and coding reasoning on par with far costlier models.

Liang Wenfeng co-founded DeepSeek as the AI research arm of High-Flyer Capital, a Chinese quantitative hedge fund. DeepSeek's earlier model, DeepSeek-V3, was reported to have a $5.6M training budget and served as the base on which the R1 work was built. This piece focuses on R1's training method and on what made it significant enough for Nature's editors.

Training without a teacher: how GRPO replaced supervised fine-tuning

The standard large language model training recipe in 2024 followed three stages: pre-train a base model on large text corpora; fine-tune it on human-curated examples with step-by-step reasoning (supervised fine-tuning, or SFT); then apply reinforcement learning, typically Proximal Policy Optimization (PPO), to align outputs further. The SFT stage for reasoning required expensive chains of thought labelled by humans or distilled from a more capable teacher model.

DeepSeek-R1-Zero, the experimental variant described in the paper, removed the SFT stage entirely. Starting from DeepSeek-V3-Base, the team applied reinforcement learning directly using a custom algorithm, Group Relative Policy Optimization (GRPO). GRPO's structural advantage over PPO is that it does not require a separate critic (value) model of the same scale as the policy model. Instead, it samples a group of responses to each prompt and scores every response relative to the rest of the group, so the group's own average serves as the baseline. As the paper states, GRPO "directly estimates the baseline from the group scores," cutting the compute footprint of the RL stage substantially.

The reward signal itself was minimal: positive feedback for a correct final answer on math or coding problems, and for adherence to the required output format (thinking wrapped in tags, final answer in a separate tag). There was no partial credit, no reasoning-quality scoring, and no human feedback at any stage.
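The two ideas in this section, a rule-based reward and a baseline taken from the group itself, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the `<think>` and `<answer>` tag names, the equal reward weights, the exact-match answer check, and the function names are all hypothetical choices made for the example. The advantage is computed as each reward minus the group mean, divided by the group standard deviation.

```python
import re
from statistics import mean, pstdev

# Hypothetical tag names: the article only says that the thinking and the
# final answer are wrapped in separate tags.
RESPONSE_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score format adherence and final-answer correctness only."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0  # wrong format: no reward at all
    format_reward = 1.0  # assumed weight, chosen for illustration
    answer = match.group(2).strip()
    # Assumed check: exact string match against the reference answer.
    accuracy_reward = 1.0 if answer == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Use the group's own mean and spread as the baseline (no critic model)."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# One prompt, a group of four sampled responses (made-up examples).
group = [
    "<think>2 + 2 = 4</think><answer>4</answer>",
    "<think>2 + 2 = 5</think><answer>5</answer>",
    "The answer is 4",
    "<think>double 2</think> <answer>4</answer>",
]
rewards = [rule_based_reward(r, "4") for r in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, which is the signal the policy update then uses. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.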

The cost implication is direct. DeepSeek-V3's training was reported by 36kr to have cost approximately $5.6 million; the R1 training run cost $294,000, per the same reporting citing the DeepSeek team. The two figures are not like-for-like: the first covers pre-training the base model, and the second covers the reinforcement learning applied on top of it. Within that RL stage, the savings came from skipping the SFT step and eliminating the need for a matching-scale evaluator.
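As a rough check on the scale of these two figures, the arithmetic below computes their ratio and the R1 run's share of the combined spend. Treating the two numbers as additive is an assumption of this sketch, based on the article's statement that V3 served as the base for R1; the variable names are illustrative.

```python
V3_TRAINING_USD = 5_600_000  # approximate figure reported for DeepSeek-V3
R1_TRAINING_USD = 294_000    # figure reported for the R1 training run

ratio = V3_TRAINING_USD / R1_TRAINING_USD

# Assumption: R1 was trained on top of the V3 base, so the costs add up.
combined = V3_TRAINING_USD + R1_TRAINING_USD
rl_share = R1_TRAINING_USD / combined

print(f"V3 / R1 cost ratio: {ratio:.1f}x")
print(f"R1 run as share of combined spend: {rl_share:.1%}")
```

On these numbers the V3 figure is about 19 times the R1 figure, and the R1 run accounts for roughly 5% of the combined total.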

Figure (bar chart): DeepSeek training compute costs, V3 ($5.6M) versus R1 ($294K). Source: 36kr; DeepSeek-R1 paper.

The 'aha moment' and what Nature's editors recognised

The most notable result in the paper was not the cost reduction. It was the behaviour the model developed without being instructed to. DeepSeek-R1-Zero, during RL training, began allocating more reasoning tokens to harder problems and started explicitly noting within its chain of thought when an earlier step was incorrect and correcting it mid-chain. The paper documents the training checkpoint where this first occurs as an "aha moment." The four emergent behaviours, none of them specified in the reward function, were: self-reflection, self-verification, dynamic strategy adaptation, and active exploration of alternative solution approaches.

The significance of this result for Nature's editors was its bearing on a theoretical question that had been debated since 2023: whether reinforcement learning alone, applied to a base language model with no reasoning scaffolding, could induce the kind of structured self-correction that had previously required explicit chain-of-thought supervision. 36kr reported that the DeepSeek-R1 paper reached the cover of Nature, with Liang Wenfeng as the corresponding author; the team subsequently addressed public questions about reproducibility, per the same reporting.

Figure (doughnut chart): Four emergent behaviours identified in DeepSeek-R1-Zero during RL training: self-reflection, self-verification, dynamic strategy adaptation, and exploration. Equal segments reflect the paper's qualitative treatment, not a severity ranking. Source: DeepSeek-R1 arXiv paper.

Distilled models and the open-weight downstream effect

Alongside the R1 model, the January 2025 release included a suite of smaller open-weight models built via knowledge distillation, including R1-Distill-Qwen-7B, R1-Distill-Qwen-14B, R1-Distill-Qwen-32B, and R1-Distill-Llama-70B. These were not trained with RL from scratch; instead, R1's reasoning traces were used as supervised training data for smaller base models, enabling them to carry the chain-of-thought format into more deployable sizes.
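The distillation step described above amounts to turning the larger model's reasoning traces into ordinary supervised fine-tuning examples for a smaller model. The sketch below shows one plausible way to assemble such a dataset. Everything in it is hypothetical: the `TeacherTrace` type, the `<think>` and `<answer>` tag names, the prompt/completion record layout, and the step that discards traces with a wrong final answer are assumptions made for illustration, not details taken from the article.

```python
from dataclasses import dataclass


@dataclass
class TeacherTrace:
    """One response sampled from the larger (teacher) model."""
    prompt: str
    reasoning: str
    answer: str


def build_sft_records(
    traces: list[TeacherTrace], references: dict[str, str]
) -> list[dict[str, str]]:
    """Format teacher traces as prompt/completion pairs for a student model."""
    records = []
    for trace in traces:
        expected = references.get(trace.prompt)
        # Assumed filter: keep only traces whose final answer is correct.
        if expected is None or trace.answer.strip() != expected.strip():
            continue
        completion = (
            f"<think>{trace.reasoning}</think><answer>{trace.answer}</answer>"
        )
        records.append({"prompt": trace.prompt, "completion": completion})
    return records


# Made-up example: two teacher samples for the same prompt, one of them wrong.
traces = [
    TeacherTrace("What is 3 * 4?", "3 * 4 means 4 + 4 + 4, which is 12", "12"),
    TeacherTrace("What is 3 * 4?", "3 + 4 is 7", "7"),
]
records = build_sft_records(traces, {"What is 3 * 4?": "12"})
print(records)
```

The student model is then fine-tuned on these records with a standard next-token objective, so it learns the reasoning format by imitation and needs no reward signal of its own.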

The open-weight release continued DeepSeek's practice with its earlier models. Within weeks of publication, GRPO was incorporated into open-source training frameworks and researchers at other labs began publishing variants of the RL-only training approach. The contrast with the closed-lab approach championed by labs like Safe Superintelligence drew attention in the research community, where the rapid replication of DeepSeek-R1's approach served as a practical test of the paper's claims.

Figure (horizontal bar chart): DeepSeek-R1 distilled open-weight models by parameter count (7B, 14B, 32B, and 70B), released January 2025. Source: DeepSeek-R1 arXiv paper.

What it means

DeepSeek-R1 is notable for two results that are easy to conflate. The first is cost: a $294,000 training run producing reasoning capabilities that matched models costing far more. The second is the scientific result: demonstrating that pure RL, without curated reasoning data, produces self-correcting behaviour in language models, and that GRPO offers a computationally lighter path than PPO for that RL stage. The first fact made headlines in early 2025; the second is what Nature published. Both represent Liang Wenfeng's specific contribution as the corresponding author and technical leader of the project. The subsequent replication by other labs, and the incorporation of GRPO into open-source training pipelines, are the standard measure of a research result that holds.

© 2026 StartupHub.ai. All rights reserved.