The recent release of OpenAI's GPT-OSS, its first open-weights model since GPT-2, has intensified the discourse around large language model development. In a recent Y Combinator video, Visiting Partner Ankit Gupta offered a sharp analysis, comparing GPT-OSS with established open-source leaders like Alibaba's Qwen3 and DeepSeek V3. His commentary delved into the distinct architectural philosophies, training methodologies, and post-training techniques employed by these powerhouses, revealing a fascinating landscape where varied approaches often converge on surprisingly similar performance metrics.
OpenAI’s GPT-OSS arrives as an autoregressive Mixture-of-Experts (MoE) transformer, available in 120-billion and 20-billion-parameter versions. Crucially, "each token activates the top four experts, meaning only a portion of the total parameters are used at any given time," optimizing for efficient inference without sacrificing scale. It incorporates modern features such as Grouped Query Attention (GQA), SwiGLU activations, and Rotary Positional Embeddings (RoPE), alongside RMSNorm for stable training. A standout capability is its 131,072-token context window, achieved by applying YaRN scaling during pre-training, so the model learns long-context behavior from the outset rather than acquiring it through later fine-tuning. The model was trained on a vast dataset focused on STEM, coding, and general knowledge, with safety filters informed by GPT-4o. Released only in a quantized format (MXFP4), GPT-OSS is designed for deployment on consumer-grade hardware, though the absence of an unquantized version limits exploration of its raw, unaligned potential.
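The top-k routing described above can be sketched in a few lines. This is a toy illustration of the general technique, not GPT-OSS internals: the dimensions, the single-matrix "experts," and the router are all simplifying assumptions.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=4):
    """Route one token through the top-k experts of a toy MoE layer.

    x: (d,) token hidden state
    gate_w: (d, n_experts) router weights
    expert_ws: list of (d, d) expert matrices (real experts are full FFNs)
    """
    logits = x @ gate_w                        # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the k best experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                       # softmax over the selected experts only
    # Weighted sum of just k expert outputs; the other experts stay inactive.
    return sum(p * (x @ expert_ws[i]) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
expert_ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, expert_ws)  # only 4 of 16 experts touched
```

The payoff is visible in the loop: compute scales with `top_k`, while total capacity scales with `n_experts`.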
Alibaba Cloud's Qwen3, launched to considerable anticipation, offers a diverse family of models, including both dense and MoE variants. These models range from a compact 0.6 billion parameters to a colossal 235 billion. Architecturally, Qwen3 dense models share similarities with GPT-OSS, leveraging GQA, SwiGLU, RoPE, and pre-normalization. A key innovation in Qwen3 is its QK-Norm, a dynamic rescaling of query and key vectors to maintain stable attention scales, replacing static QKV-bias. "The pre-training process for Qwen3 utilizes a large-scale dataset consisting of approximately 36 trillion tokens," twice that of its predecessor, Qwen2.5. This intensive, three-stage pre-training regimen culminates in a long-context stage where RoPE's base frequency is adjusted via the ABF technique and YaRN is applied at inference time to achieve a 128,000-token context without additional retraining. Post-training is a sophisticated four-step pipeline, including a "Thinking Mode Fusion" that integrates reasoning and non-reasoning capabilities into a single model, allowing users to toggle between modes.
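The QK-Norm idea can be sketched as RMS-normalizing queries and keys before the dot product, so attention logits stay on a stable scale regardless of vector magnitude. The shapes and the per-dimension gain here are illustrative assumptions, not Qwen3's exact parameterization.

```python
import numpy as np

def qk_norm(v, gamma, eps=1e-6):
    """RMS-normalize a query or key vector, then rescale by a learned gain."""
    rms = np.sqrt(np.mean(v * v) + eps)
    return gamma * v / rms

rng = np.random.default_rng(1)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)
gamma = np.ones(d)  # learned per-dimension gain; initialized to 1 here

# Without normalization, the attention logit grows with vector magnitude:
logit_raw = (100 * q) @ (100 * k) / np.sqrt(d)
# With QK-Norm, the logit is invariant to rescaling of q and k:
logit_norm = qk_norm(100 * q, gamma) @ qk_norm(100 * k, gamma) / np.sqrt(d)
logit_base = qk_norm(q, gamma) @ qk_norm(k, gamma) / np.sqrt(d)
```

`logit_norm` and `logit_base` agree (up to the `eps` term), which is the stability property that lets QK-Norm replace the static QKV-bias.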
DeepSeek V3, released months before GPT-OSS, presented a formidable challenge to the open-source landscape. A massive MoE model with 671 billion total parameters, of which 37 billion are active per token, DeepSeek V3 prioritizes both efficiency and capability. "At 671 billion parameters, it's a massive general-purpose base model, designed for efficiency as much as capability," Gupta highlighted. Its V3.1 update further refines this, introducing a hybrid thinking mode and smarter tool-calling. A significant architectural divergence is DeepSeek's Multi-head Latent Attention (MLA), which compresses keys and values into a smaller latent space before caching, then decompresses them during inference. This more complex approach yields superior memory savings and performance in long-context scenarios compared to GQA.
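The compress-then-decompress flow of MLA can be sketched for a single head. Everything here is a toy assumption (dimensions, one shared latent, single-head softmax attention); the point is only that the cache stores the small latent `c`, not full keys and values.

```python
import numpy as np

def mla_attend(h, W_down, W_up_k, W_up_v, q):
    """Toy single-head Multi-head Latent Attention step.

    Instead of caching full K/V, cache a small latent c = h @ W_down,
    then reconstruct keys and values from it at attention time.
    """
    c = h @ W_down               # (seq, d_latent): this is all that gets cached
    k = c @ W_up_k               # decompress to keys   (seq, d_head)
    v = c @ W_up_v               # decompress to values (seq, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()     # softmax over past positions
    return weights @ v

rng = np.random.default_rng(2)
seq, d_model, d_latent, d_head = 16, 64, 8, 32
h = rng.standard_normal((seq, d_model))
W_down = rng.standard_normal((d_model, d_latent))
W_up_k = rng.standard_normal((d_latent, d_head))
W_up_v = rng.standard_normal((d_latent, d_head))
q = rng.standard_normal(d_head)
out = mla_attend(h, W_down, W_up_k, W_up_v, q)
# Cache cost per token: d_latent floats, vs 2 * d_head for plain KV caching.
```

With these toy sizes the cache shrinks from 64 floats per token to 8, which is the long-context memory saving the paragraph describes.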
One of the most compelling insights from this comparison is the striking convergence of performance despite markedly different underlying technical strategies. "This is quite surprising. You'd expect that very different training methods would lead to very different results," Gupta observed. For instance, while all three models utilize YaRN for context extension, their implementations differ: GPT-OSS integrates it from pre-training, DeepSeek fine-tunes in stages, and Qwen applies it at inference. Similarly, the heavy reliance on reinforcement learning (RL) across all major models is notable, with some efforts requiring surprisingly small datasets, such as Qwen's use of roughly 4,000 query-verifier pairs for reasoning RL. This underscores the empirical nature of much current deep learning research, where successful techniques are often discovered through extensive experimentation rather than purely theoretical derivation.
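The common ingredient in all three context-extension recipes is adjusting RoPE's rotation frequencies. A minimal sketch of plain RoPE with an adjustable base shows the simpler ABF-style knob mentioned for Qwen3 (full YaRN rescales frequency bands non-uniformly, which is not reproduced here); the vector sizes and base values are illustrative assumptions.

```python
import numpy as np

def rope_freqs(d_head, base=10_000.0):
    """Per-pair rotation frequencies for rotary position embeddings."""
    return base ** (-np.arange(0, d_head, 2) / d_head)

def apply_rope(x, pos, base=10_000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    freqs = rope_freqs(x.shape[-1], base)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Raising the base slows every frequency band's rotation, so positions far
# beyond the original training length remain distinguishable to the model.
x = np.ones(8)
rotated_short = apply_rope(x, pos=100_000, base=10_000.0)
rotated_long = apply_rope(x, pos=100_000, base=1_000_000.0)
```

Whether this adjustment happens during pre-training (GPT-OSS), in staged fine-tuning (DeepSeek), or purely at inference (Qwen) is exactly the implementation divergence the paragraph describes.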
Ultimately, while the open-source nature of these models allows for inspection of their weights and architectural details, a significant and often opaque "moat" remains in the proprietary data engineering. The colossal effort in curating, filtering, and augmenting trillions of tokens, as well as the intricate post-training pipelines, represents a formidable barrier to replication. "It's very difficult to replicate what they're releasing," Gupta concluded, emphasizing that the true competitive advantage often lies in these less visible, labor-intensive aspects of model development.
