The AMD Uprising and the Art of AI Engineering

Nov 24, 2025

Quentin Anthony, Head of Model Training at Zyphra and an advisor at EleutherAI, recently joined Alessio Fanelli, Founder of Kernel Labs, on the Latent Space podcast to dissect the evolving landscape of AI hardware and developer productivity. The conversation offered a deep dive into Zyphra's bold move to AMD’s MI300X GPUs and Anthony’s unique perspective on leveraging AI tools for coding, providing sharp insights for founders, VCs, and AI professionals navigating the industry's complex technical frontiers.

Zyphra, a full-stack model company focused on building foundation models for edge deployment, has made a significant strategic pivot: migrating its entire training cluster to AMD. The decision stems from a conviction that the AMD ecosystem offers a "really compelling training cluster" at a cost that meaningfully improves their bottom line. Anthony's journey with AMD began out of necessity during his PhD work on the Frontier supercomputer at Oak Ridge National Lab, an AMD MI250X-based system. This early exposure to AMD's hardware, even when its software stack lagged, provided invaluable experience in porting complex operations like Flash Attention.

The current-generation MI300X GPUs, Anthony asserts, represent a pivotal shift. With 192GB of VRAM and superior memory bandwidth, these accelerators can outperform NVIDIA H100s on specific workloads. For tasks not bottlenecked by FP8 dense compute, such as certain Mixture of Experts (MoE) models, AMD's offerings provide a distinct advantage. The ample VRAM and high bandwidth also minimize the need for intricate parallelism strategies, streamlining development and boosting efficiency.
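The VRAM argument can be made concrete with back-of-envelope arithmetic. The sketch below (not from the episode; byte counts are the standard mixed-precision Adam layout, and it ignores activations and framework overhead) shows why a model that needs sharding across 80GB cards can sit on a single 192GB one:

```python
import math

# Back-of-envelope memory for mixed-precision Adam training state.
def training_bytes_per_param(
    weight_bytes=2,   # bf16 weights
    grad_bytes=2,     # bf16 gradients
    master_bytes=4,   # fp32 master copy of the weights
    optim_bytes=8,    # fp32 Adam first and second moments
):
    return weight_bytes + grad_bytes + master_bytes + optim_bytes

def min_gpus(params_billion, vram_gb):
    """Smallest GPU count whose combined VRAM holds the training state."""
    total_gb = params_billion * training_bytes_per_param()  # 1e9 params * bytes / 1e9
    return math.ceil(total_gb / vram_gb)

# A 7B model carries ~112 GB of optimizer + weight state: it fits on one
# 192 GB MI300X, but needs at least two 80 GB H100s before activations.
print(min_gpus(7, 192))  # 1
print(min_gpus(7, 80))   # 2
```

Every GPU a model does *not* have to be sharded across is one less axis of tensor or pipeline parallelism to implement and debug, which is the streamlining Anthony describes.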

While AMD's hardware has undeniably caught up, the software ecosystem has been a historical hurdle. Anthony notes, however, that the software stack for the MI300X has also matured significantly. "Not a lot of people have sort of discovered that they've caught up on software and we're kind of capitalizing on that," he states, highlighting a strategic arbitrage opportunity. This parity in hardware and improving software support enables companies like Zyphra to extract substantial value, especially when coupled with a deep understanding of GPU architecture.

Anthony's approach to kernel development is rooted in a "bottom-up" philosophy. Rather than relying on high-level frameworks like Triton or Torch Compile, which he describes as being "beholden to the compiler," he often dives directly into ROCm or even GPU assembly. This granular control allows him to dictate precisely where tensors are materialized, optimizing for specific hardware properties. "The hardware of MI300X has these properties, and I want my algorithm to pull these properties out," he explains, underscoring the importance of tailoring code to exploit the underlying silicon.
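One standard way to reason about "pulling hardware properties out" is a roofline estimate. The sketch below uses approximate public MI300X peak figures as assumptions (not measurements from the episode) to show when a kernel is bandwidth-bound versus compute-bound:

```python
# Roofline-style estimate: is a kernel memory-bound or compute-bound?
# Peak figures are approximate public MI300X specs, used here as assumptions.

PEAK_FLOPS = 1.3e15  # ~1.3 PFLOPS dense BF16
PEAK_BW    = 5.3e12  # ~5.3 TB/s HBM bandwidth

def attainable_flops(arithmetic_intensity):
    """arithmetic_intensity: FLOPs performed per byte moved to/from HBM."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

# Ridge point: intensity needed before compute, not bandwidth, is the limit.
ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point: {ridge:.0f} FLOPs/byte")

# An elementwise op (~0.25 FLOPs/byte) reaches only a sliver of peak compute,
# so the win comes from controlling where tensors materialize and fusing
# memory traffic -- exactly the knob hand-written kernels expose.
print(attainable_flops(0.25) / PEAK_FLOPS)
```

A compiler-driven stack makes these placement decisions for you; dropping to ROCm or assembly is what lets a developer make them deliberately.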

This meticulous, hardware-aware approach is crucial for pushing performance boundaries, yet it reveals a current limitation in AI's role in low-level code generation. While AI coding tools excel at generating boilerplate code or assisting with high-level logic, their utility diminishes significantly when it comes to crafting highly optimized GPU kernels. Anthony observes that "if it's not dead basic, it's bad really fast" when models attempt to generate complex kernel code. The scarcity of public, high-quality GPU kernel datasets for training, coupled with the inherent difficulty in validating the correctness and performance of generated kernels, makes AI a less reliable partner for this specialized domain.
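The validation difficulty is easy to illustrate. The harness below is a toy stand-in (pure Python, hypothetical candidate implementation, not anything from the episode): a generated kernel can pass a naive random-input check while still being wrong on inputs the check never exercises, and no numerical check says anything about performance:

```python
import math
import random

def reference_softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def candidate_softmax(xs):
    # Hypothetical generated implementation under test: naive, no max-subtraction.
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def check(f, trials=100, tol=1e-6):
    """Compare f against the reference on random small-magnitude inputs."""
    random.seed(0)
    for _ in range(trials):
        xs = [random.uniform(-5, 5) for _ in range(16)]
        if any(abs(a - b) > tol for a, b in zip(reference_softmax(xs), f(xs))):
            return False
    return True

print(check(candidate_softmax))  # True: passes on small inputs...
try:
    candidate_softmax([800.0, 0.0])
except OverflowError:
    print("...but overflows on large ones")
```

Real GPU kernels add race conditions, dtype edge cases, and occupancy regressions on top of this, which is why generated kernels are so hard to trust.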

Reflecting on the controversial METR software engineering productivity study, which found developers felt faster but were actually slower with AI tools, Anthony shared his unique experience as one of the few who demonstrated a measurable speedup. His success, he believes, stems from a disciplined workflow that avoids the "slot machine effect" of endlessly prompting models. Instead, he prioritizes context management and prefers direct API access over integrated tools like Cursor, maintaining full control over the model's input and output. This allows him to strategically apply AI where it offers clear advantages, such as generating documentation or initial code skeletons, while reserving the critical, performance-sensitive kernel work for human expertise, often paired with an extended thinking model like GPT-5.
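The "context management" discipline can be sketched simply: instead of letting an IDE integration decide what the model sees, curate the snippets yourself under an explicit budget. The code below is illustrative only (the file names and the 4-characters-per-token estimate are assumptions, not Anthony's actual tooling):

```python
def estimate_tokens(text):
    return len(text) // 4  # rough heuristic: ~4 characters per token

def build_context(snippets, budget_tokens):
    """snippets: list of (label, text) pairs, ordered most-relevant first."""
    chosen, used = [], 0
    for label, text in snippets:
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip anything that would blow the budget
        chosen.append(f"### {label}\n{text}")
        used += cost
    return "\n\n".join(chosen), used

# Placeholder contents; in practice these would be hand-picked source files.
snippets = [
    ("kernel.hip", "x" * 400),
    ("notes.md", "y" * 4000),    # too big for the budget, gets dropped
    ("api_docs.txt", "z" * 200),
]
prompt, used = build_context(snippets, budget_tokens=200)
print(used)  # tokens consumed by the assembled prompt
```

The point is control: the model's input is exactly what the developer chose, nothing more, which avoids the "slot machine" loop of re-rolling on an opaque context.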

Looking ahead, Anthony sees a trend toward co-designing model architectures with specific hardware for inference efficiency. Zyphra, for instance, develops a spectrum of models, from compact 1.2B models for resource-constrained edge devices to larger 7B models for cloud deployment. This tiered approach lets smaller, on-device models handle common queries while escalating more complex tasks to larger, cloud-based models when necessary. The strategy optimizes for cost and latency, ensuring that computational resources are allocated efficiently across the diverse landscape of AI applications.
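The tiered pattern can be sketched as a simple confidence-gated router. Everything below is a stub for illustration (the models, the word-count confidence heuristic, and the threshold are assumptions, not Zyphra's actual system):

```python
def edge_model(query):
    # Stub: pretend short queries get a confident on-device answer.
    confidence = 0.9 if len(query.split()) <= 6 else 0.3
    return f"edge-answer({query})", confidence

def cloud_model(query):
    # Stub for the larger cloud-hosted model.
    return f"cloud-answer({query})"

def route(query, threshold=0.7):
    """Answer on-device when the small model is confident; else escalate."""
    answer, confidence = edge_model(query)
    if confidence >= threshold:
        return answer, "edge"
    return cloud_model(query), "cloud"

print(route("what time is it"))  # handled on-device
print(route("compare the tradeoffs across these three proposed cache designs"))  # escalated
```

The cost and latency win comes from the common case: most queries never leave the device, and only the hard tail pays for a round trip to the big model.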