The scaling wall facing large language models—specifically the prohibitive memory bandwidth requirements of Transformer architectures—is forcing a fundamental shift in neural network design. IBM Fellow Aaron Baughman recently detailed how State Space Models (SSMs) are not just an alternative, but a necessary evolution, promising a future of faster, more efficient, and more deployable generative AI. Baughman’s commentary serves as a critical technical briefing for founders and VCs seeking leverage in a market still constrained by high-cost, memory-intensive inference requirements.
Baughman spoke about the foundational mechanics of SSMs, positioning them as neural building blocks designed to handle sequential data—whether that data is text, speech, or time series—by efficiently managing memory over time. He explained that an SSM functions through a dual-component mathematical structure: the State Equation and the Observation Equation. The State Equation models how a hidden state evolves, essentially determining what the model remembers from the past sequence, while the Observation Equation maps that hidden state to an observable output, which, in the context of generative AI, is the next token in the sequence. This structure allows the model to continuously update its understanding of the world, or the context of the prompt, in real-time.
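The dual-equation structure Baughman describes can be sketched as a simple linear recurrence. The dimensions, matrices, and inputs below are illustrative stand-ins, not any production SSM parameterization:

```python
import numpy as np

# Toy discrete state space model. All sizes and matrices here are
# made up for illustration.
rng = np.random.default_rng(0)
d_state, d_in, d_out, seq_len = 4, 2, 3, 5

A = 0.9 * np.eye(d_state)             # state transition: how memory evolves/decays
B = rng.normal(size=(d_state, d_in))  # how new input enters the state
C = rng.normal(size=(d_out, d_state)) # maps hidden state to the observable output

x = np.zeros(d_state)                 # hidden state: the model's compact memory
outputs = []
for t in range(seq_len):
    u = rng.normal(size=d_in)         # stand-in for the token/input at step t
    x = A @ x + B @ u                 # state equation: update what is remembered
    y = C @ x                         # observation equation: emit the output
    outputs.append(y)

print(np.array(outputs).shape)        # one output per step, constant-size state
```

Note that the memory footprint is fixed at `d_state` regardless of how long the sequence runs, which is the property the rest of the discussion builds on.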
The core insight here is that SSMs circumvent the quadratic scaling problem inherent in the self-attention mechanism of Transformers. While a traditional Transformer must look back at and process every previous token in a sequence—leading to a computational cost that scales quadratically, $O(N^2)$, with sequence length ($N$)—SSMs maintain a compact, implicit memory state that scales linearly, $O(N)$. This difference is not merely academic; it is the key to unlocking true scalability and efficiency. As Baughman put it, "AI Transformers, they remember everything. Whereas a State Space Model remembers only what really matters."
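A rough operation count makes the scaling gap concrete. In this simplified model of generation, each new attention step touches every prior token, while an SSM step does constant work on its fixed-size state:

```python
def attention_cost(n):
    # Each new token attends to all tokens so far (including itself),
    # so total work is 1 + 2 + ... + n, roughly n^2 / 2.
    return sum(t for t in range(1, n + 1))

def ssm_cost(n):
    # Constant work per token: one update of a fixed-size state.
    return n

for n in (1_000, 10_000):
    print(n, attention_cost(n), ssm_cost(n))
```

Scaling the sequence 10x multiplies the attention total by roughly 100x but the SSM total by only 10x, which is the $O(N^2)$-versus-$O(N)$ difference in miniature.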
This efficiency directly addresses the most pressing hardware bottleneck in modern AI: memory bandwidth. Baughman highlighted the practical constraints of deploying massive Transformer models. For instance, an 8-billion parameter model barely fits on an H100 GPU with 80GB of memory, largely because the Key-Value (KV) cache required for attention mechanisms consumes vast amounts of memory, often taking up 90% of the capacity before the model can even perform its primary computations. When GPU utilization is low but memory bandwidth is maxed out, the system hits a data movement problem. SSMs mitigate this by using a compact memory structure, drastically reducing the demand on memory bandwidth and allowing for much faster inference with lower latency.
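A back-of-the-envelope calculation shows why the KV cache dominates. The model dimensions below are assumptions typical of an 8B-class Transformer in half precision, not the specifications of any particular model:

```python
# Illustrative KV cache sizing; all dimensions are assumed, not sourced.
layers     = 32       # decoder layers
kv_heads   = 32       # heads storing K/V (no grouped-query sharing assumed)
head_dim   = 128      # dimension per head
bytes_elem = 2        # fp16/bf16
context    = 128_000  # tokens held in the cache

per_token = 2 * layers * kv_heads * head_dim * bytes_elem  # K and V per token
total_gb  = per_token * context / 1e9
print(f"{per_token} bytes/token, {total_gb:.1f} GB at {context:,} tokens")
```

Under these assumptions the cache alone far exceeds an 80GB H100 at long context, before the model weights are even counted—an SSM's fixed-size state sidesteps this entirely.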
The evolution of SSMs has focused on optimizing this memory structure. Early breakthroughs like the Structured State Space Model (S4) taught AI how to remember efficiently. More recently, the Selective State Space Model (Mamba) introduced a crucial innovation: selectivity. Mamba dynamically updates its state based on the input, effectively allowing it to focus on relevant tokens and ignore unimportant ones—a form of attention without the quadratic cost. This capability is game-changing because it provides the flexible context handling previously dominated by Transformers, but with the speed and efficiency of SSMs. Baughman summarized the architectural leap: "S4 taught AI how to remember efficiently, while the Mamba family of models taught it how to remember intelligently."
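The selectivity idea can be sketched with an input-dependent gate on the state update: salient inputs overwrite memory while bland ones barely touch it. This is a toy illustration of the concept, not Mamba's actual parameterization:

```python
import numpy as np

# Toy "selective" state update. The gate weights and inputs are
# invented for the sketch.
rng = np.random.default_rng(1)
d = 4
w_gate = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.zeros(d)
for u in [np.zeros(d), np.ones(d) * 3.0]:  # a "bland" then a "salient" input
    g = sigmoid(w_gate @ u)                # input-dependent gate in (0, 1)
    x = (1 - g) * x + g * u                # g decides how much of u enters memory
```

Because the gate is computed from the input itself, the model decides per token what to remember and what to discard—attention-like behavior at recurrent cost.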
The implications for the startup ecosystem and deployment strategy are profound. Models built on SSM architectures, particularly the Mamba family, are inherently suited for deployment on edge devices and consumer-grade hardware. Baughman noted that these models are often much smaller—ranging from 350 million to 1 billion parameters—yet achieve performance competitive with much larger Transformer models, even running efficiently on a CPU or a laptop. This shifts the economic calculus away from needing massive, expensive GPU clusters for inference and towards more decentralized, cost-effective deployment strategies.
Leading players are already integrating these insights. IBM’s own Granite V4 models, for example, are built on a hybrid architecture combining SSMs and Transformers. This hybrid approach leverages the best of both worlds: the robust performance and attention capabilities of Transformers for certain tasks, coupled with the speed and memory efficiency of SSMs. This convergence suggests that the future of AI model architecture will be defined not by a single paradigm, but by smart, hybrid designs optimized for specific constraints. The move toward SSMs is less about replacing Transformers entirely and more about overcoming their fundamental scaling limitations to create intelligent systems that are truly efficient and broadly deployable.