Artificial Intelligence

The 2 Gigawatt AI Inference Problem

AI inference now draws 2 gigawatts of power at scale. From agentic workflows to multimodal AI, serving costs compound—but new platforms trim watts and dollars per token.

It’s come time to read the meter.

Every answer from an AI system draws power, time, and money. When a feature goes viral, those draws add up like clock ticks on a utility meter. Inference—the act of running a trained model to produce tokens, images, audio, or video—is no longer a rounding error. It is the day‑to‑day business of AI: what users feel as speed, what operators experience as throughput and tail latency, and what finance sees as a recurring bill.

OpenAI offers a clear sense of scale. ChatGPT usage sits in the hundreds of millions of users per month, with figures of 700–800 million cited publicly. Its APIs have been described as processing about 8 billion tokens per minute across endpoints. Each token is a small unit of work: memory moves, matrix multiplies, cache reads, and network hops. At this altitude, shaving 20–50 milliseconds from the decode loop or reducing compute per token by 10–20 percent can be the difference between a product that feels instant and one that makes people wait.
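
To get a feel for the scale, one can pair the two public figures above with the 2-gigawatt framing of this piece. The numbers describe different fleets and contexts, so this is strictly back-of-the-envelope arithmetic for intuition, not a measurement:

```python
# Hypothetical pairing of two public figures (they describe different
# systems, so treat this purely as scale intuition, not a measurement).
POWER_WATTS = 2e9           # a 2 GW serving surface
TOKENS_PER_MINUTE = 8e9     # cited API throughput

tokens_per_second = TOKENS_PER_MINUTE / 60
joules_per_token = POWER_WATTS / tokens_per_second

print(f"{tokens_per_second:.2e} tokens/s")    # 1.33e+08 tokens/s
print(f"{joules_per_token:.1f} J per token")  # 15.0 J per token
```

On these assumptions, every percentage point shaved off compute per token is a directly measurable slice of a gigawatt-scale bill.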

What inference is, and why it sped up

First, a quick grounding: Training is about learning; inference is about doing.

Inference happens right now, for this user, for this request. Three constraints govern the experience and the economics: latency (how long someone waits), throughput and its tail behavior (how many people you can serve at once and how predictable the slowest turns are), and unit cost (dollars per request, often summarized as dollars per million tokens). The serving playbook pushes on those constraints in four practical ways: reuse work already done, avoid work that does not need to be done, place the work closer to the user, and keep accelerators busy without letting the slowest requests stretch the tail.

In plain terms, the playbook has four parts. First, model compression: use quantization (INT8/INT4, FP8/NF4) so weights and activations take fewer bits, fit in memory, and run faster. Second, adaptive compute: route easy questions to small models and escalate only when needed; inside a large model, use mixture‑of‑experts so only a subset of “experts” activates per token rather than the entire network. Third, decoding and attention efficiency: use speculative decoding so a small “drafter” proposes a short run of tokens and the target model verifies them in one pass; maintain a KV cache so the model does not recompute the entire history at every step; adopt attention kernels that move fewer bytes. Fourth, system and hardware optimization: employ iteration or continuous batching so new requests can join in‑flight batches between decode steps; use better kernels to reduce memory traffic; and place latency‑sensitive models near users.

Glossary of Terms

What is KV caching?

Key-Value caching saves the intermediate attention states from previous tokens so the model doesn’t need to recompute them each time it generates a new word. By reusing these stored values, responses flow faster and cost less energy, especially for long prompts or multi-turn conversations.
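
A minimal sketch of the idea in plain Python, using toy single-head attention (the 2-d projections and token vectors are invented for illustration): each decode step projects keys and values for the new token only and appends them to the cache, rather than recomputing the whole history.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(query, token_vec, kv_cache, key_proj, value_proj):
    """One decode step with a KV cache: project K and V for the NEW token
    only, append them, then attend over the whole cached history. Without
    the cache, K/V for every prior token would be recomputed each step."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    k = [dot(token_vec, row) for row in key_proj]
    v = [dot(token_vec, row) for row in value_proj]
    kv_cache.append((k, v))
    weights = softmax([dot(query, key) for key, _ in kv_cache])
    return [sum(w * val[i] for w, (_, val) in zip(weights, kv_cache))
            for i in range(len(v))]

identity = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-d projections
cache = []
for tok in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    out = decode_step([1.0, 0.0], tok, cache, identity, identity)
print(len(cache))  # 3: one cached (K, V) pair per generated token
```

The cost per step stays proportional to one new token plus a read over the cache, which is exactly why long prompts and multi-turn chats benefit most.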

What is quantization?

Quantization reduces the numerical precision of a model’s weights and activations—often from 16-bit to 8-bit or lower—so calculations run faster and memory usage drops sharply. Done carefully, it keeps accuracy nearly identical while improving throughput and cutting power draw.
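
A minimal sketch of symmetric INT8 round-trip quantization. Production stacks use per-channel scales and calibration data, but the core mapping (and its bounded error) looks like this:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats onto the integer
    range [-127, 127] with one scale factor. Real systems use per-channel
    scales plus calibration, but the round-trip is the same idea."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.42, -1.3, 0.07, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                     # [41, -127, 7, 88] -- 8 bits each instead of 32
print(max_err <= scale / 2)  # True: error bounded by half a quantization step
```

Storing 8-bit integers instead of 32-bit floats cuts weight memory 4x, which is where the throughput and power wins come from.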

What is mixture-of-experts (MoE)?

Mixture-of-experts models divide a large network into many smaller “experts.” For any given token, only a few experts activate, saving compute while preserving quality. It’s a way to make giant models behave efficiently at inference time.
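
A toy top-k router sketches the mechanism. The “experts” here are stand-in functions and the gate scores are hard-coded; in a real MoE layer both are learned:

```python
import math

def top_k_route(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their gate
    weights; every other expert stays idle for this token."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = {i: math.exp(gate_scores[i]) for i in ranked}
    total = sum(exps.values())
    return {i: exps[i] / total for i in ranked}

def moe_forward(x, experts, gate_scores, k=2):
    """Mix only the selected experts' outputs; the other experts' compute
    is skipped entirely."""
    weights = top_k_route(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]  # 4 toy experts
y = moe_forward(10.0, experts, gate_scores=[0.1, 3.0, 2.0, -1.0], k=2)
print(y)  # blend of experts 1 and 2 only (~22.69); 2 of 4 experts ran
```

With k fixed, compute per token stays roughly constant no matter how many experts the full model holds, which is the efficiency claim in a nutshell.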

What is speculative decoding?

Speculative decoding lets a small “draft” model guess several upcoming tokens that a larger model then verifies in parallel. If most guesses are right, the system leaps ahead—reducing the waiting time between a user’s input and the model’s full answer.
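
A toy draft-and-verify loop with deterministic stand-in models. Production systems verify the whole drafted run in one batched target pass and use probabilistic acceptance; this sketch simplifies verification to exact match against the target's greedy choice:

```python
def speculative_decode(prefix, draft_model, target_model, lookahead=4):
    """Draft model proposes `lookahead` tokens; the target keeps the longest
    agreeing prefix plus one corrected token at the first mismatch."""
    proposed, ctx = [], list(prefix)
    for _ in range(lookahead):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        want = target_model(ctx)      # in practice: one batched target pass
        if tok == want:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(want)     # take the target's token and stop
            break
    return accepted

# Hypothetical toy models: the draft agrees with the target early, then diverges.
target = lambda ctx: len(ctx) % 3
draft  = lambda ctx: len(ctx) % 3 if len(ctx) < 5 else 0
out = speculative_decode([7, 7, 7], draft, target, lookahead=4)
print(out)  # [0, 1, 2]: two drafted tokens accepted plus one correction
```

Three tokens are committed for roughly one verification pass of target-model latency, which is where the speedup comes from when the drafter guesses well.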

What is adaptive compute?

Adaptive compute means the system adjusts how much processing power each query receives. Simple prompts take a light path through smaller or shallower networks; complex ones trigger heavier routes. It keeps latency low and budgets predictable.

What is model compression?

Model compression covers pruning, distillation, and quantization techniques that shrink model size and speed up inference. The idea is to keep almost the same intelligence in fewer parameters so deployment is cheaper and fits on smaller hardware.

What is batching and continuous batching?

Batching groups multiple user requests together so a GPU can process them in one pass. Continuous batching takes it further, inserting new requests into ongoing computation streams. Both maximize GPU utilization and reduce idle cycles, translating into lower cost per token.
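
A small simulation makes the difference concrete: requests join the batch between decode steps as others finish, so the batch stays full instead of draining at the pace of its longest member. Request lengths here are invented:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Each step decodes one token for every active request; finished
    requests leave and queued ones join between steps.
    `requests` maps request id -> number of tokens it needs."""
    queue = deque(requests.items())
    active, steps, done = {}, 0, []
    while queue or active:
        while queue and len(active) < max_batch:   # admit between steps
            rid, need = queue.popleft()
            active[rid] = need
        steps += 1                                  # one decode step for the batch
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                done.append(rid)
                del active[rid]
    return steps, done

steps, order = continuous_batching({"a": 2, "b": 6, "c": 2, "d": 2, "e": 2},
                                   max_batch=4)
print(steps, order)  # 6 ['a', 'c', 'd', 'e', 'b'] (static batches of 4 need 8)
```

The same five requests served in static batches of four would take 8 steps, since the first batch drains at the pace of the 6-token request before "e" can start.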

What is FlashAttention or attention optimization?

Attention optimization rewrites the GPU math so data moves less between memory and cores. FlashAttention is one such method: it fuses operations into a single kernel, slashing overhead and speeding long-context processing dramatically.
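
The core numerical trick can be shown without a GPU: a running (“online”) softmax lets attention consume keys and values chunk by chunk, never materializing the full score vector, which is what lets FlashAttention-style kernels keep data in fast on-chip memory. A pure-Python sketch for a single query:

```python
import math

def chunked_attention(q, keys, values, chunk=2):
    """Single-query attention over K/V chunks with a running softmax:
    only one chunk is 'resident' at a time, and the full score vector
    is never built."""
    m = float("-inf")          # running max score, for numerical stability
    denom = 0.0                # running softmax denominator
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        for k_vec, v_vec in zip(keys[start:start + chunk],
                                values[start:start + chunk]):
            score = sum(a * b for a, b in zip(q, k_vec))
            new_m = max(m, score)
            rescale = math.exp(m - new_m)     # exp(-inf) == 0.0 on first pass
            w = math.exp(score - new_m)
            denom = denom * rescale + w
            acc = [a * rescale + w * v for a, v in zip(acc, v_vec)]
            m = new_m
    return [a / denom for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 1.0], [1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]]
vals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
out = chunked_attention(q, keys, vals, chunk=2)
print([round(x, 4) for x in out])  # matches one-shot softmax over all scores
```

The real kernels add tiling over queries, fused matmuls, and careful memory scheduling, but the rescale-and-accumulate loop above is the piece that makes chunking mathematically exact.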

What is low-precision arithmetic (FP8/BF16)?

Modern GPUs and TPUs can run matrix math in smaller number formats such as FP8 or BF16. This cuts the time and energy needed for each operation while maintaining model fidelity through calibration—boosting tokens per watt.

What is routing or cascaded serving?

Routing systems decide which model handles a request: small models for everyday tasks, large ones for complex reasoning. It’s like triage for compute—minimizing cost while maintaining reliability and user experience.
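
A minimal cascade sketch: try the cheapest tier first and escalate when its confidence falls below a threshold. The models and confidence scores here are stand-ins; real routers use learned or heuristic quality signals:

```python
def cascade(prompt, tiers):
    """Try models cheapest-first; escalate when a tier's self-reported
    confidence is below its threshold. `tiers` is a list of
    (name, answer_fn, threshold); answer_fn returns (answer, confidence)."""
    for name, answer_fn, threshold in tiers[:-1]:
        answer, confidence = answer_fn(prompt)
        if confidence >= threshold:
            return name, answer
    name, answer_fn, _ = tiers[-1]        # the last tier always answers
    return name, answer_fn(prompt)[0]

# Hypothetical models: the small one is confident only on short prompts.
small = lambda p: ("small:" + p, 0.95 if len(p) < 20 else 0.3)
large = lambda p: ("large:" + p, 0.99)
tiers = [("small", small, 0.8), ("large", large, 0.0)]

print(cascade("what time is it", tiers))  # handled by the small model
print(cascade("explain transformer attention in depth", tiers))  # escalated
```

If most traffic resolves at the small tier, the large model's scarce capacity is reserved for the hard tail, which is the whole economic point of triage.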

What is caching for prompts or prefixes?

Prompt caching stores common or repeated prompts—like a chatbot’s system message or instructions—so they don’t need reprocessing. The effect is faster first-token times and smoother repeated interactions across sessions.
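
A sketch of the mechanism: cache the expensive prefill result keyed by a hash of the prefix text, so a shared system prompt is processed once. Here `len()` stands in for the real prefill computation:

```python
import hashlib

class PrefixCache:
    """Cache 'processed' prompt prefixes (stand-ins for real KV states)
    keyed by a hash of the prefix text."""
    def __init__(self):
        self.store, self.hits, self.misses = {}, 0, 0

    def get_or_compute(self, prefix, compute_fn):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = compute_fn(prefix)   # the expensive prefill
        return self.store[key]

cache = PrefixCache()
system = "You are a helpful assistant."
for user_turn in ("hi", "what's 2+2?", "thanks"):
    state = cache.get_or_compute(system, compute_fn=len)  # len() mimics prefill
print(cache.hits, cache.misses)  # 2 1: the system prompt was prefilled once
```

In production the cached object is the attention KV state for the prefix, which is why hits translate directly into faster time-to-first-token.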

What is adaptive scheduling and placement?

Schedulers now place workloads on GPUs and regions based on latency, carbon intensity, or local power price. Putting small models near users and large ones in central clusters shortens response times and balances the grid load.
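
A toy placement policy shows the shape of such a scheduler: score each candidate region on weighted latency, power price, and carbon intensity, then pick the cheapest. The weights, latencies, prices, and carbon figures are all invented:

```python
def place(request_region, candidates, w_latency=1.0, w_price=1.0, w_carbon=0.5):
    """Pick the candidate region with the lowest weighted score of latency,
    power price, and carbon intensity. Illustrative, not any vendor's policy."""
    def score(c):
        return (w_latency * c["latency_ms"][request_region]
                + w_price * c["power_price"]
                + w_carbon * c["carbon_gco2"] / 10)
    return min(candidates, key=score)["name"]

regions = [
    {"name": "edge-eu", "latency_ms": {"eu": 12, "us": 95},
     "power_price": 9, "carbon_gco2": 250},
    {"name": "core-us", "latency_ms": {"eu": 90, "us": 18},
     "power_price": 6, "carbon_gco2": 400},
]
print(place("eu", regions))  # edge-eu: proximity wins for European traffic
print(place("us", regions))  # core-us: cheaper power and low local latency
```

Shifting the weights lets the same policy chase carbon at off-peak hours or latency during business hours, which is how compute becomes a grid-aware load.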

What is a GPU or TPU?

GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) are the specialized chips that run AI models. GPUs, originally built for rendering graphics, excel at parallel math across thousands of cores. TPUs, custom-built by Google, are optimized for tensor operations common in neural networks. Both process massive amounts of data simultaneously, but TPUs focus on efficiency and throughput in data centers, while GPUs offer flexibility for developers and startups. The faster and denser these chips get, the more tokens per second they can deliver—and the lower the cost of every AI response.

For years, most concern focused on training. That concern was justified; the numbers are concrete and visible. Epoch AI has tallied training compute growth at roughly 4× per year on average since 2010. By mid‑2025, more than 30 frontier models had passed the 1e25 FLOP threshold. Run durations have stretched from months toward a year at the very frontier. These are counts, not impressions, and they matter for serving. Longer context windows, richer modalities, and higher ceilings on reasoning make products more capable, but they also raise the steady‑state work a service must perform. Training happens in bursts; inference runs all day. In other words, once a model ships, the recurring bill starts rather than stops.

How people use these systems is shifting as well.

Agentic AI turns a single request into many small steps. Ask for a working prototype, a code refactor, a research brief, or a product plan, and the system does not make one pass. It plans sub‑tasks, retrieves information, calls tools, writes and tests code, packages artifacts, performs checks, deploys, and revises. Each of those steps can involve multiple model calls. A single prompt can quietly trigger thousands of inferences, some in sequence and some in parallel. And this recursiveness is not limited to coding. Everyday work—summarizing meetings, preparing proposals, coordinating schedules, producing first drafts—naturally expands into chains of retrieval, drafting, review, and revision. That compounds the depth and activity we should expect as agentic coworkers become normal in knowledge work. For readers who want a snapshot of what people actually use, the a16z lists of fast‑rising AI apps and their analysis on agentic coworkers show how quickly these patterns are moving from demos into daily usage.

To be fair, per-call inference costs are plummeting, dropping as much as 1,000x since 2022 thanks to efficient models and new GPUs like Blackwell, making advanced AI more accessible. Yet this is offset by surging call intensity in agentic AI systems, which spawn recursive chains of thousands of back-to-back inferences.

Now the curve steepens further with multimodal.

What began with text now routinely includes images and audio—and increasingly, full video—which turns a steady stream of tokens into something closer to highway traffic. An image is a grid of pixels; audio is a continuous waveform; video multiplies both frame by frame, second by second. Moving from text to video is a step change, not a gentle incline. Serving video well means tiled or paged attention, chunked diffusion, frame‑level caching, and distributed schedulers that keep accelerators saturated while holding tails in check. The same serving principles still apply—reuse, avoid, place, batch—but the volume of computation per second is simply larger.

A practical response at the application layer is to arrange chains so the system avoids paying “big” at every link. Emergent illustrates this approach. Emergent is an agentic coding platform that lets consumers build production-grade apps through natural language, with backend AI agents doing the work.

The product turns one prompt into an orchestrated line of agents that plan, code, test, and deploy together, with orchestration designed to keep builders in flow. To capitalize on LLM strengths, Emergent parallelizes tasks across a multi-agent system. A central orchestrator routes work to domain-expert sub-agents (testing, vision, integration, triage, and deployment), allowing each to solve its slice independently. This lowers the main agent’s context/token burden and keeps it focused on the core reasoning path—yielding faster responses and lower cost.

“We indexed on quality first,” admitted Emergent CEO Mukund Jha. “Now we index on latency and orchestration, because every second saved compounds across hundreds of agent steps.” This approach enabled Emergent to take the industry by storm: over one million users have authored about 1.5 million apps on the platform, and the company reports $20 million ARR within four months of operations.

This focus on low-latency orchestration, seen with Emergent, handles the interactive side of agentic work. But a different, even more inference-heavy problem emerges with deep agentic research.

A single complex question—like "conduct a full market analysis" or "find all supporting and refuting evidence for a scientific claim"—can't be answered in seconds. It represents a chain of retrieval, synthesis, and validation that could consume tens of thousands of inferences.

This is the challenge platforms like Parallel.ai, Perplexity, or Caesar Data are built to absorb. Caesar is an asynchronous, API-first research engine. Instead of a real-time chat, a developer submits a query and receives a job ID. In the background, Caesar's agents perform the heavy lifting: scraping, retrieving, reasoning, and synthesizing a fully-cited answer.

Caesar’s API turns complex queries into expert-level, cited research. It handles the end-to-end orchestration, inference cost, and synthesis of agentic workflows, letting developers embed an expert-level research engine without building or paying for the massive, recurring inference orchestration themselves.

The small‑model turn: open weights and cascades

Alongside advances at the frontier, a counter‑current has gathered pace: smaller language models and open‑weight releases that can be quantized and run close to where work happens. GPT‑OSS in 120B and 20B configurations is a good marker of this shift. Because the weights can be compressed (for example, to FP8/INT8/INT4) and served on modest fleets or even private clusters, inference no longer belongs only to hyperscalers. Capability moves into robots, vehicles, factory cells, and on‑prem deployments where latency, privacy, or bandwidth make locality the better choice.

Claude Haiku 4.5 sits in the same lane: a compact model tuned for fast, low‑cost inference that handles most turns without escalating to an ultra‑scale model. That behavior keeps p95 latency and cost per token in check while preserving a path to escalate when a task truly needs it.

The design pattern that emerges is layered. Small, task‑tuned models answer the bulk of requests, often paired with retrieval so they can pull fresh context on demand. Mid‑tier specialists handle domain logic or more nuanced reasoning. Ultra models intervene only when necessary. The largest models become training fountainheads and distillation sources; day‑to‑day intelligence lives nearer to the edge. In practice, this means better tokens‑per‑watt, lower round‑trip time, and clearer unit economics—especially when routing policies steer easy cases to small models and reserve scarce capacity for the hard tail.

This shift also improves the user experience. People equate quickness with competence; a fast first token and steady follow‑through build trust. Small models placed near users help achieve both, and because they are open‑weight, teams can audit, adapt, and govern them in ways that fit local requirements.

As the load rises, a cohort of inference specialists focuses on efficiency, trading splashy model headlines for the unglamorous work of moving tokens faster, more steadily, and at lower cost. The cohort includes pure-play Inference-as-a-Service platforms like Together, FriendliAI, and Baseten, alongside firms offering alternative silicon for optimization, such as Groq and Cerebras.

FriendliAI, in inference terms, plays both tracks: decoding/attention efficiency (KV and prompt caching, speculative decoding) and deep system/hardware tuning (kernel work, iteration/continuous batching, quantization, tight GPU scheduling). The aim is simple: fewer bytes moved, more useful math per second, flatter tails.

“If your model is complex and you have traffic, GPUs multiply, costs explode,” says founder Byung‑Gon Chun. The response is discipline: trim the decode loop, reuse state, cache aggressively, and batch wherever possible so the GPUs stay hot. In internal and public comparisons, their serving stack has been reported to run roughly 2–2.5× faster than standard deployments, with a more predictable p95.

The results are visible in applied settings. LG’s EXAONE—trained on proprietary industrial data—routes through FriendliAI for workloads like battery design and vehicle systems. In robotics, fifty milliseconds versus five hundred is not just UX; it can be safety. That is where decoding/attention shortcuts and system‑level batching matter: trim per‑token overhead, keep accelerators saturated, and hold the tail in check.

Under the hood, FriendliAI is an LLM serving engine built for low latency and high throughput. The innovations stem from kernel‑level serving optimizations, correctness guarantees, and tight GPU scheduling—designed for the factory floor and the lab, where a missed token can be a missed tolerance. See their vLLM‑alternative stack, TensorRT‑LLM comparisons, and work on iteration batching that underpins both throughput and tail latency control.

Impala points in the same direction, but with an explicit enterprise stance: treat inference as a hyperscaled utility behind a single, serverless-like surface so teams don’t manage capacity, quotas, or burst math; they just ship. The claim is straightforward: keep p95 low while letting demand spike, and drive dollars-per-million-tokens down by pushing utilization up and idle time down. In other words, make inference “invisible” so builders focus on products rather than schedulers.

“Our north star is simple: if intelligence scales, capacity shouldn’t be the failure mode. Inference should feel invisible, a commodity in the best sense: reliable, economical, and everywhere, so the world can think without friction.”

It is worth noting the broader cohort’s shape. Groq—among the most heavily funded—pursues a hardware‑plus‑software path, building custom silicon and a compiler/runtime stack to push tokens per second and tokens per watt. Decart leans into ultra-low-latency deployment for applications where every millisecond matters; Together focuses on flexible, multi‑model serving and routing. Different plays, same pressure: higher tokens/sec, lower cost per million tokens, and tighter p95s at production load.

A natural question follows: where does the next unit of efficiency come from—more disciplined serving platforms, or deeper vertical integration at the hardware frontier?

On the hardware edge, InferenceMAX‑style estimates put NVIDIA Blackwell (B200/GB200) at the front of tokens‑per‑second and tokens‑per‑watt. With aggressive software—quantization, speculative decoding, paged/flash attention, and iteration/continuous batching—token revenue can outpace hardware cost by an order of magnitude.
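
The order-of-magnitude claim is easy to sanity-check with hypothetical numbers. These are placeholders, not InferenceMAX figures: an accelerator's hourly cost, a price per million tokens, and two throughput levels before and after serving optimizations:

```python
# Illustrative unit economics for one accelerator (all numbers are
# hypothetical placeholders, not InferenceMAX results).
def revenue_multiple(tokens_per_sec, price_per_m_tokens, hourly_hw_cost):
    """Hourly token revenue divided by hourly hardware cost."""
    hourly_revenue = tokens_per_sec * 3600 / 1e6 * price_per_m_tokens
    return hourly_revenue / hourly_hw_cost

baseline  = revenue_multiple(tokens_per_sec=2_000,
                             price_per_m_tokens=2.0, hourly_hw_cost=3.0)
optimized = revenue_multiple(tokens_per_sec=10_000,
                             price_per_m_tokens=2.0, hourly_hw_cost=3.0)
print(round(baseline, 1), round(optimized, 1))  # 4.8 24.0
```

At fixed prices, the revenue-to-hardware-cost ratio scales linearly with tokens per second, which is why serving-stack gains compound straight into margin.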

Seen through that lens, the market is separating into two lanes—a barbell.

Hyperscale: vertically integrated stacks with tight silicon‑to‑software control and multi‑gigawatt sites. These providers push new accelerators into production early, squeeze efficiency from compilers and kernels, and aggregate massive workloads to keep utilization high.

Hyper‑efficiency: lean operators and platforms that squeeze milliseconds and dollars from commodity fleets or specialized chips. They compete on serving craft—routing, caching, batching, scheduling—and on proximity to end users. The middle tier—teams that simply rent capacity and hope scale solves the rest—is getting squeezed; without proximity or vertical control, costs stay high and tail latency stays stubborn.

Independent neoclouds and alternative hyperscalers, such as CoreWeave and Vultr, give AI‑native teams on‑demand accelerators without full hyperscaler lock‑in. Serving platforms like FriendliAI and Together help those teams extract more work per watt from the hardware they rent. Milliseconds become margin; margin decides who ships real‑time features versus slideware.

Decart is an example of the user expectations this new architecture enables: sub-millisecond inference between agents that exchange a real-time view of the world.

This is also where the layered routing pattern pays off: small and mid models can live near users for fast first tokens while ultra models remain centralized, with policies that escalate only when a task truly needs it.

In practice, the advantage shows up in placement rather than slogans. Vultr’s globally distributed GPU regions and edge proximities shorten distance to users; p95 falls, tail latency becomes manageable, and data‑residency options widen. Small and mid models can live near the edge; ultra models can stay centralized. Simple placement policies balance cost per token with responsiveness. Capacity grows horizontally across regions; predictable pricing replaces long procurement cycles—a better fit for agent swarms and bursty IoT than queueing for a single mega‑region. With B200 and B300 available on demand (and GB300 systems taking orders), topology follows load, not paperwork. Think of it as inference‑aware topology: batching and precision choices travel with the workload so cost per call and tail latency are tuned region by region.

“We compete on performance-per-dollar and latency,” says Vultr’s Kevin Cochrane. “More tokens delivered per dollar on infrastructure consumed matter. And faster inference matters to deliver a superior customer experience.”

The placement story points at a larger reality: capacity is now planned and financed like a utility. Sequoia Capital’s 'AI's $600B question' ran a back‑of‑the‑envelope on the revenue base needed to service GPU spend; the direction mattered more than the precision. Applied to power, the same logic makes a two‑gigawatt baseline a sensible planning pulse for the largest serving surfaces. To see why, look at the scale of recent AI infrastructure deals, then sanity‑check that against rising token demand.

Economics, deals, and interdependence

The scale of today’s AI infrastructure resembles nation building. Multi‑billion‑dollar contracts, multi‑gigawatt campuses, and long supply agreements now define the landscape. A useful rule of thumb is that the ecosystem must generate enough consumer and enterprise revenue to service a recurring inference bill, not just one‑off training capex. What looks like a market also behaves like a metabolism: electrons and dollars circulate within a handful of firms as subsidy, revenue, and investment converge. Concentration helps explain why a small cluster captures a large share of profit growth and capex.

Deals in numbers (selected; value and GW are targets/estimates; peaks and averages vary):

| Announced | Companies | Hardware | Details | Value / Power (GW) |
|---|---|---|---|---|
| Mar 18, 2024 | Oracle Cloud, NVIDIA | DGX Cloud, Grace Blackwell | Sovereign AI and OCI Supercluster | 0.5–1 GW* |
| Apr 9, 2024 | Google Cloud | Axion (Custom Arm CPU) | Initial announcement of custom Arm-based CPU | — |
| Jul 9, 2024 | AWS | Graviton4 (Custom Arm CPU) | General availability of Graviton4-powered instances | — |
| Oct 16, 2024 | Microsoft Azure | Cobalt 100 (Custom Arm CPU) | General availability of Cobalt 100-based VMs | — |
| Oct 30, 2024 | Google Cloud | Axion (Custom Arm CPU) | General availability of C4A, first Axion-based VMs | — |
| Nov 22, 2024 | Anthropic, AWS | Trainium/Inferentia | AWS named primary cloud and training partner; Amazon total investment reaches $8B | $8B / >0.5 GW* |
| Jan 21, 2025 | OpenAI / Stargate Project | — | Initial announcement of $500B "Stargate" AI infrastructure project | $500B (planned) |
| Jul 9, 2025 | OpenAI, Oracle Cloud | — | $30B/year deal for 4.5 GW of data center capacity under "Stargate" initiative | $30B yearly / 4.5 GW |
| Jul 22, 2025 | OpenAI, Oracle Cloud | — | Add 4.5 GW of new data center capacity to Stargate project | 4.5 GW |
| Oct 6, 2025 | OpenAI, AMD | MI450 | Up to 6 GW of compute, includes warrant for up to 10% of AMD | 6 GW |
| Oct 14, 2025 | Oracle Cloud, AMD | MI450 | Partnership to launch 50,000-GPU supercluster starting in Q3 2026 | 0.1–0.3 GW* |
| Oct 14, 2025 | Oracle Cloud, NVIDIA | NVIDIA AI Platforms | Sovereign AI initiatives, starting with Abu Dhabi | Not specified |
| Oct 15, 2025 | Microsoft/NVIDIA, Aligned Data Centers | Data Center Capacity | Acquisition of Aligned Data Centers | ~$40B / >5 GW |
| Oct 23, 2025 | OpenAI/Oracle, Vantage Data Centers | Data Center Campus | "Stargate" data center site in Wisconsin | ~1 GW |
| Oct 23, 2025 | Anthropic, Google Cloud | Google TPUs | Multi-billion dollar deal for access to up to one million TPUs | >1 GW |
Estimates reflect publicly discussed buildout capacity, not necessarily full and immediate consumption. GW values represent project targets or industry projections; actual operational peaks and averages vary. *Estimated gigawatts, based on the deal's financial scale, the hardware involved, and comparison to projects with stated power targets.

Bounding the electricity

Precise fleet counts are scarce, but the direction is clear. As mainstream usage compounds—more concurrent sessions, longer interactions, tighter round‑trip targets—the steady draw for the largest serving surfaces settles in the low gigawatts. The point is not precision; it’s recognizing that inference now behaves like a utility load. Planning in megawatts as well as GPUs is becoming standard practice.

The new Frontier Data Centers hub from Epoch AI, which uses satellite and permit data to track construction, visually confirms that multiple major AI clusters are planned to operate at or above 1 GW within the next year, validating this baseline figure.

Projections underscore the shift. US data centers consumed roughly 183 TWh in 2024—about 4% of national electricity—and widely cited outlooks place 2030 around 300–400 TWh, equivalent to a continuous 35–50 GW. Taking a broader view, industry scenarios push AI‑related demand into the low‑hundreds of gigawatts by the mid‑2030s, while global adds of roughly 69–141 GW are forecast across 2025–2030. In that landscape, a 2‑gigawatt serving surface is a single‑campus pulse, not a national number.

AI Data Center Power Demand — Key Projections

  • US data centers consumed ~183 TWh in 2024 (~4% of national electricity)
  • 2030 outlook: 300–400 TWh, equivalent to a continuous 35–50 GW baseline
  • Global data center capacity additions forecast at 69–141 GW across 2025–2030
  • Multiple AI clusters planned to operate at or above 1 GW within the next year (Epoch AI Frontier Data Centers)
  • Mid-2030s scenarios: AI-related demand reaching low hundreds of gigawatts
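
The TWh-to-GW equivalences above follow from simple unit conversion: annual energy divided by the hours in a year gives a continuous average draw.

```python
# Converting annual energy (TWh) to a continuous average power draw (GW):
# divide by the hours in a year. 1 TWh = 1000 GWh.
HOURS_PER_YEAR = 8760

def twh_to_avg_gw(twh_per_year):
    return twh_per_year * 1000 / HOURS_PER_YEAR

print(round(twh_to_avg_gw(183), 1))  # 20.9: US data centers, 2024 average
print(round(twh_to_avg_gw(300), 1),  # 34.2
      round(twh_to_avg_gw(400), 1))  # 45.7: the 2030 outlook band
```

The computed 34–46 GW band roughly matches the 35–50 GW figure cited above once peak-to-average ratios and rounding are accounted for.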

These are not virtual systems; they draw real power, water, and transmission rights. Grid‑aware schedulers are already routing by carbon intensity and local peaks; flexible compute windows idle non‑critical workloads during spikes. Token generation has become a grid variable. In dense urban regions, unconstrained agent swarms are a load‑management problem as much as a cloud bill.

Where we likely land next

Expect frontier models to keep improving—yet feel smaller at serve time. Expect small, task‑tuned models with retrieval to handle most turns. Expect selective escalation to ultras. Expect safe, post‑deployment adaptation to trim tokens and rework. Expect operators to manage to three hard numbers: p95 latency roughly in the 150–250 millisecond range for consumer UX, dollars per million tokens, and kilowatt‑hours per million tokens. Placement will do much of the rest: keep small and mid models close to users, hold ultras in a few cores, and route by latency and locality so p95 stays predictable, even as agent traffic spikes or video enters the loop.

Seen in this light, the major ecosystems do not sit in opposition. OpenAI’s subscription and API revenues can feed training; Google’s push on tokens‑per‑watt can keep efficiency improving; clouds and neoclouds expand placement choices; serving platforms smooth the tails; application builders organize chains so the system does not pay “big” at every link. If those loops hold, the experience should feel faster even as total demand rises. The bills will still arrive, and the grid will still matter, but distance, routing, and discipline can make a gigawatt‑shaped future feel instant to the person on the other side of the screen.

Sequoia Capital’s 2025 analysis projects at least a tenfold increase in compute consumption per knowledge worker—with scenarios ranging from 1,000× to 10,000×—as AI augments day‑to‑day intellectual work. They frame this as “FLOPs per knowledge worker,” a simple way to think about how much compute the average professional will implicitly consume through AI‑powered tools. That expectation aligns with the themes here: rising usage, more agentic workflows, and a steady shift from episodic training to continuous inference.

“The pace of model progress, the spread of modalities, and the surge of entrepreneurship all point to one imperative: serve great models close to where people build and use them,” said Kevin Cochrane. “Our focus at Vultr is marrying edge efficiency with cloud elasticity so teams can move fast without losing control of latency or cost.”

So call it 2 gigawatts. It's not a prophecy but a baseline to plan against. With the meter in view and the playbook in hand, we can reach rapid, near-instant outputs with watts and dollars in check. 2 gigawatts is the measure of our ambition. What we do with that information can bend the curve of AI, or flatten it.