It’s come time to read the meter.
Every answer from an AI system draws power, time, and money. When a feature goes viral, those draws add up like clock ticks on a utility meter. Inference—the act of running a trained model to produce tokens, images, audio, or video—is no longer a rounding error. It is the day‑to‑day business of AI: what users feel as speed, what operators experience as throughput and tail latency, and what finance sees as a recurring bill.
OpenAI offers a clear sense of scale. ChatGPT usage sits in the hundreds of millions of users per month, with figures of 700–800 million cited publicly. Its APIs have been described as processing about 8 billion tokens per minute across endpoints. Each token is a small unit of work: memory moves, matrix multiplies, cache reads, and network hops. At this altitude, shaving 20–50 milliseconds from the decode loop or reducing compute per token by 10–20 percent can be the difference between a product that feels instant and one that makes people wait.
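To make that rate concrete, here is a quick back-of-envelope calculation. The token throughput is the publicly cited figure; the per-token price is an invented illustration, not OpenAI's actual pricing.

```python
# Back-of-envelope scale math for the figures cited above.
# The price per million tokens is a hypothetical assumption, not a quote.
TOKENS_PER_MINUTE = 8_000_000_000          # cited API-wide throughput
PRICE_PER_MILLION = 1.00                   # USD, illustrative only

tokens_per_second = TOKENS_PER_MINUTE / 60
dollars_per_minute = TOKENS_PER_MINUTE / 1_000_000 * PRICE_PER_MILLION

print(f"{tokens_per_second:,.0f} tokens/second")
print(f"${dollars_per_minute:,.0f}/minute at ${PRICE_PER_MILLION}/M tokens")
```

At that scale, a 10 percent reduction in compute per token is not a rounding error; it compounds every minute of every day.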
What inference is, and how it got fast
First, a quick grounding: Training is about learning; inference is about doing.
Inference happens right now, for this user, for this request. Three constraints govern the experience and the economics: latency (how long someone waits), throughput and its tail behavior (how many people you can serve at once and how predictable the slowest turns are), and unit cost (dollars per request, often summarized as dollars per million tokens). The serving playbook pushes on those constraints in four practical ways: reuse work already done, avoid work that does not need to be done, place the work closer to the user, and keep accelerators busy without letting the slowest requests stretch the tail.
In plain terms, the playbook has four moves. First, model compression: use quantization (INT8/INT4, FP8/NF4) so weights and activations take fewer bits, fit in memory, and run faster. Second, adaptive compute: route easy questions to small models and escalate only when needed; inside a large model, use mixture‑of‑experts so only a subset of “experts” activates per token rather than the entire network. Third, decoding and attention efficiency: use speculative decoding so a small “drafter” proposes a short run of tokens and the target model verifies them in one pass; maintain a KV cache so the model does not recompute the entire history at every step; adopt attention kernels that move fewer bytes. Fourth, system and hardware optimization: employ iteration‑level or continuous batching so new requests can join in‑flight batches between decode steps; use better kernels to reduce memory traffic; and place latency‑sensitive models near users.
Key-value (KV) caching saves the intermediate attention states from previous tokens so the model doesn’t need to recompute them each time it generates a new token. By reusing these stored values, responses flow faster and cost less energy, especially for long prompts or multi-turn conversations.
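The idea can be sketched in a few lines. This is a toy single-head attention loop in pure Python; a real serving stack caches per-layer, per-head GPU tensors, but the principle is the same: append each new token's key and value once, never recompute the history.

```python
# Minimal KV-cache sketch for autoregressive decoding (toy vectors,
# pure Python; illustrative only).
import math

def attend(q, keys, values):
    """Single-head attention: softmax of q·k scores, weighted sum of values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append this token's key/value once; history is reused, not recomputed.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [0.5, 0.5])
out2 = cache.step([0.0, 1.0], [0.0, 1.0], [0.2, 0.8])
print(len(cache.keys))  # 2 cached entries after two decode steps
```

Without the cache, step N would re-encode all N previous tokens; with it, each step does work proportional to one new token plus a lookup over stored state.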
Quantization reduces the numerical precision of a model’s weights and activations—often from 16-bit to 8-bit or lower—so calculations run faster and memory usage drops sharply. Done carefully, it keeps accuracy nearly identical while improving throughput and cutting power draw.
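A minimal sketch of symmetric INT8 quantization makes the trade concrete. The tensor and the single per-tensor scale are illustrative; production stacks use calibrated per-channel scales and fused low-precision kernels.

```python
# Toy symmetric INT8 quantization: map floats into [-128, 127] with one scale.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127   # one scale for the whole tensor
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_err, 4))
```

Each weight now occupies 8 bits instead of 16 or 32, and the reconstruction error stays below half a quantization step, which is why accuracy often survives nearly intact.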
Mixture-of-experts models divide a large network into many smaller “experts.” For any given token, only a few experts activate, saving compute while preserving quality. It’s a way to make giant models behave efficiently at inference time.
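The routing step at the heart of mixture-of-experts can be sketched with a hand-rolled gate. The experts here are trivial stand-in functions and the gate scores are invented; in a real MoE layer both are learned.

```python
# Sketch of top-k mixture-of-experts routing: only the k best-scoring
# experts run for a given token (toy experts and gate scores).
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, experts, gate_scores, k=2):
    # Select the k highest-scoring experts, then mix their outputs by gate weight.
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    return sum(w * experts[i](x) for w, i in zip(weights, top))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
out = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, 1.5, -1.0], k=2)
print(out)  # only experts 1 and 2 (scores 2.0 and 1.5) ever execute
```

Half the network's capacity sits idle on this token, which is exactly the point: parameters scale up while per-token compute stays roughly flat.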
Speculative decoding lets a small “draft” model guess several upcoming tokens that a larger model then verifies in parallel. If most guesses are right, the system leaps ahead—reducing the waiting time between a user’s input and the model’s full answer.
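A greedy version of the loop can be sketched as follows. Both "models" are stand-in functions, not real language models, and the verification loop below would be one batched forward pass in practice.

```python
# Greedy speculative decoding sketch: a cheap draft model proposes k tokens,
# the target checks them, and the longest agreeing prefix is accepted.
def speculative_step(prefix, draft_next, target_next, k=4):
    # 1. Draft k tokens cheaply.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Target verifies each position (in practice: one batched pass).
    accepted, ctx = [], list(prefix)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))  # target's correction, then stop
            break
    return accepted

# Toy models: the target continues the alphabet; the draft agrees except after "c".
target = lambda ctx: chr(ord(ctx[-1]) + 1)
draft = lambda ctx: "x" if ctx[-1] == "c" else chr(ord(ctx[-1]) + 1)

out = speculative_step(["a"], draft, target, k=4)
print(out)  # ['b', 'c', 'd'] — two draft tokens accepted, one corrected
```

When the draft is usually right, the target emits several tokens per expensive pass instead of one, which is where the latency win comes from.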
Adaptive compute means the system adjusts how much processing power each query receives. Simple prompts take a light path through smaller or shallower networks; complex ones trigger heavier routes. It keeps latency low and budgets predictable.
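A hypothetical difficulty router shows the shape of the decision. Real routers use learned classifiers; the keyword list and length threshold here are invented for illustration.

```python
# Illustrative adaptive-compute router: cheap heuristics decide whether a
# prompt takes the small-model path or escalates to the large one.
HARD_HINTS = ("prove", "refactor", "step by step", "analyze")

def route(prompt: str) -> str:
    hard = len(prompt.split()) > 40 or any(h in prompt.lower() for h in HARD_HINTS)
    return "large-model" if hard else "small-model"

print(route("What time is it in Tokyo?"))                                # small-model
print(route("Refactor this module and prove it keeps the old behavior."))  # large-model
```

The economics follow directly: if most traffic is easy and takes the cheap path, the expensive model's capacity is reserved for the queries that actually need it.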
Model compression covers pruning, distillation, and quantization techniques that shrink model size and speed up inference. The idea is to keep almost the same intelligence in fewer parameters so deployment is cheaper and fits on smaller hardware.
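Of those techniques, pruning is the easiest to show in miniature. This sketch zeroes out the smallest-magnitude weights of a toy tensor; real pruning is usually structured and followed by fine-tuning to recover accuracy.

```python
# Magnitude-pruning sketch: zero the smallest fraction of weights (toy tensor).
def prune(weights, sparsity=0.5):
    keep_n = int(len(weights) * (1 - sparsity))
    threshold = sorted(map(abs, weights), reverse=True)[keep_n - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(prune(w, sparsity=0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

Half the parameters are gone, and sparse storage and sparse kernels can then skip the zeros entirely.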
Batching groups multiple user requests together so a GPU can process them in one pass. Continuous batching takes it further, inserting new requests into ongoing computation streams. Both maximize GPU utilization and reduce idle cycles, translating into lower cost per token.
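The join-mid-flight behavior can be simulated in a few lines. Request lengths and arrival times are made up; a real scheduler also juggles KV-cache memory and priorities.

```python
# Toy continuous-batching loop: every active request advances one token per
# decode step, finished requests leave, and new arrivals join between steps.
from collections import deque

def serve(arrivals, max_batch=3):
    """arrivals: dict mapping step -> list of (request_id, tokens_needed)."""
    waiting, active, log = deque(), {}, []
    step = 0
    while arrivals or waiting or active:
        waiting.extend(arrivals.pop(step, []))
        while waiting and len(active) < max_batch:    # join the in-flight batch
            rid, need = waiting.popleft()
            active[rid] = need
        for rid in list(active):                      # one decode step for all
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                log.append((rid, step))
        step += 1
    return log  # (request_id, step at which it finished)

done = serve({0: [("a", 2), ("b", 4)], 1: [("c", 1)]})
print(done)
```

Notice that request "c" arrives at step 1 and still finishes at step 1: it slotted into the running batch instead of waiting for "b" to drain, which is the whole advantage over static batching.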
Attention optimization rewrites the GPU math so data moves less between memory and cores. FlashAttention is one such method: it fuses operations into a single kernel, slashing overhead and speeding long-context processing dramatically.
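The numerical trick underneath can be sketched without a GPU. This is an online-softmax loop in the spirit of FlashAttention: the key/value stream is walked in tiles with a running max and normalizer, so the full row of scores is never materialized. It is a pure-Python stand-in for a fused kernel, not the actual algorithm's blocking scheme.

```python
# Online-softmax attention over tiles: constant memory in sequence length.
import math

def tiled_attention(q, keys, values, tile=2):
    m, denom = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        for k, v in zip(keys[start:start + tile], values[start:start + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k))
            new_m = max(m, s)
            scale = math.exp(m - new_m) if m != float("-inf") else 0.0
            denom = denom * scale + math.exp(s - new_m)
            acc = [a * scale + math.exp(s - new_m) * vi for a, vi in zip(acc, v)]
            m = new_m
    return [a / denom for a in acc]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(tiled_attention(q, keys, values))
```

The result is bit-for-bit equivalent (up to rounding) to standard softmax attention, but the intermediate state per query is a scalar max, a scalar normalizer, and one accumulator vector, regardless of context length.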
Modern GPUs and TPUs can run matrix math in smaller number formats such as FP8 or BF16. This cuts the time and energy needed for each operation while maintaining model fidelity through calibration—boosting tokens per watt.
Routing systems decide which model handles a request: small models for everyday tasks, large ones for complex reasoning. It’s like triage for compute—minimizing cost while maintaining reliability and user experience.
Prompt caching stores common or repeated prompts—like a chatbot’s system message or instructions—so they don’t need reprocessing. The effect is faster first-token times and smoother repeated interactions across sessions.
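A minimal sketch of the mechanism, assuming a hash-keyed cache in front of the prompt-processing pass. The `prefill` function here is a placeholder for a real model's prefill, and the counter exists only to show the saving.

```python
# Prompt-prefix cache sketch: the expensive prefill of a shared system prompt
# is computed once per unique prompt and reused by hash key.
import hashlib

CACHE = {}
CALLS = {"prefill": 0}

def prefill(prompt: str) -> str:
    CALLS["prefill"] += 1
    return f"state({len(prompt)} chars)"   # stands in for real KV state

def cached_prefill(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in CACHE:
        CACHE[key] = prefill(prompt)
    return CACHE[key]

system = "You are a helpful assistant. Follow the house style guide."
cached_prefill(system)
cached_prefill(system)      # second request hits the cache
print(CALLS["prefill"])     # 1
```

Every conversation that shares the system prompt skips its prefill cost entirely, which is why first-token latency improves most on repeated, templated interactions.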
Schedulers now place workloads on GPUs and regions based on latency, carbon intensity, or local power price. Putting small models near users and large ones in central clusters shortens response times and balances the grid load.
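The placement decision reduces to a weighted score. All numbers below, including region names, prices, and weights, are invented for illustration; a production scheduler would also account for capacity and data-residency constraints.

```python
# Illustrative placement scorer: choose a serving region by a weighted blend
# of user latency, local power price, and grid carbon intensity.
REGIONS = {
    "us-east": {"latency_ms": 20, "price": 0.9, "carbon": 450},
    "eu-west": {"latency_ms": 95, "price": 1.1, "carbon": 210},
    "us-west": {"latency_ms": 60, "price": 0.8, "carbon": 300},
}

def place(regions, w_latency=1.0, w_price=20.0, w_carbon=0.05):
    def cost(name):
        m = regions[name]
        return (w_latency * m["latency_ms"]
                + w_price * m["price"]
                + w_carbon * m["carbon"])
    return min(regions, key=cost)

print(place(REGIONS))  # us-east wins under these weights: latency dominates
```

Shift the weights toward carbon and the answer changes, which is the point: the same scheduler can optimize for user experience, cost, or grid impact depending on policy.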
GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) are the specialized chips that run AI models. GPUs, originally built for rendering graphics, excel at parallel math across thousands of cores. TPUs, custom-built by Google, are optimized for tensor operations common in neural networks. Both process massive amounts of data simultaneously, but TPUs focus on efficiency and throughput in data centers, while GPUs offer flexibility for developers and startups. The faster and denser these chips get, the more tokens per second they can deliver—and the lower the cost of every AI response.
For years, most concern focused on training. That concern was justified; the numbers are concrete and visible. Epoch AI has tallied training compute growth at roughly 4× per year on average since 2010. By mid‑2025, more than 30 frontier models had passed the 1e25 FLOP threshold. Run durations have stretched from months toward a year at the very frontier. These are counts, not impressions, and they matter for serving. Longer context windows, richer modalities, and higher ceilings on reasoning make products more capable, but they also raise the steady‑state work a service must perform. Training happens in bursts; inference runs all day. In other words, once a model ships, the recurring bill starts rather than stops.
How people use these systems is shifting as well.
Agentic AI turns a single request into many small steps. Ask for a working prototype, a code refactor, a research brief, or a product plan, and the system does not make one pass. It plans sub‑tasks, retrieves information, calls tools, writes and tests code, packages artifacts, performs checks, deploys, and revises. Each of those steps can involve multiple model calls. A single prompt can quietly trigger thousands of inferences, some in sequence and some in parallel. And this recursiveness is not limited to coding. Everyday work—summarizing meetings, preparing proposals, coordinating schedules, producing first drafts—naturally expands into chains of retrieval, drafting, review, and revision. That compounds the depth and activity we should expect as agentic coworkers become normal in knowledge work. For readers who want a snapshot of what people actually use, the a16z lists of fast‑rising AI apps and their analysis on agentic coworkers show how quickly these patterns are moving from demos into daily usage.
To be fair, per-call inference costs are plummeting, dropping as much as 1,000x since 2022 thanks to more efficient models and new GPUs such as Blackwell, making advanced AI more accessible. Yet those savings are offset by surging call intensity in agentic AI systems, which spawn recursive chains of thousands of back-to-back inferences.
Now the curve steepens further with multimodal.
What began with text now routinely includes images and audio—and increasingly, full video—which turns a steady stream of tokens into something closer to highway traffic. An image is a grid of pixels; audio is a continuous waveform; video multiplies both frame by frame, second by second. Moving from text to video is a step change, not a gentle incline. Serving video well means tiled or paged attention, chunked diffusion, frame‑level caching, and distributed schedulers that keep accelerators saturated while holding tails in check. The same serving principles still apply—reuse, avoid, place, batch—but the volume of computation per second is simply larger.
A practical response at the application layer is to arrange chains so the system avoids paying “big” at every link. Emergent illustrates this approach: it is an agentic coding platform that lets consumers build production-grade apps through natural language and backend AI agents.