It’s time to read the meter.
Every answer from an AI system draws power, time, and money. When a feature goes viral, those draws add up like clock ticks on a utility meter. Inference, the act of running a trained model to produce tokens, images, audio, or video, is no longer a rounding error. It is the day‑to‑day business of AI: what users feel as speed, what operators experience as throughput and tail latency, and what finance sees as a recurring bill.
OpenAI offers a clear sense of scale. ChatGPT usage sits in the hundreds of millions of users per month, with figures of 700 to 800 million cited publicly. Its APIs have been described as processing about 8 billion tokens per minute across endpoints. Each token is a small unit of work: memory moves, matrix multiplies, cache reads, and network hops. At this altitude, shaving 20 to 50 milliseconds from the decode loop or reducing compute per token by 10 to 20 percent can be the difference between a product that feels instant and one that makes people wait.
