The University of California San Diego's Hao AI Lab recently acquired an NVIDIA DGX B200 system, a move poised to significantly accelerate research into low-latency LLM serving. This powerful new hardware will enable the lab to push the boundaries of large language model inference, directly impacting how quickly and efficiently generative AI can respond to user requests. The Hao AI Lab is no stranger to this domain; its foundational research, including DistServe, has already shaped production LLM inference platforms like NVIDIA Dynamo.
A core contribution from the Hao AI Lab is the concept of "goodput," a critical metric for evaluating LLM serving performance. Historically, systems focused solely on "throughput," measuring tokens generated per second across the entire system. While high throughput reduces the cost per generated token, it often comes at the expense of user-perceived latency, creating a fundamental trade-off. Goodput resolves this tension by counting only the throughput delivered within user-specified latency objectives, ensuring both efficiency and a quality user experience.
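The distinction can be made concrete with a small sketch. The snippet below contrasts raw throughput with a goodput-style measurement that only credits requests meeting latency objectives; the specific metrics (time to first token, time per output token) and the SLO thresholds are illustrative assumptions, not values from the lab's work.

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft: float         # time to first token, in seconds (assumed metric)
    tpot: float         # time per output token, in seconds (assumed metric)
    output_tokens: int  # tokens generated for this request

def throughput(requests, wall_time):
    """Total tokens per second, regardless of per-request latency."""
    return sum(r.output_tokens for r in requests) / wall_time

def goodput(requests, wall_time, ttft_slo=0.2, tpot_slo=0.05):
    """Tokens per second, counting only requests that met both latency SLOs."""
    ok = [r for r in requests if r.ttft <= ttft_slo and r.tpot <= tpot_slo]
    return sum(r.output_tokens for r in ok) / wall_time

reqs = [
    RequestStats(ttft=0.15, tpot=0.04, output_tokens=100),  # meets both SLOs
    RequestStats(ttft=0.90, tpot=0.04, output_tokens=100),  # first token too slow
    RequestStats(ttft=0.10, tpot=0.12, output_tokens=100),  # decoding too slow
]
print(throughput(reqs, wall_time=10.0))  # 30.0 tokens/s
print(goodput(reqs, wall_time=10.0))     # 10.0 tokens/s
```

In this toy example the system looks three times faster by raw throughput than it does once latency objectives are enforced, which is exactly the gap the goodput metric is designed to expose.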
