Superhuman Hits 200K QPS With Databricks

Superhuman and Databricks engineers collaborated to build an AI inference platform serving over 200K QPS with sub-second latency.

Superhuman leverages Databricks for its high-QPS AI inference needs.

Superhuman, the AI productivity company behind the Superhuman email client and Coda, has partnered with Databricks to scale its AI-powered writing assistance to 200,000 queries per second (QPS). The achievement, detailed in a recent Databricks blog post, is the result of a jointly engineered high-throughput, low-latency AI serving platform.

The collaboration focused on modernizing Superhuman's inference stack, which handles real-time suggestions for correctness, clarity, tone, and style. Previously, the company relied on a custom vLLM stack, which, while capable of massive scale, presented operational challenges and required significant manual tuning for each new model iteration.

Modernizing the Serving Stack

Superhuman's core AI model, responsible for grammatical error correction at peak traffic exceeding 200,000 QPS, was pushing the limits of its existing infrastructure. The need for a platform partner committed to performance and latency Service Level Objectives (SLOs) became paramount.


Both teams established ambitious targets: sub-second P99 latency in real time, with zero quality regression. Meeting them required a deep dive into both the platform infrastructure and the model optimizations themselves.

Meeting Real-Time SLOs on Platform Infrastructure

Achieving high QPS reliably demands robust infrastructure capable of sophisticated load balancing and dynamic scaling. Superhuman's traffic patterns exhibit strong diurnal variations, with rapid ramps that can exceed 200,000 QPS.

To address this, a custom load balancing algorithm based on the "power of two choices" was implemented. This method routes requests to the least loaded of two randomly selected pods, preventing the hotspots that can occur with standard round-robin balancing at high loads.
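A minimal sketch of the routing rule, assuming each pod exposes an in-flight request count the router can read (the class names and bookkeeping here are illustrative, not Databricks' actual implementation):

```python
import random

class Pod:
    def __init__(self, name: str):
        self.name = name
        self.in_flight = 0  # current concurrent requests on this pod

def pick_pod(pods: list[Pod]) -> Pod:
    """Route to the less loaded of two randomly sampled pods."""
    a, b = random.sample(pods, 2)
    return a if a.in_flight <= b.in_flight else b

# Usage: route a request, then release it when the response is sent.
pods = [Pod(f"pod-{i}") for i in range(8)]
chosen = pick_pod(pods)
chosen.in_flight += 1
```

Sampling only two pods keeps routing O(1) per request while avoiding the herding effect of always sending traffic to the single least-loaded pod.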

Dynamic autoscaling was also crucial. The system tracks average request concurrency and scales up aggressively while maintaining a conservative scale-down strategy to prevent latency spikes. Joint shadow testing helped fine-tune these autoscaling parameters.
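The blog post does not publish the tuned thresholds, but the asymmetric policy can be sketched as a simple control loop; TARGET_CONCURRENCY and SCALE_DOWN_FACTOR below are hypothetical placeholders, not the production values:

```python
TARGET_CONCURRENCY = 32.0  # assumed average in-flight requests per pod
SCALE_DOWN_FACTOR = 0.9    # shed at most ~10% of pods per interval

def desired_replicas(current: int, avg_concurrency: float) -> int:
    """avg_concurrency: fleet-wide mean of in-flight requests per pod."""
    needed = max(1, round(current * avg_concurrency / TARGET_CONCURRENCY))
    if needed > current:
        return needed  # scale up immediately to the computed size
    # Scale down gradually so a brief lull doesn't trigger latency spikes.
    return max(needed, int(current * SCALE_DOWN_FACTOR))
```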

Furthermore, image acceleration techniques, originally developed for serverless compute, were adapted to drastically reduce container startup times. This allows new pods to launch in seconds rather than minutes, ensuring smoother performance during traffic surges.

Runtime Optimizations Drive Throughput Gains

The core of the performance leap came from runtime optimizations that boosted per-pod throughput by 60%, from 750 QPS to 1,200 QPS on H100 GPUs, without impacting model quality.

FP8 quantization was a key factor, contributing up to a 30% increase in QPS. Superhuman's ML team pre-quantized model checkpoints to FP8, and Databricks' serving engine efficiently loaded these compressed formats. Both teams collaborated to determine the optimal layers for quantization, ensuring quality was maintained.
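As a rough illustration of the mechanics (not the teams' actual recipe, and with the layer-selection step omitted), per-tensor FP8 quantization of a checkpoint weight might look like this in PyTorch:

```python
import torch

FP8_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

def quantize_fp8(weight: torch.Tensor):
    """Return the FP8 weight plus the per-tensor scale to undo it."""
    scale = weight.abs().max().clamp(min=1e-12) / FP8_MAX
    return (weight / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.float16) * scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_fp8(w)  # half the bytes of an FP16 checkpoint
```

On H100s, FP8 matmul kernels consume the scaled weights directly; the dequantize helper here only demonstrates the round-trip.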

The elimination of CPU-side bottlenecks was another significant win. By introducing a multiprocessing runtime server, the system now prepares and dispatches work to the GPU in parallel, overcoming the single-process serialization bottleneck. This change alone delivered an additional 20% throughput boost.
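A toy version of that idea, with a stand-in prepare() function in place of real tokenization (the production server's interfaces are not public):

```python
import multiprocessing as mp

def prepare(request: str) -> list[int]:
    # Stand-in for CPU-heavy request prep: tokenization, tensor assembly.
    return [ord(c) for c in request]

def worker(inbox, gpu_queue) -> None:
    while True:
        req = inbox.get()
        if req is None:              # poison pill: shut down cleanly
            break
        gpu_queue.put(prepare(req))  # a dispatcher batches these for the GPU

if __name__ == "__main__":
    inbox, gpu_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(inbox, gpu_queue)) for _ in range(4)]
    for p in workers:
        p.start()
    for req in ("fix this sentense please", "make the tone friendlier"):
        inbox.put(req)
    print(gpu_queue.get(), gpu_queue.get())  # prepared work, ready to batch
    for _ in workers:
        inbox.put(None)
    for p in workers:
        p.join()
```

Because each worker is a separate process, request preparation runs truly in parallel rather than being serialized behind one Python interpreter.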

Further CPU-side optimizations, including replacing Python-level tensor operations with C++ calls and reducing GPU idle time through asynchronous scheduling, contributed to the overall efficiency.
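One common form of asynchronous scheduling, sketched here under assumptions (PyTorch, pinned host memory, a dedicated CUDA copy stream; the serving engine's internals are not public), is to overlap the next batch's host-to-device copy with compute on the current one:

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for H2D transfers

def run_batches(model, cpu_batches):
    """Overlap each batch's host-to-device copy with ongoing compute."""
    for cpu_batch in cpu_batches:
        pinned = cpu_batch.pin_memory()  # page-locked => truly async copy
        with torch.cuda.stream(copy_stream):
            gpu_batch = pinned.to("cuda", non_blocking=True)
        # Compute (default stream) waits only for this batch's copy,
        # so the next iteration's copy can start while the model runs.
        torch.cuda.current_stream().wait_stream(copy_stream)
        yield model(gpu_batch)
```

These advancements demonstrate how Databricks' inference platform can be tailored for demanding, high-volume AI workloads, enabling companies like Superhuman to focus on product innovation.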
