Spark Streaming Hits Millisecond Latency

Databricks' Apache Spark Structured Streaming real-time mode is now GA, offering sub-second latency and consolidating streaming needs onto a single engine.

Databricks has moved its Apache Spark Structured Streaming real-time mode out of preview, bringing millisecond-level latency to the platform. The goal is to consolidate real-time data processing onto a single engine, reducing the need to run separate, specialized systems such as Apache Flink alongside Spark.

For years, organizations have relied on Spark Structured Streaming for demanding workloads. However, ultra-low-latency applications often necessitated additional engines, leading to duplicated code, governance complexity, and extra operational overhead. Databricks' Real-Time Mode, now generally available, promises to resolve this by delivering sub-100ms latency directly through the familiar Spark APIs.

This architectural shift is driven by three core innovations: continuous data flow, which processes data as it arrives rather than in batches; pipeline scheduling, allowing stages to run concurrently; and streaming shuffle, which bypasses traditional disk I/O bottlenecks. These changes transform Spark into a high-performance engine capable of powering time-critical applications.
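The latency gap between batch-oriented and continuous execution can be illustrated with a toy sketch in plain Python (a conceptual model only, not Spark internals): in micro-batch processing, a record's result is emitted only once its batch closes, while continuous processing emits as each record arrives.

```python
# Toy model of micro-batch vs. continuous processing (not Spark internals).
# "Time" is the index of the most recently arrived event; a record's latency
# is the gap between its arrival step and the step at which it is emitted.

def micro_batch(events, batch_size):
    """Emit (record, emit_step) pairs; a batch is emitted only when full."""
    emitted, batch = [], []
    for step, record in enumerate(events):
        batch.append(record)
        if len(batch) == batch_size:
            # Every buffered record is released only now, at this step.
            emitted.extend((rec, step) for rec in batch)
            batch = []
    return emitted

def continuous(events):
    """Each record is emitted at the very step it arrives."""
    return [(record, step) for step, record in enumerate(events)]

def latencies(events, emitted):
    """Per-record latency: emit step minus arrival step."""
    arrival = {record: step for step, record in enumerate(events)}
    return [emit_step - arrival[record] for record, emit_step in emitted]

events = ["e0", "e1", "e2", "e3", "e4", "e5"]
print(latencies(events, micro_batch(events, batch_size=3)))  # [2, 1, 0, 2, 1, 0]
print(latencies(events, continuous(events)))                 # [0, 0, 0, 0, 0, 0]
```

Early records in each micro-batch wait for the batch to fill, which is exactly the tail latency that continuous data flow eliminates.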

Real-World Impact and Use Cases

Industry giants are already seeing tangible benefits. Coinbase reports an 80%+ reduction in end-to-end latency, achieving sub-100ms P99s for fraud detection and risk management. DraftKings uses the mode for real-time feature computation in fraud-detection models for live sports betting, reaching latencies that were previously out of reach.

MakeMyTrip leverages Real-Time Mode for personalized search experiences, reporting sub-50ms P50 latencies and a 7% click-through-rate uplift. The company also highlights the mode's ability to unify its data operations, handling everything from ETL to low-latency pipelines within Spark.

The applications extend across various sectors, including real-time personalization for media and retail, IoT anomaly detection, and high-speed fraud flagging for financial services. This capability is crucial for emerging use cases like steering AI agents with the most current data context.

Simplifying the Streaming Landscape

Databricks claims its Real-Time Mode is up to 92% faster than Apache Flink in benchmarks for feature computation tasks. More importantly, it offers a unified development experience, allowing teams to use the same Spark APIs for both batch training and real-time inference, thereby eliminating logic drift and code duplication.
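The unified-API point can be sketched in PySpark (a hypothetical illustration requiring a running Spark environment; the table names and columns here are invented): because Structured Streaming shares the batch DataFrame API, the same transformation function can feed both offline training and the live path.

```python
# Hypothetical PySpark sketch: one transformation reused for batch and streaming.
# Table names ("transactions_history", "transactions_live") are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def compute_features(df):
    # Identical feature logic for offline training and online inference,
    # so the two paths cannot drift apart.
    return df.withColumn("amount_log", F.log1p("amount"))

# Batch path: historical data for model training.
train_df = compute_features(spark.read.table("transactions_history"))

# Streaming path: the same function applied to live events.
live_df = compute_features(spark.readStream.table("transactions_live"))
query = (live_df.writeStream
         .outputMode("append")
         .format("console")
         .start())
```

Keeping feature logic in one function shared by `spark.read` and `spark.readStream` is what removes the duplicated-code problem the article describes.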

A single line of code can reportedly shift a pipeline from hourly batches to sub-second streaming, drastically simplifying infrastructure management and accelerating deployment cycles. This consolidation reduces the need for specialized streaming engines, making Apache Spark Structured Streaming real-time mode a compelling option for many organizations.

The move represents a significant evolution for streaming data processing, allowing Spark to handle operational, latency-sensitive applications previously out of its reach. It's an effort to bring the simplicity and broad ecosystem of Spark to the most demanding real-time use cases, as detailed in the original Databricks announcement.

Getting started requires a simple configuration update to existing Structured Streaming queries. Databricks Runtime 18.1 or above is recommended for optimal performance. This release also brings open-source support for stateless transformations in Apache Spark 4.1 and enhanced asynchronous checkpointing for improved stateful processing.
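As a hedged sketch of what that configuration update might look like (the trigger setting below is a placeholder, not the documented real-time mode API; the exact option is in the Databricks announcement and docs), the change is confined to the query's trigger while the rest of the pipeline stays untouched:

```python
# Hypothetical sketch of switching an existing Structured Streaming query to
# low-latency execution. The trigger shown is the standard open-source
# processing-time trigger; the actual real-time mode setting is documented
# by Databricks and may differ.
query = (events_df.writeStream
         .outputMode("append")
         .format("delta")
         .option("checkpointLocation", "/chk/fraud_features")
         # Before: hourly micro-batches.
         # .trigger(processingTime="1 hour")
         # After: run micro-batches as fast as possible (placeholder for
         # the real-time mode trigger).
         .trigger(processingTime="0 seconds")
         .start("/tables/fraud_features"))
```

The point of the sketch is that only the trigger line changes; sources, sinks, and transformation logic carry over unmodified.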

Databricks is positioning this as a way to extend Spark's reach into a new class of workloads.