Spark Drops Microbatch for Real-Time

Apache Spark's Real-Time Mode (RTM) breaks microbatch barriers, enabling millisecond latency for streaming workloads with a new hybrid execution model.

[Figure: Apache Spark Real-Time Mode's hybrid execution architecture with concurrent stages]

Apache Spark's Structured Streaming has long been a go-to for high-throughput ETL workloads. However, operational use cases demanding millisecond responsiveness, like real-time fraud detection, presented a significant challenge. Databricks has now introduced Apache Spark Real-Time Mode (RTM) in version 4.1, aiming to bridge this gap and consolidate engine management.

Historically, organizations faced a trade-off: use Spark for throughput or opt for systems like Flink for low-latency streaming. RTM collapses this dichotomy, enabling a single engine to handle both, thereby simplifying infrastructure and reducing the learning curve. This move, detailed in Databricks' announcement, could let teams retire dual-engine streaming setups entirely.

The Microbatch Bottleneck

Spark's traditional microbatch architecture excels at processing data in discrete chunks, amortizing overheads and maximizing hardware utilization. This approach, however, introduces inherent latency due to fixed costs associated with each batch, including logging, state updates, and scheduling. Shrinking batch sizes to achieve real-time performance quickly hits a wall, as these fixed costs dominate execution time.
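The "wall" is easy to see with back-of-the-envelope arithmetic. The sketch below models batch latency as a fixed per-batch overhead plus a per-record cost; the specific numbers are illustrative assumptions, not Spark measurements, but the shape of the curve is the point: as batches shrink, fixed costs swallow nearly all of the latency budget.

```python
# Toy model of microbatch latency. Each batch pays a fixed overhead
# (scheduling, offset logging, state commits) regardless of its size.
# The constants are illustrative assumptions, not Spark benchmarks.

FIXED_OVERHEAD_MS = 50.0   # per-batch cost: scheduling, logging, commits
PER_RECORD_MS = 0.01       # marginal cost of processing one record

def batch_latency_ms(batch_size: int) -> float:
    """End-to-end latency for a single batch of the given size."""
    return FIXED_OVERHEAD_MS + batch_size * PER_RECORD_MS

def overhead_fraction(batch_size: int) -> float:
    """Share of latency spent on fixed costs rather than real work."""
    return FIXED_OVERHEAD_MS / batch_latency_ms(batch_size)

# Large batches amortize the overhead; tiny batches are dominated by it.
for size in (100_000, 10_000, 1_000, 100):
    print(f"{size:>7} records: {batch_latency_ms(size):7.1f} ms total, "
          f"{overhead_fraction(size):5.1%} fixed overhead")
# At 100,000 records the fixed cost is under 5% of the batch;
# at 100 records it is roughly 98% - shrinking batches stops helping.
```

This is why Databricks' answer was architectural rather than a smaller trigger interval: no batch size makes the fixed costs disappear.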

Databricks recognized that simply reducing batch size wasn't the answer.

The Hybrid Solution

RTM re-engineers Structured Streaming with a hybrid execution model. It maintains fault tolerance through checkpointing but eliminates latency-inducing waiting periods. This is achieved through several key architectural shifts.

  • Longer duration epochs with continuous data flow: Instead of discrete small batches, RTM processes data continuously through stages, amortizing checkpointing and barrier overheads across extended intervals.
  • Concurrent processing stages: Stages that previously executed sequentially now run concurrently. Reducers can begin processing shuffle files as soon as they are available, rather than waiting for all mappers to complete, dramatically cutting down end-to-end latency.
  • Non-blocking operators: Key operators, like shuffle, have been redesigned to minimize buffering and emit results continuously, ensuring data flows through the pipeline without unnecessary delays. This is crucial for low-latency streaming architectures.
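The concurrent-stages idea can be illustrated with a toy simulation. This is not Spark's scheduler; it is a minimal stdlib sketch contrasting a barrier model, where the reducer waits for every mapper, with a pipelined model, where the reducer consumes each mapper's shuffle output the moment it appears. The mapper counts and transformations are made up for illustration.

```python
# Toy contrast of barrier vs. pipelined stage execution.
# Assumption: 4 "mappers" each emit one shuffle record; the "reducer"
# adds 1 to each. Not Spark code - a minimal illustration only.

import queue
import threading

MAPPER_INPUTS = [3, 1, 4, 2]  # illustrative per-mapper inputs

def run_sequential() -> list[int]:
    """Barrier model: the reducer starts only after all mappers finish."""
    shuffle_files = [n * 10 for n in MAPPER_INPUTS]  # all mappers complete
    return [x + 1 for x in shuffle_files]            # then the reducer runs

def run_pipelined() -> list[int]:
    """Pipelined model: the reducer consumes output as it arrives."""
    shuffle: queue.Queue = queue.Queue()
    results: list[int] = []

    def mapper(n: int) -> None:
        shuffle.put(n * 10)  # emit shuffle output immediately

    def reducer() -> None:
        for _ in MAPPER_INPUTS:
            results.append(shuffle.get() + 1)  # process as data arrives

    consumer = threading.Thread(target=reducer)
    consumer.start()  # reducer is live BEFORE any mapper finishes
    mappers = [threading.Thread(target=mapper, args=(n,)) for n in MAPPER_INPUTS]
    for t in mappers:
        t.start()
    for t in mappers:
        t.join()
    consumer.join()
    return results

# Both models compute the same answer; the pipelined one just never
# makes the reducer idle behind a stage barrier.
print(sorted(run_sequential()), sorted(run_pipelined()))
```

In the real engine the payoff is that reducer work overlaps mapper work, so the slowest mapper no longer gates when reduction can begin, which is exactly the latency Spark's sequential stage barriers used to impose.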

This hybrid approach allows Spark to achieve sub-100ms responsiveness, making it suitable for critical, ultra-low latency applications. Databricks reports that RTM is already in production, powering real-time use cases for customers in finance and travel, demonstrating tangible millisecond latency improvements.