Databricks is rolling out a suite of new sketch functions, built on Apache DataSketches, designed to dramatically accelerate common analytical queries. These functions offer approximate answers to complex questions, enabling faster decision-making without the hefty compute costs associated with exact calculations.
The core benefit is speed: compute-intensive tasks such as percentile calculations, distinct counts, and top-K rankings can drop from minutes or hours to milliseconds. The sketches achieve this with bounded-memory approximations, typically at a configurable relative error of 1-2%, a trade-off that is acceptable for many decision-support scenarios.
Faster Percentiles
Calculating exact percentiles on massive datasets generally requires a global sort, which can consume substantial resources and time. Databricks' new KLL quantile sketches instead estimate quantiles such as P50, P90, and P99 over trillions of data points using a small, bounded amount of memory. The sketches are also mergeable: summaries built over separate partitions or time windows can be combined without rescanning the raw data, which allows incremental updates and quick retrieval of percentiles from pre-computed summaries stored in Delta tables.
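To make the mergeability idea concrete, here is a minimal Python sketch of a compactor-based quantile summary in the spirit of KLL. It is an illustration only, not Databricks' or Apache DataSketches' implementation: the class name CompactorSketch, the parameter k, and the fixed per-level capacity are assumptions made for brevity (real KLL uses shrinking level capacities and tighter error guarantees).

```python
import random


class CompactorSketch:
    """Simplified mergeable quantile sketch (illustrative; not the real KLL)."""

    def __init__(self, k=256, seed=0):
        self.k = k                      # capacity of a level before it is compacted
        self.levels = [[]]              # levels[h] holds items that each stand for 2**h inputs
        self.n = 0                      # total number of values summarized
        self.rng = random.Random(seed)

    def update(self, value):
        self.levels[0].append(value)
        self.n += 1
        if len(self.levels[0]) >= self.k:
            self._compact()

    def merge(self, other):
        # Merging concatenates same-weight levels, then re-compacts.
        while len(self.levels) < len(other.levels):
            self.levels.append([])
        for h, items in enumerate(other.levels):
            self.levels[h].extend(items)
        self.n += other.n
        self._compact()

    def _compact(self):
        h = 0
        while h < len(self.levels):
            if len(self.levels[h]) >= self.k:
                buf = sorted(self.levels[h])
                # Keep one item behind if the count is odd so total weight is preserved.
                leftover = [buf.pop()] if len(buf) % 2 else []
                start = self.rng.randint(0, 1)   # random choice of odd or even positions
                if h + 1 == len(self.levels):
                    self.levels.append([])
                # Every surviving item now represents twice as many inputs.
                self.levels[h + 1].extend(buf[start::2])
                self.levels[h] = leftover
            h += 1

    def quantile(self, q):
        if self.n == 0:
            raise ValueError("sketch is empty")
        weighted = sorted((v, 1 << h) for h, lvl in enumerate(self.levels) for v in lvl)
        target = q * self.n
        seen = 0
        for value, weight in weighted:
            seen += weight
            if seen >= target:
                return value
        return weighted[-1][0]


# Build two partial summaries (for example, one per partition), then merge them.
values = list(range(100_000))
random.Random(42).shuffle(values)
part_a, part_b = CompactorSketch(seed=1), CompactorSketch(seed=2)
for v in values[:50_000]:
    part_a.update(v)
for v in values[50_000:]:
    part_b.update(v)
part_a.merge(part_b)

p50, p90, p99 = (part_a.quantile(q) for q in (0.5, 0.9, 0.99))
retained = sum(len(lvl) for lvl in part_a.levels)
print(p50, p90, p99, retained)
```

The merged summary retains only a few thousand items for 100,000 inputs, and the percentile estimates come from those retained items rather than from a sort over the full dataset. A production system would store the serialized sketch per partition and merge at query time, which is the pattern the article describes for Delta tables.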