Databricks Speeds Up Analytics with Sketch Functions

Databricks enhances analytics with new sketch functions, delivering orders-of-magnitude speedups for percentile, distinct count, and top-K queries.

Databricks is rolling out a suite of new sketch functions, built on Apache DataSketches, designed to dramatically accelerate common analytical queries. These functions offer approximate answers to complex questions, enabling faster decision-making without the hefty compute costs associated with exact calculations.

The core benefit lies in cutting compute-intensive tasks such as percentile calculations, distinct counts, and top-K rankings from minutes or hours to milliseconds. The functions achieve this with bounded-memory approximations, typically with a configurable relative error of 1-2%, a trade-off that is acceptable for many decision-support scenarios. This is approximate query processing: a small, known loss of accuracy is exchanged for a large gain in query performance.

Faster Percentiles

Calculating exact percentiles on massive datasets often requires a global sort, which can consume substantial resources and time. Databricks' new KLL quantile sketches, however, can estimate quantiles such as P50, P90, and P99 over trillions of data points in a small, bounded amount of memory. The sketches are also mergeable: summaries can be pre-computed per partition or time window, stored in Delta tables, updated incrementally, and combined at query time to return percentiles quickly.
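To make the idea of a mergeable quantile summary concrete, here is a minimal pure-Python sketch. It is a simplified compactor scheme in the spirit of KLL, not the Apache DataSketches implementation and not the Databricks API: the class name `QuantileSketch`, the uniform per-level capacity `k`, and the sample data are all assumptions made for illustration.

```python
import random


class QuantileSketch:
    """Simplified KLL-style compactor sketch (illustrative only)."""

    def __init__(self, k=200, seed=0):
        self.k = k                  # per-level buffer capacity (assumed uniform)
        self.levels = [[]]          # an item at level h stands for 2**h inputs
        self.n = 0                  # total number of values summarized
        self._rng = random.Random(seed)

    def update(self, value):
        self.levels[0].append(value)
        self.n += 1
        self._compact()

    def merge(self, other):
        # Merging concatenates buffers level by level, then re-compacts.
        while len(self.levels) < len(other.levels):
            self.levels.append([])
        for h, buf in enumerate(other.levels):
            self.levels[h].extend(buf)
        self.n += other.n
        self._compact()

    def _compact(self):
        h = 0
        while h < len(self.levels):
            buf = self.levels[h]
            if len(buf) >= self.k:
                buf.sort()
                # Keep one item behind if the buffer is odd, so weight is preserved.
                leftover = [buf.pop()] if len(buf) % 2 else []
                # Promote every other item (random offset) with doubled weight.
                promoted = buf[self._rng.randint(0, 1)::2]
                self.levels[h] = leftover
                if h + 1 == len(self.levels):
                    self.levels.append([])
                self.levels[h + 1].extend(promoted)
            h += 1

    def quantile(self, q):
        weighted = sorted((x, 1 << h) for h, buf in enumerate(self.levels) for x in buf)
        if not weighted:
            raise ValueError("empty sketch")
        target, seen = q * self.n, 0
        for x, w in weighted:
            seen += w
            if seen >= target:
                return x
        return weighted[-1][0]


# Hypothetical data: two partitions summarized separately, then merged.
values = list(range(100_000))
random.Random(42).shuffle(values)
day1, day2 = QuantileSketch(), QuantileSketch()
for v in values[:50_000]:
    day1.update(v)
for v in values[50_000:]:
    day2.update(v)
day1.merge(day2)
p50, p99 = day1.quantile(0.5), day1.quantile(0.99)
retained = sum(len(buf) for buf in day1.levels)
print(p50, p99, retained)
```

The merged summary retains only a few hundred to a few thousand values regardless of how the input was partitioned, which is what makes storing one summary per partition and combining them later practical.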

Efficient Audience Overlap Analysis

Understanding audience overlap across different campaigns is crucial for marketing. Traditional methods involve complex set operations on potentially billions of user IDs, which is computationally prohibitive. Theta sketches, now supported by Databricks, summarize distinct value sets in compact, mergeable formats. They enable rapid unions, intersections, and set differences, making detailed audience analysis practical and cost-effective.
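The set operations described above can be illustrated with a small pure-Python sketch of the underlying idea: hash every ID to a number in [0, 1), keep only the hashes below a threshold theta, and scale the retained count by 1/theta. This is a teaching sketch under stated assumptions, not the DataSketches or Databricks implementation; the names `ThetaSketch` and `hash01`, the capacity `k`, and the campaign data are hypothetical.

```python
import hashlib


def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class ThetaSketch:
    """Keeps the hashes below theta; estimate = retained count / theta."""

    def __init__(self, k=4096):
        self.k = k
        self.theta = 1.0
        self.hashes = set()

    def update(self, item):
        h = hash01(item)
        if h < self.theta:
            self.hashes.add(h)
            if len(self.hashes) > 2 * self.k:
                # Lower theta so that only the k smallest hashes remain.
                ordered = sorted(self.hashes)
                self.theta = ordered[self.k]
                self.hashes = set(ordered[:self.k])

    def estimate(self):
        return len(self.hashes) / self.theta

    def _derive(self, theta, hashes):
        out = ThetaSketch(self.k)
        out.theta, out.hashes = theta, hashes
        return out

    def union(self, other):
        theta = min(self.theta, other.theta)
        return self._derive(theta, {h for h in self.hashes | other.hashes if h < theta})

    def intersection(self, other):
        theta = min(self.theta, other.theta)
        return self._derive(theta, {h for h in self.hashes & other.hashes if h < theta})

    def difference(self, other):
        theta = min(self.theta, other.theta)
        return self._derive(theta, {h for h in self.hashes - other.hashes if h < theta})


# Hypothetical campaigns: 60,000 users each, 20,000 of them in both.
campaign_a, campaign_b = ThetaSketch(), ThetaSketch()
for i in range(0, 60_000):
    campaign_a.update(f"user-{i}")
for i in range(40_000, 100_000):
    campaign_b.update(f"user-{i}")

reach = campaign_a.union(campaign_b).estimate()
overlap = campaign_a.intersection(campaign_b).estimate()
only_a = campaign_a.difference(campaign_b).estimate()
print(round(reach), round(overlap), round(only_a))
```

Because both sketches use the same hash function, an ID that appears in both campaigns maps to the same hash, so set operations on the retained hashes mirror set operations on the full ID sets.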

Real-Time Leaderboards and Aggregations

Identifying trending items or building real-time leaderboards from high-cardinality event streams has historically been a batch-oriented problem. Approximate top-K sketch functions, however, can track the most frequent items in bounded memory. These sketches can be merged across time windows or partitions, allowing for instant aggregation and the creation of live leaderboards without reprocessing raw data.
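One well-known way to track frequent items in bounded memory is a Misra-Gries style counter table, shown below in its mergeable form. The article does not say which algorithm the Databricks functions use, so treat this purely as an illustration; `TopKSketch`, its `capacity` parameter, and the event stream are assumptions.

```python
import random
from collections import Counter


class TopKSketch:
    """Mergeable Misra-Gries style summary holding at most `capacity` counters."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.counts = Counter()

    def update(self, item, weight=1):
        self.counts[item] += weight
        self._shrink()

    def merge(self, other):
        self.counts.update(other.counts)  # Counter.update adds counts together
        self._shrink()

    def _shrink(self):
        if len(self.counts) <= self.capacity:
            return
        # Subtract the (capacity + 1)-th largest count from every counter and
        # drop anything that falls to zero or below.
        cut = sorted(self.counts.values(), reverse=True)[self.capacity]
        self.counts = Counter({k: c - cut for k, c in self.counts.items() if c > cut})

    def top(self, n):
        # Counts are lower bounds on the true frequencies.
        return self.counts.most_common(n)


# Hypothetical event stream: three heavy hitters plus 6,000 one-off items.
events = ["a"] * 5_000 + ["b"] * 3_000 + ["c"] * 1_000
events += [f"noise-{i}" for i in range(6_000)]
random.Random(7).shuffle(events)

window1, window2 = TopKSketch(), TopKSketch()
for e in events[:7_500]:
    window1.update(e)
for e in events[7_500:]:
    window2.update(e)
window1.merge(window2)
leaders = window1.top(3)
print(leaders)
```

In this scheme each reported count can undercount the true frequency by at most the total number of events divided by `capacity + 1`, so items that are genuinely frequent survive both streaming updates and merges across windows.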

Combined Distinct Counts and Metric Aggregations

Tuple sketches offer a novel way to count distinct entities and aggregate associated metrics at the same time. For instance, they can track unique customers and sum their total revenue in a single pass. Because each metric stays attached to the distinct key it belongs to, this avoids double-counting and simplifies complex attribution tasks that would otherwise need multi-step aggregation. Together with the other sketch types, tuple sketches extend the approximate analytics capabilities of the Databricks Lakehouse Platform.
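The customer and revenue example can be sketched by attaching a running sum to every retained hash of a theta-style sample. As before, this is a simplified illustration rather than the DataSketches or Databricks implementation; `TupleSketch`, `hash01`, the capacity `k`, and the order data are hypothetical.

```python
import hashlib


def hash01(key):
    """Map a key to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class TupleSketch:
    """Theta-style sample of distinct keys, each carrying a summed metric."""

    def __init__(self, k=4096):
        self.k = k
        self.theta = 1.0
        self.entries = {}  # hash of key -> summed metric for that key

    def update(self, key, value):
        h = hash01(key)
        if h >= self.theta:
            return
        self.entries[h] = self.entries.get(h, 0.0) + value
        if len(self.entries) > 2 * self.k:
            # Keep only the k smallest hashes and lower theta accordingly.
            ordered = sorted(self.entries)
            self.theta = ordered[self.k]
            self.entries = {x: self.entries[x] for x in ordered[:self.k]}

    def distinct_estimate(self):
        return len(self.entries) / self.theta

    def sum_estimate(self):
        return sum(self.entries.values()) / self.theta


# Hypothetical orders: 50,000 customers with three orders each.
sketch = TupleSketch()
true_revenue = 0.0
for i in range(50_000):
    for _ in range(3):
        amount = 5.0 + (i % 10)
        sketch.update(f"customer-{i}", amount)
        true_revenue += amount

customers = sketch.distinct_estimate()
revenue = sketch.sum_estimate()
print(round(customers), round(revenue), round(true_revenue))
```

Repeat orders from the same customer land on the same hash, so they add to that customer's total without increasing the distinct count, which is how a single pass yields both numbers.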

The introduction of these Databricks sketch functions provides a powerful toolset for organizations looking to extract more value from their data faster and more efficiently.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.