Databricks is rolling out a significant performance upgrade for its Agent Bricks Knowledge Assistant, cutting search times by more than 3x. The upgrade also halves answer generation time, bringing time to first token down to approximately two seconds. The gains come from a new model, Instructed-Retriever-1, which uses a technique called parallel test-time scaling, as detailed in a Databricks blog post.
Traditional agentic search systems often work sequentially: they issue a search, read the results, and only then decide on the next step, so every round adds latency. Instructed-Retriever-1 instead parallelizes the initial search phase. This retrieves broader evidence and selects context more precisely up front, substantially cutting response times.
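To illustrate why sequential search adds latency, here is a minimal sketch that simulates search calls with a fixed delay. The `search` function and the 0.05 s delay are hypothetical stand-ins, not Databricks APIs or measured figures; the point is only that sequential latency grows with the number of searches, while parallel latency stays close to that of a single call.

```python
import asyncio
import time


async def search(query: str, delay: float = 0.05) -> list[str]:
    # Hypothetical search call; the sleep simulates index/network latency.
    await asyncio.sleep(delay)
    return [f"doc for {query!r}"]


async def sequential_search(queries: list[str]) -> list[str]:
    # Each call waits for the previous one, so latency grows with len(queries).
    results: list[str] = []
    for q in queries:
        results.extend(await search(q))
    return results


async def parallel_search(queries: list[str]) -> list[str]:
    # All calls are issued at once; gather keeps results in input order.
    batches = await asyncio.gather(*(search(q) for q in queries))
    return [doc for batch in batches for doc in batch]


def timed(search_fn, queries: list[str]) -> tuple[list[str], float]:
    # Run one strategy to completion and report wall-clock time.
    start = time.perf_counter()
    results = asyncio.run(search_fn(queries))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    qs = ["pricing", "pricing 2024", "pricing tiers", "pricing discounts"]
    seq_docs, seq_t = timed(sequential_search, qs)
    par_docs, par_t = timed(parallel_search, qs)
    print(f"sequential: {seq_t:.2f}s, parallel: {par_t:.2f}s")
```

Both strategies return the same documents; only the wall-clock time differs.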
Parallelizing the Search Pipeline
The core innovation lies in how Databricks spends test-time compute. Instead of spending it sequentially on steps like tool calls or reasoning, the system fans these tasks out in parallel during the initial search. This widens the pool of retrieved evidence and then narrows it down without adding sequential round trips.
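The fan-out idea can be sketched as running several retrieval branches concurrently and merging their rankings. This is an illustration under stated assumptions, not Databricks' method: the toy corpus, the keyword retriever, and the query variants are hypothetical, and reciprocal rank fusion is a swapped-in merge step chosen only because it is a simple, well-known way to combine ranked lists.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Toy corpus; document IDs and text are illustrative only.
CORPUS = {
    "d1": "quarterly revenue report for 2024",
    "d2": "revenue growth drivers and pricing",
    "d3": "employee onboarding handbook",
    "d4": "pricing tiers and discount policy",
}


def keyword_search(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: rank docs by number of shared query terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored[:k]]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc earns 1 / (k + rank) from every list it is in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def fan_out_search(query_variants: list[str]) -> list[str]:
    # Spend extra test-time compute in parallel: one branch per query variant.
    with ThreadPoolExecutor(max_workers=max(1, len(query_variants))) as pool:
        rankings = list(pool.map(keyword_search, query_variants))
    return reciprocal_rank_fusion(rankings)


if __name__ == "__main__":
    print(fan_out_search(["revenue report", "pricing discount", "revenue pricing"]))
    # → ['d2', 'd1', 'd4']
```

Document `d2` ranks first because several branches surface it, even though no single branch ranks it first on its own terms; this is how parallel branches can broaden evidence and still produce a focused list.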
Instructed-Retriever-1 is a single model trained for two retrieval stages: query generation, which improves recall, and reranking, which improves precision. Both stages are parallelized to keep latency low.
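The two roles can be sketched as one object that both rewrites the question into extra queries (recall) and scores the pooled candidates (precision). Everything here is hypothetical: `TwoRoleRetriever` is a rule-based placeholder, not Instructed-Retriever-1, and the corpus, synonym table, and scoring rule are assumptions chosen to make the recall/precision split visible.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus and synonym table; both are illustrative assumptions.
CORPUS = {
    "d1": "how to terminate a subscription",
    "d2": "subscription pricing for teams",
    "d3": "office holiday schedule",
}
SYNONYMS = {"cancel": ["terminate", "end"], "plan": ["subscription"]}
# Reverse map so the reranker treats synonyms as the same concept.
CANONICAL = {syn: word for word, syns in SYNONYMS.items() for syn in syns}


def canon(text: str) -> set[str]:
    return {CANONICAL.get(w, w) for w in text.lower().split()}


def retrieve(query: str) -> set[str]:
    """Broad keyword retriever: any shared term counts as a hit."""
    terms = set(query.lower().split())
    return {d for d, text in CORPUS.items() if terms & set(text.split())}


class TwoRoleRetriever:
    """Hypothetical stand-in for one model serving both stages (rule-based)."""

    def generate_queries(self, question: str) -> list[str]:
        # Stage 1 (recall): rewrite the question into extra query variants.
        words = question.lower().split()
        variants = [" ".join(words)]
        for i, w in enumerate(words):
            for syn in SYNONYMS.get(w, []):
                variants.append(" ".join(words[:i] + [syn] + words[i + 1:]))
        return variants

    def rerank_score(self, question: str, doc_id: str) -> float:
        # Stage 2 (precision): share of question concepts the document covers.
        q = canon(question)
        return len(q & canon(CORPUS[doc_id])) / len(q)


def select_context(model: TwoRoleRetriever, question: str, top_k: int = 2) -> list[str]:
    queries = model.generate_queries(question)
    with ThreadPoolExecutor() as pool:
        # Search all query variants in parallel, then pool the hits.
        candidates = sorted(set().union(*pool.map(retrieve, queries)))
        # Score all candidates in parallel with the same model.
        scores = list(pool.map(lambda d: model.rerank_score(question, d), candidates))
    ranked = sorted(zip(scores, candidates), key=lambda p: (-p[0], p[1]))
    return [d for _, d in ranked[:top_k]]


if __name__ == "__main__":
    print(select_context(TwoRoleRetriever(), "cancel my plan"))
    # → ['d1', 'd2']
```

Searching with the literal question finds nothing; the generated variants recover `d1` and `d2`, and the reranking step then places the document that covers more of the question's concepts first.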