Databricks Tackles Code Complexity for AI Assistants

Databricks details how AST-based chunking and MLflow evaluation improve AI assistants' understanding of complex codebases.

Databricks blog post illustrating code chunking strategies for AI assistants.
Image credit: StartupHub.ai

Navigating unfamiliar codebases is a perennial challenge for developers. Databricks aims to streamline this with its Knowledge Assistant, a tool designed to answer natural-language questions about code. The effectiveness of such assistants, however, hinges on how the source code is processed and indexed, particularly on chunking: splitting code into manageable pieces for retrieval. As a recent Databricks blog post details, chunking code for retrieval-augmented generation (RAG) poses distinct difficulties compared to chunking traditional documents.

Unlike linear business documents, code possesses an intricate hierarchical structure: files contain classes, classes house methods, and methods contain logical blocks. The semantic unit in code is often a complete function, not a paragraph. This complexity means that splitting code arbitrarily can sever crucial context, rendering even advanced assistants less effective. For instance, a chunk containing only part of a function definition, or lacking its surrounding class context, is significantly less useful.

The Chunking Conundrum

Databricks experimented with three distinct chunking strategies for its Knowledge Assistant, using its Casper’s Kitchens demo repository as a testbed. This repository, simulating a ghost kitchen business, exercises a wide array of Databricks features and includes code in various formats, including Python files, Jupyter notebooks, Markdown, and YAML. The goal was to determine which chunking approach best preserves the code's structural integrity and semantic meaning for AI comprehension.

The simplest method, a naive baseline using fixed-size character chunks with overlap, treats code as plain text. This approach is easy to implement but fundamentally flawed for code, often producing fragmented identifiers and incomplete code blocks. For example, a search for "deletion_order" might surface only a fragment like "eletion" because the identifier was cut at a chunk boundary.
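The failure mode is easy to reproduce. A minimal sketch of fixed-size chunking with overlap (the sizes here are deliberately tiny for illustration, not the values Databricks used):

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list[str]:
    """Naive baseline: slide a fixed-size window over the raw text,
    stepping by (size - overlap) so consecutive chunks share characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

code = "def compute_deletion_order(items):\n    return sorted(items)\n"
chunks = chunk_fixed(code, size=20, overlap=5)
# The identifier "deletion_order" is split mid-word: no chunk contains it
# whole, so both lexical and embedding search over the chunks degrade.
print(chunks[1])  # starts with the orphaned fragment "etion_order("
```

Overlap softens the damage but cannot guarantee that any single chunk holds a complete identifier, let alone a complete function.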

A step up, LangChain's RecursiveCharacterTextSplitter, employs language-specific separators to favor logical boundaries like function and class definitions. While better at preserving functions, it still imposes strict size limits, sometimes leading to incomplete context.
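The idea can be sketched in a few lines. This is a simplified stand-in for what a language-aware recursive splitter does, not LangChain's actual implementation (it omits overlap and chunk merging, and the separator list is abbreviated):

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Prefer the coarsest separator; recurse with finer separators only
    when a piece still exceeds chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for i, part in enumerate(text.split(sep)):
        piece = sep + part if i else part  # re-attach separator to keep code intact
        if len(piece) > chunk_size:
            pieces.extend(recursive_split(piece, finer, chunk_size))
        else:
            pieces.append(piece)
    return pieces

# Language-aware separators favor class/function boundaries over blank lines.
python_separators = ["\nclass ", "\ndef ", "\n\n", "\n", " "]
src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = recursive_split(src, python_separators, chunk_size=30)
# Each function survives as one chunk instead of being cut mid-body.
```

The improvement over the naive baseline is that cuts land at syntactic boundaries when possible; the remaining weakness is that the size limit still wins when a function is larger than the budget.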

The most sophisticated strategy involved Abstract Syntax Tree (AST) based chunking, utilizing parsers like Tree-sitter. This method analyzes the code's actual syntactic structure, chunking at semantic boundaries such as entire functions, classes, or statement blocks. Crucially, it can exceed size limits to keep a complete unit intact. The AST-based approach also prepends metadata headers, including file path and hierarchy, to each chunk, enhancing retrieval accuracy.
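The same pattern can be sketched with Python's standard-library ast module; the blog post uses Tree-sitter, which generalizes this idea to many languages, and the header format below is illustrative rather than Databricks' exact scheme:

```python
import ast

def ast_chunk(source: str, file_path: str) -> list[str]:
    """Chunk at semantic boundaries: one chunk per top-level function or
    class, kept whole regardless of size, with a metadata header prepended."""
    tree = ast.parse(source)
    src_lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno give the node's full source span (Python 3.8+).
            body = "\n".join(src_lines[node.lineno - 1:node.end_lineno])
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append(f"# file: {file_path} | {kind}: {node.name}\n{body}")
    return chunks

src = "class Kitchen:\n    def open(self):\n        return True\n"
chunks = ast_chunk(src, "kitchens/models.py")
# One chunk, containing the complete class plus a header locating it.
```

Because the chunk boundary is a node in the syntax tree, a retrieved chunk is always a self-contained unit, and the header tells the model where in the repository that unit lives.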

Systematic Evaluation with MLflow

To objectively compare these strategies, Databricks leveraged MLflow’s GenAI evaluation framework. This framework allows for systematic comparison of retrieval systems by using an evaluation dataset, a prediction function, and LLM judges.

An evaluation dataset of 46 questions was curated, focusing on disambiguation queries that highlight weaknesses in chunking, such as distinguishing between similar code constructs in different contexts. The judges included built-in metrics like RetrievalSufficiency and RetrievalGroundedness, alongside a custom answer_correctness scorer to assess accuracy nuances.
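The shape of such a harness can be sketched without the MLflow dependency. Everything here is illustrative: the names eval_set, predict, and answer_correctness are hypothetical, and MLflow's built-in judges delegate scoring to an LLM rather than the keyword-overlap stand-in used below:

```python
# Toy miniature of the evaluation loop: a dataset of questions with
# expected facts, a prediction function, and a custom correctness scorer.
eval_set = [
    {"question": "What does deletion_order control?",
     "expected_facts": ["teardown", "dependency order"]},
]

def predict(question: str) -> str:
    # Stand-in for retrieval over the chunked index plus generation.
    return "deletion_order controls teardown of resources in dependency order."

def answer_correctness(answer: str, expected_facts: list[str]) -> float:
    # Fraction of expected facts mentioned in the answer. An LLM judge
    # would make this call with far more nuance than substring matching.
    hits = sum(fact in answer.lower() for fact in expected_facts)
    return hits / len(expected_facts)

scores = [answer_correctness(predict(ex["question"]), ex["expected_facts"])
          for ex in eval_set]
print(sum(scores) / len(scores))  # mean correctness across the dataset -> 1.0
```

Swapping in a different chunking strategy changes only the index behind predict, so the same dataset and scorers can rank all three strategies on equal footing.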

All strategies were tested with a target chunk size of 1,000 characters and an overlap of 200 characters, using the same embedding model. The results, visualized in MLflow’s experiment UI, demonstrated that AST-based chunking with metadata headers significantly improved retrieval quality by preserving essential code context.

The evaluation process itself underscored the value of structured datasets, traceable results, and custom LLM judges tailored to specific evaluation criteria, making iterative improvement of code-chunking strategies for RAG practical. This approach is key to building more robust and intelligent code-aware AI assistants, offering a glimpse into the future of developer productivity and agentic AI.