
Claude Code Benchmarking: Semantic Search vs. Grep

Turbopuffer's Kuba Rogut benchmarks semantic code retrieval on Claude Code, revealing how semantic search enhances AI agent precision and efficiency compared to grep.

Jun 3 at 4:02 PM · 8 min read

Image: Kuba Rogut presenting 'Benchmarking semantic code retrieval on Claude Code' at an AI Engineer Europe event. Credit: AI Engineer

Visual TL;DR. The need for better AI code search motivates the Claude Code benchmark. The benchmark weighs the limitations of grep against semantic search and reveals a semantic search advantage, which embeddings make possible. That advantage leads to enhanced AI precision, and the benchmark also points to future directions.

AI Code Search Need: AI models need better code understanding for agents
Claude Code Benchmarking: Kuba Rogut benchmarks semantic search vs. grep
Semantic Search Advantage: Semantic search retrieves code by meaning rather than exact text, so agents waste fewer reads
Grep Limitations: Traditional grep struggles with semantic code understanding
Embeddings Role: Embeddings power semantic understanding of code
Enhanced AI Precision: Semantic search improves AI agent precision and efficiency
Future Directions: Exploring further applications of semantic code retrieval


Kuba Rogut from Turbopuffer recently presented a talk titled "Benchmarking semantic code retrieval on Claude Code," exploring how different retrieval approaches affect the ability of AI agents to understand and navigate codebases. The talk highlighted key findings on the efficacy of semantic search compared to traditional methods like 'grep'.

Video: Claude Code Benchmarking: Semantic Search vs. Grep, from AI Engineer

Understanding the Need for Semantic Code Retrieval

Rogut began by referencing a discussion on Twitter about why coding agents such as Codex and Claude Code do not use cloud-based embeddings for code search by default. A key insight shared was that early versions of Claude Code did use Retrieval Augmented Generation (RAG) with a local vector database. However, the team found that agentic search, in which the model explores the codebase itself with tools such as 'grep' instead of querying a pre-built index, generally performed better and was simpler to implement. This approach also sidesteps the security, privacy, staleness, and reliability issues that can come with maintaining an embedding index.

The presentation then turned to research from Cursor, which reported that adding semantic search improves code retention and reduces user dissatisfaction. Cursor's benchmarks showed the improvement holding across various models, with semantic search yielding notable gains in answer accuracy and overall efficiency.

Benchmarking Methodology and Results

Turbopuffer developed a benchmark called 'ContextBench' to evaluate whether AI agents read the specific 'golden files and lines' that hold the context needed to solve a task. Agents were tested on a subset of 50 tasks that do not mention files or functions by name, so the agent has to infer where the relevant code lives. Three conditions were tested:

Baseline: Raw Claude Code without any enhancements.
Windowed: Claude Code with a maximum of 50 lines read at a time.
Semantic: Claude Code with a maximum of 50 lines read, augmented by a 'grep + semantic search CLI tool'.
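
To make the 50-line cap concrete, here is a minimal sketch of what a windowed read could look like. The talk does not describe how the cap was enforced, so the function name `read_window`, its return shape, and the 120-line example file are all hypothetical.

```python
from typing import List, Tuple

# The cap described for the Windowed and Semantic conditions.
MAX_LINES_PER_READ = 50


def read_window(
    lines: List[str], start: int, max_lines: int = MAX_LINES_PER_READ
) -> Tuple[List[str], bool]:
    """Return up to max_lines lines starting at the 0-based index `start`,
    plus a flag that says whether more lines remain after the window."""
    if start < 0:
        raise ValueError("start must be non-negative")
    end = min(start + max_lines, len(lines))
    return lines[start:end], end < len(lines)


if __name__ == "__main__":
    # Hypothetical 120-line file: reading all of it takes three capped reads.
    fake_file = [f"line {i}" for i in range(120)]
    offset, reads, more = 0, 0, True
    while more:
        window, more = read_window(fake_file, offset)
        offset += len(window)
        reads += 1
    print(reads)
```

Under a cap like this, every read of an irrelevant file costs at most 50 lines of context, which is one plausible reason a windowed agent wastes less than an uncapped one.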

The results indicated a clear precision advantage for semantic search. Precision here is the share of an agent's file reads that were actually needed. With baseline Claude Code, 1 in 3 file reads was wasted on irrelevant code. This improved to 1 in 5 with 'grep' (which appears to correspond to the windowed condition above) and to only 1 in 8 with 'grep + semantic'. Semantic search therefore helps agents focus on the relevant parts of the codebase and reduces wasted computational effort.
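
The wasted-read ratios translate directly into precision figures. The short calculation below uses only the three ratios quoted in the article; the condition labels are the ones used in the results, and the helper name `precision_from_wasted` is made up for illustration.

```python
from fractions import Fraction

# Share of file reads wasted on irrelevant code, as reported in the talk.
wasted = {
    "baseline": Fraction(1, 3),
    "grep": Fraction(1, 5),
    "grep + semantic": Fraction(1, 8),
}


def precision_from_wasted(wasted_share: Fraction) -> Fraction:
    """Precision is the complement of the wasted share of reads."""
    return 1 - wasted_share


for name, share in wasted.items():
    precision = precision_from_wasted(share)
    print(f"{name}: {float(precision):.1%} of reads were relevant")

# baseline: 66.7% of reads were relevant
# grep: 80.0% of reads were relevant
# grep + semantic: 87.5% of reads were relevant
```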

Regarding recall (how many of the needed files the agent found), baseline Claude Code performed best: it found more of the needed files, but at the cost of efficiency because it also read more irrelevant ones. Both 'grep' and 'grep + semantic' found fewer files, but a higher proportion of what they read was relevant. The 'grep + semantic' approach showed recall similar to 'grep' and was slightly worse in some cases, which suggests the gain from semantic search lies in precision rather than recall.
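
The trade-off between the two metrics is easier to see with a toy calculation. This sketch scores an agent at the file level only (the benchmark also tracks lines), and the file names and read sets are invented for illustration, not taken from the benchmark.

```python
from typing import Set, Tuple


def precision_recall(read: Set[str], golden: Set[str]) -> Tuple[float, float]:
    """Precision: share of files read that were needed.
    Recall: share of needed files that were read."""
    hits = read & golden
    precision = len(hits) / len(read) if read else 0.0
    recall = len(hits) / len(golden) if golden else 0.0
    return precision, recall


# Hypothetical golden set for one task.
golden = {"auth/session.py", "auth/tokens.py", "db/models.py"}

# A broad reader finds everything but also reads irrelevant files.
broad = golden | {"README.md", "utils/log.py", "tests/conftest.py"}

# A focused reader wastes nothing but misses one needed file.
focused = {"auth/session.py", "auth/tokens.py"}

print(precision_recall(broad, golden))
print(precision_recall(focused, golden))
```

The broad reader scores perfect recall with half its reads wasted, while the focused reader wastes nothing but misses a file, which mirrors the pattern reported for the baseline versus the grep-based conditions.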

The Role of Embeddings and Future Directions

Rogut emphasized that embeddings are essentially 'cached compute': the cost of understanding a piece of code is paid once, when the embedding is generated, and every later query reuses the result. Amortizing that cost across queries is crucial for scaling AI agent capabilities. The presentation also touched on the internals of the 'turbogrep-v2' CLI tool, highlighting components such as chunking, embedding, and indexing. The tool uses an HTTP client for Voyage AI to generate embeddings and a tree-sitter library to parse code.
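
The 'cached compute' idea can be sketched as a chunk, embed, index pipeline in which each chunk is embedded once and reused by every query. This is not how 'turbogrep-v2' is implemented: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model (it matches shared tokens, not meaning), `chunk_lines` splits by fixed line count instead of parsing with tree-sitter, and every name here is hypothetical.

```python
import hashlib
import math
import re
from typing import Dict, List, Tuple

DIM = 64  # size of the toy embedding vector


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class EmbeddingCache:
    """Embeds each distinct chunk once, keyed by a hash of its content."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}
        self.model_calls = 0  # how many times the (toy) model actually ran

    def get(self, chunk: str) -> List[float]:
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key not in self._store:
            self.model_calls += 1
            self._store[key] = toy_embed(chunk)
        return self._store[key]


def chunk_lines(source: str, size: int = 3) -> List[str]:
    """Naive chunker: fixed-size groups of lines (an assumption, not tree-sitter)."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def search(query: str, chunks: List[str], cache: EmbeddingCache) -> Tuple[str, float]:
    """Return the chunk with the highest cosine similarity to the query."""
    q = toy_embed(query)
    scored = [(c, sum(a * b for a, b in zip(q, cache.get(c)))) for c in chunks]
    return max(scored, key=lambda pair: pair[1])


SOURCE = """def login(user, password):
    session = create_session(user)
    return session
def refresh_token(session):
    token = issue_token(session)
    return token
"""

if __name__ == "__main__":
    cache = EmbeddingCache()
    chunks = chunk_lines(SOURCE)
    for c in chunks:           # indexing: pay the embedding cost once
        cache.get(c)
    search("refresh token", chunks, cache)
    search("create session", chunks, cache)
    print(cache.model_calls)   # still equal to the number of chunks
```

Keying the cache by content hash also means an unchanged chunk never needs re-embedding after an edit elsewhere in the file, which is one common way to limit the staleness cost of an index.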

The core takeaway from the benchmark was that semantic search significantly boosts precision, leading to more efficient and effective code understanding for AI agents. While 'grep' remains a valuable tool for its simplicity and zero cost, semantic search offers a more nuanced understanding of code context. Rogut concluded by suggesting that future winners in this space will likely provide lightweight tools that can find the right context in various ways, catering to different workloads and data types.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
