The Complete Guide to Text Indexing for AI Coding Agents: From Trigrams to Sparse N-Grams
AI coding agents have a search problem. The models are fast, the reasoning is good, and the code generation keeps getting better. But before an agent can write a single line, it needs to find the right context. And finding context in a large codebase means grep. A lot of grep.
Cursor's agents routinely invoke ripgrep, the fastest grep implementation available, and still wait 15+ seconds for results in enterprise monorepos. That is an eternity in a workflow where a developer is actively guiding an agent in real time. On March 23, 2026, Cursor published research on how they are building local text indexes to solve this problem. Their work draws on 30+ years of computer science, from the 1993 paper that first proposed n-gram inverted indexes to the sparse n-gram algorithms that power GitHub's Code Search today.
This guide covers the full technical story. We start with the fundamental problem, explain each indexing approach from first principles, compare what every major AI coding tool does for search, and examine why this problem matters more now than it ever has.
This guide is written by Yuma Heymans (@yumahey), founder of o-mega.ai, the AI workforce platform where autonomous agents learn to use business tool stacks and execute workflows.
Contents
- The Problem: Why AI Agents Spend Most of Their Time Searching
- Why Grep Is Still Fundamental
- How ripgrep Works (and Why It Is Not Enough)
- Inverted Indexes: The Foundation of All Search
- Trigram Indexing: The 1993 Breakthrough
- Suffix Arrays: The Alternative Path
- Probabilistic Masks: Getting to "3.5-Grams"
- Sparse N-Grams: The Current State of the Art
- Client-Side vs. Server-Side: Where Should the Index Live?
- How Every Major AI Coding Tool Handles Search
- The Relationship Between Search Latency and Agent Quality
- What Comes Next
1. The Problem: Why AI Agents Spend Most of Their Time Searching
Before an AI coding agent can fix a bug, add a feature, or refactor code, it needs to understand the codebase. This understanding comes from searching: finding relevant files, reading them, finding related code, reading that, and iterating until the agent has enough context to act.
Research from Cognition (the company behind Devin) found that their agents were spending over 60% of their first turn just on context retrieval. Not thinking, not planning, not writing code. Searching. The bulk of this time was spent on grep and file reading operations, the most basic text operations in computing.
The problem scales linearly with codebase size. A 100MB project searches fast. A 10GB enterprise monorepo does not. And enterprise monorepos are exactly where AI coding agents need to deliver the most value, because those are the codebases where developers struggle the most with navigation and comprehension.
The core bottleneck is straightforward: traditional grep must scan every file on every invocation. There is no memory of previous searches. No index to consult. No way to skip files that cannot possibly contain a match. Every search starts from scratch, reading bytes from disk at whatever throughput the storage system allows.
This is the problem that text indexing solves. By precomputing data structures that map text patterns to the files containing them, an index can reduce the search space from "every file in the repository" to "only the files that could possibly match." The difference between searching 100,000 files and searching 25 files is the difference between a 15-second wait and an instant response.
2. Why Grep Is Still Fundamental
You might expect that modern AI coding agents would use more sophisticated tools than grep. Language servers parse code into abstract syntax trees. Semantic search embeds code into vector spaces. Knowledge graphs map relationships between symbols. All of these exist, and all of them are useful for specific types of queries.
But agents keep coming back to grep.
The reason is that many of the queries agents need to make are precise, literal, and structural. "Where is this function called?" "What files import this module?" "Where is this error message defined?" "What configuration sets this value?" These questions require exact string matching or regular expression matching. No amount of semantic understanding helps when you need to find every line that contains MAX_RETRY_COUNT = followed by a number.
Cursor's own research (November 2025) on semantic search confirmed this. Adding semantic search improved agent accuracy by 12.5% on average. But the improvement came from using semantic search alongside grep, not as a replacement for it. The combination of exact search and semantic search produced the best outcomes, and the agents continued to use grep heavily even when semantic search was available.
This pattern holds across tools. Claude Code uses grep and glob as its primary search tools with no built-in semantic index. GitHub Copilot uses grep in combination with its code search API. Windsurf indexes code semantically but still provides grep tools. Every agent harness has learned the same lesson: you cannot replace grep, you can only augment it.
The practical consequence is that grep performance directly bounds agent performance. If grep is slow, the agent is slow. If grep is fast, the agent can iterate rapidly, searching, reading, adjusting, and searching again in the tight loops that characterize effective codebase exploration. Making grep faster is not an optimization. It is a prerequisite.
3. How ripgrep Works (and Why It Is Not Enough)
ripgrep (rg), developed by Andrew Gallant, is the fastest known grep implementation for unindexed search. Most AI coding agent harnesses, including Cursor and Claude Code, use ripgrep as their search backend. Understanding why it is fast, and why fast is still not fast enough, helps frame the indexing problem.
Why ripgrep Is Fast
Literal extraction from regex. For a pattern like \w+foo\d+, ripgrep extracts the literal substring foo and uses optimized byte scanning (via memchr) to find candidate positions before ever engaging the regex engine. Most regex patterns contain at least some literal substring, and finding that literal first dramatically reduces the work the regex engine needs to do.
The Teddy algorithm. For multi-pattern literal searches, ripgrep uses a SIMD-accelerated algorithm (adapted from Intel's Hyperscan library) that compares 16 bytes simultaneously using packed CPU instructions. This allows scanning at speeds approaching memory bandwidth limits.
Rare byte selection. When scanning for a literal, ripgrep picks the statistically rarest byte in the pattern to search for first. If you are looking for CONFIGURATION, the byte Q is much rarer than O or N in typical source code, so scanning for Q first means fewer false positive positions to check.
Lazy DFA compilation. Rather than building a complete deterministic finite automaton from the regex before searching (which can be expensive for complex patterns), ripgrep builds the DFA incrementally during the search itself. States are created only as they are actually reached.
UTF-8 aware DFA. Character decoding happens inside the automaton rather than as a separate preprocessing step, eliminating redundant work.
Smart I/O. For single large files, ripgrep uses memory mapping (mmap) which provides a ~25% speed improvement. For directory traversal with many small files, it uses buffered I/O because the overhead of creating and destroying memory mappings per file outweighs the benefits.
Performance Numbers
On a 1GB subtitle file:
- Literal English search: ripgrep 0.268s vs. GNU grep 0.516s
- 5-alternation pattern: ripgrep 0.294s vs. GNU grep 2.955s (10x faster)
- Case-insensitive full Unicode: ripgrep ~0.35s vs. Platinum Searcher 17.2s (50x faster)
These numbers are impressive. But they still represent scanning the entire file, byte by byte. On a 10GB monorepo, even at 1GB/s throughput, the minimum wall-clock time for a full scan is 10 seconds. No amount of algorithmic optimization can change this, because the bottleneck is reading every byte, not processing it.
This is the fundamental limitation of unindexed search: the work scales linearly with the total size of the corpus, regardless of how few files actually contain a match. An index breaks this relationship by allowing the search to skip files that cannot possibly match.
4. Inverted Indexes: The Foundation of All Search
Every approach to text indexing builds on the same foundation: the inverted index. Before discussing trigrams, suffix arrays, or sparse n-grams, it is worth understanding this data structure properly, because it is the conceptual substrate for everything that follows.
An inverted index maps tokens to the documents that contain them. The name "inverted" comes from the fact that it reverses the natural relationship. A document contains tokens. An inverted index maps tokens back to documents.
How It Works
Given a set of documents, the construction process is:
- Tokenize each document into a set of tokens.
- For each token, record the document ID in that token's posting list (a sorted list of all documents containing the token).
- Store all posting lists in a dictionary-like structure, keyed by token.
To search, tokenize the query into the same types of tokens, look up each token's posting list, and intersect the posting lists to find documents that contain all query tokens.
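The construction and query steps above can be sketched in a few lines of Python. This is a toy illustration (whitespace tokenization stands in for a real analyzer, and the function names are illustrative):

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to a sorted posting list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():   # toy whitespace tokenizer
            index[token].add(doc_id)
    return {tok: sorted(ids) for tok, ids in index.items()}

def search(index, query):
    """Intersect the posting lists of every query token."""
    postings = [set(index.get(tok, ())) for tok in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["error retry count exceeded",
        "retry logic for network errors",
        "count lines in a file"]
index = build_index(docs)
print(search(index, "retry count"))  # [0]: the only doc containing both tokens
```

Note that the query never touches documents 1 and 2's text at all; only the posting lists for "retry" and "count" are consulted.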
Why It Is Fast
The key insight is that looking up a token in the index and retrieving its posting list is an O(1) dictionary lookup (or O(log n) with binary search), regardless of how many documents exist in total. Intersecting two sorted posting lists costs time proportional to the shorter list, since each of its entries can be checked against the longer list with binary search or skip pointers. This means the total search cost depends on how many documents match, not on how many documents exist. For highly selective tokens (rare identifiers, unusual strings), the posting lists are tiny, and the search completes almost instantly.
The Tokenization Problem
The entire design hinges on one question: how do you tokenize documents and queries so that the index is useful?
For natural language search, the answer is straightforward: split on whitespace and punctuation, normalize case, maybe apply stemming. This is how Google, Elasticsearch, and every web search engine works.
For source code, the answer is much harder. Code does not have the same structure as prose. Identifiers can be CamelCase, snake_case, or SCREAMING_CASE. Punctuation is syntactically meaningful. A search for pthread_getname_np should match the string pthread_getname_np, but a word-based tokenizer might split it into pthread, getname, and np, losing the ability to match the full identifier.
GitHub learned this the hard way. Their original code search (2011-2020) used Elasticsearch with custom tokenizers that split on punctuation and extracted CamelCase/snake_case components. Searching for thread_getname could not find pthread_getname_np because the tokenizer had already destroyed the substring relationships. The feature had poor user satisfaction for years.
The solution, as we will see, is to not tokenize by words at all. Instead, tokenize by character subsequences: trigrams, quadgrams, or sparse n-grams. This preserves substring matching and enables regex support, at the cost of more complex index construction and query processing.
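The failure mode is easy to reproduce. The sketch below contrasts an illustrative word-splitting tokenizer with trigram decomposition on the same identifier:

```python
import re

identifier = "pthread_getname_np"

# Word-style tokenizer (splits on punctuation): substring info is destroyed.
word_tokens = re.split(r"[_\W]+", identifier)
print(word_tokens)                       # ['pthread', 'getname', 'np']
print("thread_getname" in word_tokens)   # False: the query can never match

# Trigram tokenization preserves every substring relationship.
def trigrams(s):
    return {s[i:i+3] for i in range(len(s) - 2)}

print(trigrams("thread_getname") <= trigrams(identifier))  # True
```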
5. Trigram Indexing: The 1993 Breakthrough
The idea of using character n-grams (specifically trigrams, sequences of 3 characters) for indexing text was first published in 1993 by Zobel, Moffat, and Sacks-Davis in their paper "Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files." But the concept reached most developers through Russ Cox's 2012 blog post explaining how Google Code Search worked.
Why Trigrams?
The choice of 3 characters is a balancing act between two competing pressures:
Bigrams (2 characters) produce very few distinct tokens. With printable ASCII, there are roughly 95^2 = 9,025 possible bigrams. This means the index has few keys, but each key's posting list is enormous, because almost every document contains common bigrams like th, he, in, er. The posting lists are too large to intersect efficiently, and the selectivity is too low to meaningfully reduce the search space.
Quadgrams (4 characters) produce many distinct tokens. With printable ASCII, there are 95^4 = approximately 81 million possible quadgrams. The posting lists are small and highly selective, which is great for queries. But storing 81 million keys in the index is expensive, and many of those keys will have posting lists containing only a single document. The storage overhead becomes impractical.
Trigrams (3 characters) sit in the middle. Roughly 95^3 = 857,375 possible trigrams. Enough distinct keys to be useful, few enough to store efficiently. Posting lists are manageable in size. This is the sweet spot that has stood the test of time.
Indexing Documents
Tokenizing a document into trigrams is simple: extract every overlapping sequence of 3 characters.
The string HELLO produces the trigrams: HEL, ELL, LLO.
The string MAX_FILE_SIZE produces: MAX, AX_, X_F, _FI, FIL, ILE, LE_, E_S, _SI, SIZ, IZE.
Each trigram becomes a key in the inverted index, mapping to the document ID.
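Extraction is essentially a one-liner:

```python
def trigrams(text):
    """Every overlapping 3-character window of the input."""
    return [text[i:i+3] for i in range(len(text) - 2)]

print(trigrams("HELLO"))          # ['HEL', 'ELL', 'LLO']
print(trigrams("MAX_FILE_SIZE"))  # 11 trigrams, 'MAX' through 'IZE'
```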
Querying with Regular Expressions
This is where the cleverness comes in. Given a regex, the system recursively analyzes its structure to extract the trigrams that any matching string must contain:
- Literal string abc: must contain the trigram abc. Simple.
- Concatenation abc + def: must contain the trigrams from both parts, plus the trigrams spanning the boundary (bcd, cde).
- Alternation abc|xyz: must contain trigrams common to both branches. Since a match could go either way, only trigrams present in ALL branches can be required.
- Repetition a*, a+: no trigrams can be required, because the repetition might match zero or one character, which is too short for a trigram.
- Character classes [abc]: a single character, too short for a trigram by itself.
The extracted trigrams become an AND query on the inverted index. Load all posting lists, intersect them, and the result is a set of candidate documents that could potentially match the regex. These candidates are then verified by actually running the regex on each file.
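The filter-then-verify flow can be sketched as follows (a toy index over literal queries; for a literal pattern, the required trigrams are just the pattern's own trigrams):

```python
import re

def trigrams(s):
    """The set of overlapping 3-character windows in s."""
    return {s[i:i+3] for i in range(len(s) - 2)}

def build_trigram_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for t in trigrams(text):
            index.setdefault(t, set()).add(doc_id)
    return index

docs = ["retry_count = 3", "count = len(items)", "hello world"]
index = build_trigram_index(docs)

# Filter: a match for the literal "count" must contain all of its trigrams.
required = trigrams("count")                 # {'cou', 'oun', 'unt'}
cands = set.intersection(*(index.get(t, set()) for t in required))

# Verify: run the real regex only on the surviving candidates.
hits = sorted(d for d in cands if re.search("count", docs[d]))
print(hits)  # [0, 1]
```

Document 2 is never read during verification; the index proved it cannot match.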
Performance
Russ Cox's benchmark on a 420MB Linux kernel tree (36,972 files):
- Query DATAKIT: trigrams narrow candidates from 36,972 files to 3 files.
- Query hello world: narrows to 25 files.
- Speed: 0.01 seconds indexed vs. 1.96 seconds brute force. Roughly a 200x speedup.
Index size: approximately 20% of the source corpus. For the 420MB kernel, the index is about 85MB.
Limitations
Trigram indexing works well for small to medium corpora. But at scale, two problems emerge:
- Common trigrams produce large posting lists. Trigrams like the, for, int, ret appear in virtually every source file. Their posting lists contain nearly every document in the index, providing zero selectivity. Intersecting a large posting list with a small one wastes time.
- Complex regex patterns decompose poorly. A pattern with many character classes, alternations, or repetitions might yield only one or two required trigrams, resulting in a candidate set that is barely smaller than the full corpus.
GitHub's experience confirmed both limitations. When they attempted to use trigram indexing for their 15+ billion document corpus, the posting lists for common trigrams were so large that intersecting them was slower than just scanning files directly. This motivated the development of more sophisticated approaches.
6. Suffix Arrays: The Alternative Path
In 2015, Nelson Elhage (a former Stripe engineer and co-founder of Ksplice) took a fundamentally different approach for his livegrep project, which provides real-time regex search over the Linux kernel source code.
How Suffix Arrays Work
A suffix array is conceptually simple: take a string, enumerate all of its suffixes, and sort them lexicographically. A suffix is everything from a given position to the end of the string.
For the string banana:
| Position | Suffix |
|---|---|
| 0 | banana |
| 1 | anana |
| 2 | nana |
| 3 | ana |
| 4 | na |
| 5 | a |
Sorted lexicographically:
| Sort Order | Position | Suffix |
|---|---|---|
| 0 | 5 | a |
| 1 | 3 | ana |
| 2 | 1 | anana |
| 3 | 0 | banana |
| 4 | 4 | na |
| 5 | 2 | nana |
The suffix array itself stores only the positions: [5, 3, 1, 0, 4, 2].
Why This Is Useful for Search
Because the suffixes are sorted, any substring search becomes a binary search. To find all occurrences of an in banana, binary search for the first suffix starting with an and the last suffix starting with an. Everything in between is a match.
- First suffix starting with an: position 3 (ana).
- Last suffix starting with an: position 1 (anana).
- Matches: sort orders 1 and 2, meaning an appears at positions 1 and 3 in the original string.
The search time is O(|P| * log n) where |P| is the pattern length and n is the string length. This is independent of how many matches exist.
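Both steps can be sketched in Python. This toy version materializes the suffix strings for clarity, which a real implementation would never do (it costs quadratic memory), and the "\xff" sentinel assumes the text contains no 0xFF byte:

```python
import bisect

def suffix_array(s):
    """Positions of all suffixes of s, sorted lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find(s, sa, pattern):
    """Binary search the sorted suffixes for every occurrence of pattern."""
    suffixes = [s[i:] for i in sa]  # materialized for clarity only
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\xff")  # end of the prefix range
    return sorted(sa[lo:hi])

s = "banana"
sa = suffix_array(s)
print(sa)                  # [5, 3, 1, 0, 4, 2]
print(find(s, sa, "an"))   # [1, 3]
```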
Handling Regex
Elhage's approach for regex on suffix arrays:
- Extract literal constraints from the regex (mandatory substrings, character ranges).
- Binary search the suffix array for each constraint, identifying candidate positions.
- Group nearby candidate positions.
- Run Google's RE2 regex engine (which guarantees linear-time matching with no catastrophic backtracking) over each group.
The Trade-offs
Advantages:
- Zero false positives for literal searches (unlike trigram indexes, which always produce some candidates that do not match).
- Handles any substring search naturally, not just trigram-decomposable patterns.
- Live demo at livegrep.com provides sub-second regex search over the entire Linux kernel.
Disadvantages:
- The entire corpus must be concatenated into a single string before building the array. For a large codebase, this means loading everything into memory.
- Storage is ~4x the corpus size (one uint32 per character position).
- Updating the index is extremely expensive: any change to any file requires rebuilding the suffix array over the whole corpus, which is at best linear in the corpus size even with the fastest construction algorithms.
- Sharding across multiple repositories is complex because the suffix array assumes a single contiguous string.
For years the fastest known suffix array construction implementation was libdivsufsort by Yuta Mori, which livegrep uses. In 2021, Ilya Grebnov's libsais showed a 65% average performance improvement over divsufsort on the Silesia corpus.
Suffix arrays are an elegant solution for static, single-corpus search (like a fixed snapshot of the Linux kernel). They are a poor fit for dynamic, multi-repository environments where files change constantly and the index must be updated in real time. This is why no major AI coding tool uses suffix arrays as its primary search backend.
7. Probabilistic Masks: Getting to "3.5-Grams"
GitHub's Project Blackbird team, working to build a regex-capable code search for 15+ billion documents, found that plain trigram indexes were not selective enough. They needed better filtering without the storage explosion of moving to quadgrams. Their solution: augment each posting list entry with probabilistic metadata.
The Core Idea
Instead of storing just a document ID in each posting list entry, store the document ID plus two 8-bit bloom filters:
- locMask (8 bits): Encodes the positions within the document where this trigram appears, modulo 8. If the trigram the appears at positions 0, 8, 16, and 24 in a document, all four positions hash to bit 0 (since 0 mod 8 = 0, 8 mod 8 = 0, etc.), and only bit 0 is set.
- nextMask (8 bits): A bloom filter over the characters that follow this trigram in the document. If the trigram the is followed by a space, n, r, and i in various locations within the document, the hashes of those four characters are OR'd into the mask.
Total overhead: 2 bytes per posting entry, on top of the document ID.
How It Helps
Approximate quadgram queries. When searching for the string them, we look up the trigram the and check each posting's nextMask to see if the bit for m is set. If not, the document cannot contain them, and we skip it. This effectively gives us 4-character selectivity using a 3-character index, because the bloom filter captures information about the fourth character.
Adjacency verification. When searching for a multi-trigram pattern, we need to verify that the trigrams actually appear next to each other in the document, not just somewhere in it. The locMask enables this: if the first trigram has bit 3 set in its locMask, the second trigram should have bit 4 set (the adjacent position). By shifting one mask and AND-ing with the other, we can probabilistically verify adjacency.
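A toy sketch of how the two masks might be built and consulted. Mod-8 "hashing" here is an illustrative stand-in for whatever hash function a production system would use:

```python
def masks_for_trigram(text, tri):
    """Build the two 8-bit filters for one trigram in one document."""
    loc_mask = next_mask = 0
    for i in range(len(text) - 2):
        if text[i:i+3] == tri:
            loc_mask |= 1 << (i % 8)                      # position modulo 8
            if i + 3 < len(text):
                next_mask |= 1 << (ord(text[i + 3]) % 8)  # following character
    return loc_mask, next_mask

doc = "the theme of the thesis"
loc, nxt = masks_for_trigram(doc, "the")

# Approximate quadgram query: could "them" appear in this document?
print(bool(nxt & (1 << (ord("m") % 8))))  # True  -> candidate, verify by scanning
print(bool(nxt & (1 << (ord("z") % 8))))  # False -> "thez" cannot be present, skip
```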
The Saturation Problem
This is where bloom filters bite back. A bloom filter with 8 bits can only encode a limited amount of information before all bits are set to 1 (saturated). Once saturated, the filter matches everything, providing no selectivity at all.
For common trigrams like the, for, or int, which appear hundreds or thousands of times in a single large source file, both the locMask and nextMask saturate almost immediately. The 8-bit filter has all bits set after ~6-8 distinct values are hashed into it. At that point, the augmented posting entry is no better than a plain document ID.
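The effect is easy to demonstrate with the same toy mod-8 hash: eight distinct following characters can be enough to set every bit of an 8-bit filter:

```python
def saturation_point(following_chars):
    """OR each following character into an 8-bit filter; report how many
    distinct values it takes before every bit is set (toy mod-8 hash)."""
    mask = 0
    for seen, ch in enumerate(following_chars, start=1):
        mask |= 1 << (ord(ch) % 8)
        if mask == 0xFF:
            return seen   # saturated: the filter now matches everything
    return None

# Characters plausibly following a very common trigram like "the"
print(saturation_point(" nrimsdo"))  # 8
```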
GitHub's Blackbird team tested this approach extensively and concluded that probabilistic masks "saturate too quickly to be useful" at their scale. This finding pushed them toward the sparse n-gram approach that ultimately shipped in the new GitHub Code Search.
The lesson is generalizable: bloom filters work well when the number of distinct values per filter is small (say, 2-4). They degrade rapidly as that number grows. In code search, where a single trigram can appear thousands of times in one file with many different following characters, the conditions for effective bloom filter usage simply do not hold.
8. Sparse N-Grams: The Current State of the Art
Sparse n-grams represent the most advanced approach to text indexing for regex search currently deployed in production. The algorithm is used by GitHub Code Search (processing 15+ billion documents with sub-100ms per-shard response times) and by ClickHouse for its full-text index. Cursor's March 2026 research describes building their client-side index using the same foundational approach.
The Key Insight
Traditional trigram indexing extracts every overlapping 3-character sequence from a document. This creates massive redundancy: the characters in every trigram overlap with the adjacent ones. The string HELLO produces HEL, ELL, LLO, where EL and LL are each counted twice.
Sparse n-grams take a different approach: instead of extracting fixed-size chunks at every position, extract variable-length chunks at positions determined by a weight function. The chunks can be 2 characters, 3 characters, 5 characters, or any other length. What matters is that the selection is deterministic: the same input always produces the same set of n-grams.
The Algorithm
- Define a weight function that assigns a numerical weight to every pair of adjacent characters. The simplest choice is CRC32 over the two characters. A smarter choice (used in production) is a frequency table derived from analyzing a large corpus of open-source code, where rare character pairs get high weights and common character pairs get low weights.
- Identify boundary positions. For a string c_0 c_1 c_2 ... c_n, compute the pair weight w(c_i, c_{i+1}) for every adjacent pair. A position i is a boundary if its pair weight is a local maximum: higher than all pair weights between the previous boundary and the next boundary.
- Extract n-grams between boundaries. Each sparse n-gram spans from one boundary to the next. Because the boundaries are determined by the weight function, the n-grams have variable lengths.
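A rough sketch of the boundary-picking idea, with CRC32 pair weights and a simplified local-maximum test over immediate neighbors (the production algorithm uses the stronger between-boundaries condition described above):

```python
import zlib

def pair_weight(a, b):
    """Toy weight: CRC32 of the two characters. Production systems use a
    corpus-derived frequency table so rare pairs get high weights."""
    return zlib.crc32((a + b).encode())

def sparse_ngrams(text):
    """Chunk text at local-maximum pair weights (simplified boundary rule)."""
    if len(text) < 3:
        return [text]
    w = [pair_weight(text[i], text[i + 1]) for i in range(len(text) - 1)]
    bounds = [0]
    for i in range(1, len(w) - 1):
        if w[i] > w[i - 1] and w[i] > w[i + 1]:  # local maximum
            bounds.append(i + 1)
    bounds.append(len(text))
    return [text[bounds[j]:bounds[j + 1]] for j in range(len(bounds) - 1)]

grams = sparse_ngrams("max_file_size")
print(grams)  # variable-length chunks, deterministic for a given input
```

The key property to notice is determinism: the same input always yields the same chunks, so index-time and query-time extraction agree.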
Two Modes of Operation
build_all (indexing time): Extract every possible sparse n-gram from the document and add all of them to the index. This is more expensive than trigram extraction because it produces more tokens, but each token is more selective (rarer).
build_covering (query time): Given a query string, compute only the minimum set of sparse n-grams whose presence in the index is sufficient to verify that the query could match. This is typically much fewer n-grams than trigram decomposition would produce.
Why the Frequency-Based Weight Function Matters
This is where the algorithm gets genuinely clever.
If you use CRC32 as the weight function, the n-gram boundaries are essentially random. This provides some improvement over fixed trigrams, because the variable-length n-grams are more distinctive than trigrams, but the improvement is moderate.
If you use a frequency table (where rare character pairs get high weights), the boundaries concentrate around unusual character transitions. This means the n-grams straddle the rare parts of the text. Common substrings like the, for, or return get absorbed into the interior of longer n-grams, rather than becoming n-grams themselves.
The practical effect: the posting lists for sparse n-grams are much smaller than those for trigrams, because the n-grams are selected to be rare. And at query time, the covering algorithm finds the minimum set of rare n-grams to look up, resulting in very small candidate sets.
Why Sparse N-Grams Beat Trigrams at Scale
| Property | Trigrams | Sparse N-grams |
|---|---|---|
| Token size | Fixed (3 chars) | Variable (2-8+ chars) |
| Tokens per document | ~n-2 for doc of length n | Fewer, but more selective |
| Selectivity | Low for common trigrams | High (rare char pairs become boundaries) |
| Posting list size | Large for common trigrams | Consistently small |
| Query tokens | Many (one per 3-char window) | Few (covering algorithm) |
| Index build cost | O(n) per document | O(n) per document but more tokens |
| False positive rate | Higher | Lower |
The open-source implementation is available at danlark1/sparse_ngrams on GitHub, described as the "search index algorithm for GitHub code search."
Production Numbers
GitHub Code Search, powered by sparse n-grams:
- Scope: ~45 million repositories, 15.5 billion documents
- Index size: 25 TB for 115 TB of source content (~22% of corpus size, including all n-gram indexes and compressed content)
- Ingest rate: 120,000 documents per second
- Re-indexing time: ~18 hours for the entire corpus using delta indexing
- Per-shard P99 latency: ~100ms
- Throughput: ~640 queries/second on 64-core hosts
9. Client-Side vs. Server-Side: Where Should the Index Live?
One of the most interesting aspects of Cursor's approach is their decision to build and query these indexes locally on the user's machine, rather than on a server. This goes against the trend in the industry, where most code intelligence features have moved to the cloud. The reasoning is worth examining.
Why Cursor Chose Client-Side
Latency. Cursor's Composer model has one of the fastest tokens-per-second rates in the industry. Adding network round-trips for search, which the agent invokes constantly and often in parallel, would add friction that undermines the speed advantage of the model. A local index lookup involves reading from an mmap'd file (approximately 3.3 nanoseconds for a warm lookup vs. 416 nanoseconds for a disk seek+read). A server round-trip involves at minimum 10-50 milliseconds of network latency, thousands of times slower.
Freshness. This is perhaps the most compelling argument. When an agent writes code and then searches for what it just wrote, the search must find it immediately. Server-side indexes have inherent synchronization delays. The agent writes a file, the file needs to be synced to the server, the server needs to re-index it, and only then can the search find it. Any delay in this pipeline means the agent might search for code it just wrote and not find it, leading to confused behavior, wasted tokens, and degraded output quality.
Cursor solves freshness by keying the index to a Git commit hash and storing agent/user changes as overlay layers on top. When the agent modifies a file, the change appears immediately in the local layer without re-indexing the entire repository.
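The layering idea can be sketched as follows. This is a hypothetical wrapper, not Cursor's implementation, and literal search stands in for full regex search:

```python
class OverlayIndex:
    """Base index pinned to a Git commit, plus an in-memory overlay of
    files edited since that commit (hypothetical sketch)."""

    def __init__(self, base_search):
        self.base_search = base_search  # indexed search at the pinned commit
        self.overlay = {}               # path -> contents edited since then

    def write(self, path, contents):
        self.overlay[path] = contents   # visible to search() immediately

    def search(self, literal):
        # Edited files are answered from the overlay, everything else from
        # the prebuilt index; nothing is re-indexed on the write path.
        hits = {p for p in self.base_search(literal) if p not in self.overlay}
        hits |= {p for p, text in self.overlay.items() if literal in text}
        return hits

snapshot = {"a.py": "def handler(): pass"}          # contents at the commit
idx = OverlayIndex(lambda lit: {p for p, t in snapshot.items() if lit in t})
idx.write("a.py", "def new_handler(): pass")        # the agent edits a file
print(idx.search("new_handler"))  # {'a.py'}: found instantly, no re-index
```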
Privacy. Enterprise customers are often reluctant to send source code to external servers for indexing. Local indexes avoid this concern entirely. The code never leaves the machine.
No file synchronization needed. Server-side regex search requires either (a) syncing all files to the server so it can verify candidates against the actual text, or (b) performing expensive round-trips back and forth between client and server to fetch file contents during verification. Both options are expensive. With a local index, the files are already on disk, and verification is trivial.
How Cursor's Local Index Is Structured
Cursor uses two files:
Postings file (on disk): Contains all posting lists written sequentially during index construction. Each posting list is a sorted array of file IDs for documents that contain a particular n-gram.
Lookup table (memory-mapped): Contains a sorted array of (hash, offset) pairs. The hash is computed from the n-gram (not the full n-gram string, just its hash, saving space). The offset points into the postings file.
Only the lookup table is mmap'd into the editor process. To query, binary search the lookup table for the n-gram hash, read the offset, and fetch the posting list from the postings file at that offset. This design keeps memory usage minimal because only the compact lookup table lives in RAM, while the larger postings data stays on disk.
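A minimal in-memory sketch of the two-file layout. CRC32 stands in for whatever hash Cursor actually uses, and a per-entry count is stored so the posting list length is known; the real on-disk format is not public:

```python
import bisect
import struct
import zlib

# Build phase: posting lists are written sequentially into one buffer; the
# lookup table holds sorted (hash, offset, count) entries.
ngrams = {"max_": [0, 2], "file": [1, 2], "size": [2]}
postings = bytearray()
table = []
for gram, docs in ngrams.items():
    table.append((zlib.crc32(gram.encode()), len(postings), len(docs)))
    for d in docs:
        postings += struct.pack("<I", d)  # 4-byte little-endian file IDs
table.sort()
hashes = [h for h, _, _ in table]  # only this compact array must stay in RAM

def lookup(gram):
    """Binary search the lookup table, then read the posting list."""
    h = zlib.crc32(gram.encode())
    i = bisect.bisect_left(hashes, h)
    if i == len(hashes) or hashes[i] != h:
        return []
    _, offset, count = table[i]
    return list(struct.unpack_from(f"<{count}I", postings, offset))

print(lookup("file"))  # [1, 2]
```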
Note on hash collisions: Storing hashes instead of full n-gram strings means hash collisions are theoretically possible. If two different n-grams hash to the same value, their posting lists would be merged, making the combined list broader (more candidate documents). This can only produce false positives, never false negatives, which is always safe because candidates are verified against the actual text anyway. In practice, with a good hash function and 32-bit or 64-bit hashes, collisions are vanishingly rare.
The Trade-offs of Client-Side Indexing
| Aspect | Client-Side (Cursor) | Server-Side (GitHub, Sourcegraph) |
|---|---|---|
| Latency | Nanoseconds (mmap) | Milliseconds (network) |
| Freshness | Instant (overlay layers) | Delayed (sync + re-index) |
| Privacy | Code stays local | Code sent to server |
| Scale | Limited by user's machine | Scales with server cluster |
| Cross-repo search | Not practical | Natural (all repos indexed) |
| Startup cost | Must build index locally | Index pre-built on server |
| Disk usage | ~20-25% of repo size | Centralized storage |
| Multi-user sharing | Not possible | Index shared across team |
The client-side approach is optimal for the single-user, single-repo use case that Cursor primarily serves. For multi-repo search (like GitHub Code Search or Sourcegraph), server-side is the only practical option.
10. How Every Major AI Coding Tool Handles Search
The search architecture choices across the industry reveal a surprising amount of variation. No two tools have converged on the same approach.
Cursor
Approach: Hybrid. ripgrep for unindexed search, semantic vector search for conceptual queries, and (as of March 2026) local sparse n-gram indexes for accelerated regex search.
Semantic search: Server-side embedding index. Cursor published research (November 2025) showing that adding semantic search improved agent accuracy by 12.5% on average (6.5-23.5% depending on the model). The key finding: semantic search and grep together outperform either alone.
Regex indexing: Client-side sparse n-gram index keyed to Git commits with overlay layers for uncommitted changes. Two-file architecture (lookup table mmap'd, postings on disk).
Result: The most sophisticated search architecture of any AI coding tool, combining three complementary approaches.
Claude Code (Anthropic)
Approach: grep and glob only. No built-in semantic or n-gram index.
Claude Code uses ripgrep as its search backend and file globbing for navigation. There is no built-in vector search or pre-computed index of any kind. This is a purely "grep-first" architecture.
Trade-off: Simpler architecture, zero indexing overhead, works on any codebase immediately. But token-expensive on large codebases because the agent must search broadly to find relevant context, and every search scans every file.
Third-party MCP extensions exist to add semantic search to Claude Code (e.g., zilliztech/claude-context, which implements hybrid BM25 + dense vector search and reportedly reduces token usage by ~40% at equivalent retrieval quality).
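Hybrid BM25 + dense retrieval of the kind mentioned above typically fuses two score lists into one ranking. A generic sketch, not the scheme any particular extension uses, is normalized weighted-sum fusion:

```python
def hybrid_rank(bm25_scores, dense_scores, alpha=0.5):
    """Fuse lexical (BM25) and dense-vector scores by normalized
    weighted sum. Both inputs map doc_id -> raw score; alpha weights
    the dense side. A generic fusion sketch for illustration only."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on a single score
        return {d: (s - lo) / span for d, s in scores.items()}

    b, v = normalize(bm25_scores), normalize(dense_scores)
    docs = set(b) | set(v)
    fused = {d: (1 - alpha) * b.get(d, 0.0) + alpha * v.get(d, 0.0)
             for d in docs}
    return sorted(docs, key=fused.get, reverse=True)
```

Normalization matters because BM25 and cosine similarity live on incompatible scales; without it, one retriever silently dominates the other.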
GitHub Copilot
Approach: Hybrid with server-side search. Uses GitHub's own Blackbird code search infrastructure (sparse n-gram indexes) for repository-level context, plus local VS Code workspace indexing for editor-level context.
Agent mode: Boots a virtual machine, clones the repository, and uses "advanced RAG powered by GitHub code search." This means the agent has access to the most sophisticated code search infrastructure available (the same system that indexes 15.5 billion documents for github.com).
Advantage: No other tool has access to a comparable search backend. The Blackbird infrastructure took years and a dedicated team to build.
Windsurf (Codeium)
Approach: AST-based semantic indexing.
Windsurf indexes code by parsing it into Abstract Syntax Trees and creating embedding chunks at the entity level (functions, methods, classes) rather than arbitrary file/line boundaries. Embeddings are computed server-side (code is sent to Codeium's servers) and stored in a custom vector store on the user's machine.
Retrieval: Uses a technique called M-Query for context retrieval. Teams/Enterprise tiers get remote repository indexing.
Trade-off: Better semantic understanding of code structure, but requires sending code to servers for embedding computation. No regex indexing.
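The entity-level chunking idea, splitting on syntactic units rather than arbitrary line windows, can be sketched with Python's own `ast` module. This illustrates the general technique, not Windsurf's parser or chunk schema:

```python
import ast

def entity_chunks(source: str):
    """Split a Python module into entity-level chunks (functions,
    methods, classes) suitable for embedding, rather than fixed-size
    line windows. A sketch of the idea, not Windsurf's implementation."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,  # available since Python 3.8
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk now carries a meaningful boundary (a whole function or class), so its embedding represents one coherent concept instead of a window that slices a function in half.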
Augment Code
Approach: Quantized approximate nearest neighbor (ANN) search, optimized for large codebases.
Augment published detailed research on their approach: a two-phase search that first uses quantized bit-vector representations (compressing large embedding vectors into small bit vectors) for initial candidate retrieval, then runs full embedding similarity on the filtered candidates.
Performance numbers:
- 8x memory reduction (2GB to 250MB for 100M lines of code)
- Latency from 2+ seconds to under 200ms
- 99.9% accuracy parity with exact search
- 40% latency improvement for code completions
Augment explicitly found that for smaller repositories with distinctive identifiers, grep alone was sufficient and embedding-based retrieval did not improve performance. The gains from semantic search appeared primarily in large repos and conceptual queries.
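The two-phase scheme Augment describes can be sketched in a few lines: compress each embedding into a sign-bit vector, shortlist by cheap Hamming distance, then rerank only the shortlist with exact cosine similarity. This is an illustrative toy (pure Python, tiny vectors), not Augment's production system:

```python
import math

def binarize(vec):
    """Compress a float vector into a bit vector: one sign bit per dimension."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_phase_search(query, corpus, shortlist=10, k=3):
    """Phase 1: cheap Hamming distance over bit vectors picks a shortlist.
    Phase 2: exact cosine similarity reranks only the shortlist."""
    qbits = binarize(query)
    sigs = sorted((hamming(qbits, binarize(v)), i) for i, v in enumerate(corpus))
    candidates = [i for _, i in sigs[:shortlist]]
    reranked = sorted(candidates, key=lambda i: cosine(query, corpus[i]),
                      reverse=True)
    return reranked[:k]
```

The memory win comes from phase 1: a 1,024-dimension float32 embedding is 4KB, while its sign-bit signature is 128 bytes, and XOR-plus-popcount is far cheaper than a dot product.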
Sourcegraph Cody
Approach: Hybrid dense-sparse vector retrieval, powered by Zoekt.
Sourcegraph's self-hosted instances use the Zoekt search engine (positional trigram indexes) for regex search, combined with semantic embeddings for conceptual queries. The architecture is "search-first," meaning semantic search runs before the model generates a response.
Zoekt stores byte offsets for each trigram occurrence, enabling exact adjacency verification (confirming two trigrams appear at the correct distance apart). This produces fewer false positives than position-free trigram indexes.
Performance: Sub-50ms on large codebases like Android (~2GB of text). RAM requirement: ~1.2x corpus size for positional trigrams on SSD. Zoekt shard files are memory-mapped.
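The adjacency check is easy to demonstrate. In a positional index, each trigram maps to a list of byte offsets; for a query like `search`, an offset of the first trigram `sea` is a candidate only if the last trigram `rch` occurs exactly `len(query) - 3` bytes later. A sketch of the idea (not Zoekt's shard format):

```python
from collections import defaultdict

def build_positional_trigrams(text: str):
    """Index every trigram together with its byte offset, as a
    positional trigram index does."""
    index = defaultdict(list)
    for i in range(len(text) - 2):
        index[text[i:i + 3]].append(i)
    return index

def candidate_offsets(index, query: str):
    """Keep only offsets where the first and last trigram of the query
    sit exactly len(query) - 3 bytes apart. Survivors still need one
    final confirmation read, but false positives are rare."""
    if len(query) < 3:
        return None  # too short for the trigram index; fall back to a scan
    first, last = query[:3], query[-3:]
    gap = len(query) - 3
    last_offsets = set(index.get(last, []))
    return [o for o in index.get(first, []) if o + gap in last_offsets]
```

A position-free trigram index would report any document containing both `sea` and `rch`, even far apart; the byte-offset check eliminates those false positives before any file is read.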
Sourcegraph also provides multi-source context integration via OpenCtx (Jira, Linear, Notion, Google Docs) and MCP tools.
Cognition (Devin)
Approach: Purpose-trained retrieval models.
Cognition took a fundamentally different approach: rather than building search infrastructure, they trained specialized models for context retrieval. Their SWE-grep and SWE-grep-mini models (released November 2025) are trained with multi-turn reinforcement learning specifically for fast, parallel context retrieval.
Performance: Execute up to 8 parallel tool calls per turn, maximum 4 turns. Up to 20x faster than competing approaches for context retrieval. SWE-grep-mini runs inference at 2,800+ tokens/second (via Cerebras hardware). The models use only grep, read, and glob tools, but deploy them more strategically than a general-purpose LLM would.
Key insight: The models were trained on the observation that agents spend >60% of the first turn on context retrieval. The reward function optimizes for weighted F1 scores against ground truth file and line retrieval.
SWE-grep powers Windsurf IDE's "Fast Context" feature.
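A reward of the shape Cognition describes can be illustrated with a plain F1 score over retrieved versus ground-truth file sets. The exact weighting across files and lines is not public, so this is the unweighted version for intuition only:

```python
def retrieval_f1(predicted: set, ground_truth: set) -> float:
    """F1 between the set of files an agent retrieved and the ground-truth
    set it should have retrieved. Cognition's reward is a weighted variant
    over files and lines; this sketch shows the unweighted core."""
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)   # penalizes retrieving junk
    recall = tp / len(ground_truth)   # penalizes missing needed files
    return 2 * precision * recall / (precision + recall)
```

Using F1 rather than recall alone is what pushes the trained model toward targeted retrieval: grabbing the whole repository would score perfect recall but terrible precision.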
Comparison Table
| Tool | Regex Search | Semantic Search | Index Location | Indexing Approach |
|---|---|---|---|---|
| Cursor | Sparse n-gram index + ripgrep | Server-side embeddings | Client (n-gram), Server (semantic) | Sparse n-grams, overlay layers |
| Claude Code | ripgrep (no index) | None built-in | N/A | No indexing |
| GitHub Copilot | Blackbird (sparse n-grams) | Server-side | Server | Sparse n-grams, 15.5B docs |
| Windsurf | ripgrep | AST-based embeddings | Client (vectors), Server (compute) | Entity-level chunking |
| Augment Code | ripgrep | Quantized ANN search | Server | Bit-vector quantization |
| Sourcegraph Cody | Zoekt (positional trigrams) | Dense-sparse hybrid | Server (self-hosted or cloud) | Positional trigrams + embeddings |
| Devin (Cognition) | RL-trained retrieval models | Implicit (trained into model) | Model weights | Multi-turn RL optimization |
11. The Relationship Between Search Latency and Agent Quality
This is the point that ties everything together: faster search does not just save time. It makes the agent produce better code.
The Compound Effect
An AI coding agent does not search once. A typical task involves 5-20 search operations as the agent explores the codebase, finds related code, checks for existing implementations, and verifies its assumptions. If each search takes 15 seconds (which Cursor reports seeing in enterprise monorepos), a 10-search workflow spends 2.5 minutes just waiting for search results. With instant search (sub-100ms per query), those same ten searches cost about a second in total.
But the real impact is not just wall-clock time. It is about what happens inside the agent's reasoning loop.
Search Latency Shapes Agent Behavior
When search is slow, agents learn (or are prompted) to minimize search operations. They make broader, less precise queries. They guess instead of verifying. They proceed with incomplete context rather than doing another search. Every additional search call has a high cost, so the agent's behavior shifts toward fewer, larger searches rather than many targeted ones.
When search is fast, the opposite happens. The agent can afford to search frequently, refine its queries, and verify its assumptions. It can search for a function name, read the file, search for all callers of that function, read those files, and then search for the test files, all in the time it would have taken for a single slow search. The result is better context and better code.
Evidence
Cursor's research blog shows side-by-side timelines of agent workflows with "Normal Grep" versus "Instant Grep." The agent performing bug investigation with instant search completes in roughly half the time, despite performing more search operations. The key observation: the agent with fast search does more searching and produces better results, because the cost of each incremental search is negligible.
Augment Code reported a 40% latency improvement for code completions after optimizing their search infrastructure. Their quantized ANN search reduced per-query latency from 2+ seconds to under 200ms, and the quality improvement was measurable in completion accuracy.
Cognition's SWE-grep models, achieving up to 20x faster context retrieval, directly improved performance on SWE-Bench: better context retrieval led to better bug fixes, not just faster bug fixes.
The Implication
Search infrastructure is not a commodity layer that sits underneath the "real" AI work. It is a core determinant of agent quality. Investing in search performance, whether through n-gram indexes, trained retrieval models, or hybrid approaches, pays compound returns in the quality of every agent action that depends on context.
This is why Cursor is building client-side n-gram indexes despite having a working semantic search system. This is why Cognition trained purpose-built retrieval models despite having access to general-purpose LLMs. And this is why the field of code search, which might have seemed like a solved problem after GitHub Code Search shipped in 2023, is experiencing a renaissance driven by the demands of AI agents.
12. What Comes Next
The current approaches, while effective, have clear room for improvement. Several directions are worth watching.
Hybrid indexes that combine n-gram and semantic search in a single data structure. Today, these are separate systems that are queried independently and their results merged. A unified index that stores both token-level and embedding-level information could reduce storage overhead and enable queries that blend exact matching with conceptual similarity.
Learned indexes. The sparse n-gram approach uses a frequency table derived from open-source code as its weight function. This is a hand-crafted heuristic. A machine-learned weight function, trained to minimize false positive rates for real query distributions, could produce more selective n-grams. The research is early but promising, building on the broader "learned index structures" work that has shown improvements over B-trees and hash maps in database systems.
Index-aware models. Cognition's SWE-grep shows that training a model specifically for search is valuable. The next step might be models that are aware of the index structure itself: a model that knows which n-grams are in the index and can formulate queries that maximally exploit the index's selectivity, rather than defaulting to whatever grep pattern seems natural.
Incremental, streaming indexes. Current approaches build an index at a point in time (typically a Git commit) and maintain it with overlay layers for changes. A truly streaming index that updates in real time as the agent writes code, without any perceptible delay or index staleness, would eliminate the last gap between "search what the agent wrote a second ago" and "search the entire repository."
Cross-repository indexing on the client. Today, client-side indexes cover only the current repository. Enterprise developers often need context from multiple repositories (shared libraries, microservice dependencies, documentation repos). Building cross-repo indexes locally, possibly with selective syncing based on dependency graphs, could significantly improve agent context quality for enterprise use cases.
The underlying trend is clear: as AI coding agents become more capable, the quality of their context retrieval becomes the binding constraint on their performance. The models are getting smarter. The search needs to keep up.