qa-over-large-code – Techniques and frameworks for retrieval-based question answering and code completion across large software repositories that exceed typical LLM context windows.
Modern large language models (LLMs) are limited by fixed context windows (from 4K to ~100K tokens). Real-world codebases, however, can span tens or hundreds of thousands of lines. This blog post covers how to overcome these limitations using retrieval-augmented generation (RAG), hierarchical retrieval, repository-level summarization, indexing and chunking strategies, code-aware embeddings, and lessons from real-world systems.
Throughout, we highlight trade-offs in coverage, precision, language support, and performance.
LLMs cannot simply “read” an entire repository into a single prompt because of token limits. As codebases scale, even a 100K-token model might not suffice. Hence, we rely on retrieval or summarization to selectively provide only the relevant parts. This concept is sometimes called RAG (Retrieval-Augmented Generation) when the LLM consults external knowledge sources in real time.
RAG frameworks augment an LLM by retrieving relevant documents (code snippets, docs, comments) and injecting them into the prompt. For code tasks, this means pulling in the functions, documentation, or comments most relevant to the user's question or to the code being completed.
By grounding the answer or completion in actual repository data, the model reduces hallucinations and provides more accurate results. Research like REDCODER has shown substantial gains in code generation and summarization through RAG.
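To make the pattern concrete, here is a minimal sketch of that loop for code QA. The `embed` function, the vector index, and the `call_llm` client are placeholders for whatever encoder, store, and model you use; only the retrieve-then-prompt structure matters.

```python
# Minimal RAG loop for code QA. `embed`, `vector_index`, and `call_llm`
# are placeholders; retrieved hits are assumed to expose .path,
# .start_line, .end_line, and .text.

def answer_question(question: str, embed, vector_index, call_llm, k: int = 5) -> str:
    query_vec = embed(question)                      # embed the natural-language question
    hits = vector_index.search(query_vec, top_k=k)   # retrieve the k most similar code chunks
    context = "\n\n".join(
        f"# {hit.path}:{hit.start_line}-{hit.end_line}\n{hit.text}" for hit in hits
    )
    prompt = (
        "Answer the question using only the repository snippets below.\n"
        f"--- SNIPPETS ---\n{context}\n--- QUESTION ---\n{question}\n"
        "Cite the file paths you relied on."
    )
    return call_llm(prompt)
```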
Single-step retrieval from a massive codebase may still produce large or off-target chunks. Hierarchical retrieval addresses this by searching in layers: first narrowing the scope to likely modules or files, then drilling down to the specific functions or lines that answer the query.
Multi-hop questions—those requiring references from multiple parts of the code—can be answered by iteratively retrieving new context based on partial answers (similar to multi-hop QA in NLP). Tools like Sourcegraph Cody often combine multiple search methods (keyword and semantic) to ensure both precision and recall.
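A coarse-to-fine retrieval step might look like the sketch below, assuming one index over file-level summaries and a per-file index over function-level chunks; both are hypothetical stores whose results expose `.path` and `.score`.

```python
# Two-stage (coarse-to-fine) retrieval sketch. `file_index` holds one
# embedding per file summary; `chunk_index[path]` holds embeddings for
# that file's function-level chunks. Both are assumed vector stores.

def hierarchical_retrieve(query_vec, file_index, chunk_index,
                          n_files: int = 3, n_chunks: int = 4):
    # Stage 1: narrow the search space to a few likely files.
    candidate_files = file_index.search(query_vec, top_k=n_files)

    # Stage 2: rank fine-grained chunks only within those files.
    results = []
    for f in candidate_files:
        results.extend(chunk_index[f.path].search(query_vec, top_k=n_chunks))
    return sorted(results, key=lambda r: r.score, reverse=True)[:n_chunks]
```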
Another way to squeeze repository knowledge into an LLM’s context is summarization: condensing functions, files, and modules into short descriptions that occupy far fewer tokens than the raw code.
A hierarchical summarization pipeline can bubble up details from lower-level units to produce a repository-wide overview. This helps the model quickly locate relevant parts without retrieving raw code. However, summaries risk omitting important details (e.g., exact regex patterns). Tools must balance high-level context with lower-level code retrieval when the user’s query requires specifics.
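One way to build such a pipeline is sketched below, assuming a generic `summarize` LLM call and a simple `{path: [function bodies]}` view of the repository; the prompt wording is illustrative only.

```python
# Bottom-up summarization sketch: function -> file -> repository.
# `summarize` stands in for any LLM call that condenses text.

def summarize_repository(repo: dict[str, list[str]], summarize) -> str:
    """repo maps file paths to lists of function bodies."""
    file_summaries = {}
    for path, functions in repo.items():
        func_summaries = [summarize(f"Summarize this function:\n{src}") for src in functions]
        file_summaries[path] = summarize(
            f"Summarize file {path} from its function summaries:\n" + "\n".join(func_summaries)
        )
    # Roll file-level summaries up into one repository-wide overview.
    return summarize(
        "Give a high-level overview of this repository:\n"
        + "\n".join(f"{p}: {s}" for p, s in file_summaries.items())
    )
```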
To implement RAG or summarization effectively, you need an index over the repository, whether a classic keyword (lexical) index, a vector store of embeddings, or a combination of both.
Many real-world solutions use a hybrid approach (keyword + embeddings) for the best coverage. Sourcegraph’s early attempts at code embeddings revealed scaling issues for very large monorepos, while Amazon CodeWhisperer’s Repoformer sometimes prefers a simple token-overlap measure (Jaccard) for retrieval. Ultimately, the best approach depends on codebase size, language diversity, and the environment’s performance constraints.
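A hybrid ranker can be as simple as a weighted sum of an identifier-overlap score and embedding cosine similarity. The sketch below assumes a generic `embed` function returning NumPy vectors and an arbitrary 50/50 weighting that you would tune for your codebase.

```python
# Hybrid retrieval sketch: lexical (shared identifiers) + semantic
# (embedding cosine) scoring. `embed` is any code/text encoder that
# returns fixed-size 1-D NumPy vectors.
import re
import numpy as np

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))

def hybrid_rank(query: str, chunks: list[str], embed, top_k: int = 5) -> list[str]:
    q_vec = embed(query)
    c_vecs = np.stack([embed(c) for c in chunks])
    cosine = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)

    q_toks = tokens(query)
    lexical = np.array([len(q_toks & tokens(c)) / (len(q_toks) or 1) for c in chunks])

    score = 0.5 * cosine + 0.5 * lexical          # arbitrary weighting; tune per codebase
    return [chunks[i] for i in np.argsort(-score)[:top_k]]
```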
Code-aware embedding models (CodeBERT, GraphCodeBERT, CodeT5, UniXCoder) go beyond plain-text embeddings by learning code syntax and semantics. They can recognize function calls, dataflow, or cross-language similarities, so retrieval matches on meaning rather than on surface tokens alone.
The downside is increased complexity (and compute overhead) plus the need to keep embeddings up to date as code changes. Some industrial teams find simpler solutions (like token overlap) surprisingly competitive.
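For reference, embedding a chunk with one of these encoders takes only a few lines with the Hugging Face `transformers` library. The sketch below uses the `microsoft/codebert-base` checkpoint and mean-pools the final hidden states, which is one common pooling choice rather than the only one.

```python
# Embedding a code chunk with a code-aware encoder (CodeBERT via the
# Hugging Face `transformers` library); any similar encoder would work.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed_code(snippet: str) -> torch.Tensor:
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)      # mean-pooled vector, shape (1, 768)
```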
Sourcegraph Cody: Integrates with Sourcegraph’s fast code search (keyword-based) to fetch relevant snippets, then feeds them to an LLM. They experimented with embeddings but found them costly at large scale. Cody exemplifies multi-stage retrieval combining lexical and (sometimes) semantic signals.
Pieces Copilot: Uses a purely local approach for privacy: segments code heuristically, embeds each segment, stores vectors, and retrieves top matches for QA or completions. Summaries of chat history allow multi-turn Q&A.
Amazon CodeWhisperer (Repoformer): Shows “selective retrieval” for code completion. The model decides if it even needs external context, thereby optimizing for latency. Simple token-overlap matching often outperformed advanced embedding methods in their tests; a quick sketch of the idea follows this list of systems.
Open-Source Frameworks: The same pattern can be assembled from open-source building blocks (e.g., LangChain or LlamaIndex for chunking, embedding, and retrieval orchestration, plus a vector store such as FAISS) wired to the LLM of your choice.
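To make the token-overlap idea concrete, the sketch below computes Jaccard similarity between the last few lines of the file being edited and candidate cross-file chunks, and skips retrieval entirely when nothing overlaps enough. The fixed threshold is only a stand-in for Repoformer's learned decision about when to retrieve.

```python
# Jaccard token-overlap retrieval for code completion, with a simple
# threshold gate standing in for "selective retrieval".
import re

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / (len(a | b) or 1)

def maybe_retrieve(prefix: str, chunks: list[str], threshold: float = 0.2):
    # Use the last 20 lines of the in-progress file as the query.
    query = set(re.findall(r"\w+", "\n".join(prefix.splitlines()[-20:])))
    scored = sorted(((jaccard(query, set(re.findall(r"\w+", c))), c) for c in chunks),
                    reverse=True)
    best_score, best_chunk = scored[0] if scored else (0.0, None)
    # Skip cross-file context entirely when nothing overlaps enough.
    return best_chunk if best_score >= threshold else None
```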
Granularity vs. Token Budget: Function-level or class-level chunking is a common sweet spot. Overly large chunks risk blowing up the context window; tiny chunks may lose context (a minimal AST-based chunker is sketched after these trade-offs).
Summaries vs. Raw Code: Summaries accelerate broad Q&A but might omit details crucial for debugging or code generation. A system should be able to fall back on actual code if the question demands specifics.
Language-Agnostic vs. Language-Aware: Heuristic splitting and generic text embeddings let you handle any language. AST-based chunking and specialized code embeddings yield deeper understanding but require per-language tooling.
Preprocessing Overhead: Indexing and embedding large repos can take hours; once built, queries run faster. If code changes frequently, you need to re-index or re-embed incrementally.
Long Context Models vs. Retrieval: 100K- to 1M-token contexts help, but still do not scale to monstrous codebases. Retrieval remains key to focusing the model on only the relevant parts.
Confidence & Evaluation: An LLM that lacks correct context may hallucinate. Systems should show the retrieved snippets or references so developers can verify. Enterprise teams often build labeled internal datasets to evaluate retrieval and correctness.
Putting these pieces together, a practical workflow looks like this:

1. Choose a chunking strategy:
   - Function-level, class-level, or file-level segmentation.
   - Consider AST-based or heuristic (e.g., indentation) chunk boundaries.
2. Build an index (keyword or vector, or both):
   - Use a tool like Sourcegraph for lexical indexing or a vector DB (FAISS, Pinecone) for embeddings.
   - If using code-aware embeddings (CodeBERT, CodeT5), ensure your languages are supported.
3. Retrieval logic:
   - Single-step retrieval if queries typically map to one code snippet.
   - Multi-stage (hierarchical) retrieval for large or multi-hop queries.
   - Summaries can route queries to specific subdirectories, then fetch raw code.
4. Integrate with an LLM:
   - RAG prompt structure: "User question + retrieved snippets" → LLM answers.
   - For code completion, embed the partial code → retrieve similar code → feed into the completion model (a minimal sketch follows this checklist).
5. (Optional) Summarize:
   - Precompute multi-level summaries.
   - If the user asks a broad question, consult summaries first, then refine retrieval if needed.
6. Handle updates & increments:
   - Periodically re-embed or re-index changed code.
   - Use incremental indexing if available (e.g., watch your VCS for diffs).
7. Evaluate:
   - Keep a reference set of QA or completion queries.
   - Check retrieval coverage, LLM answer accuracy, and developer feedback.
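As a sketch of step 4 for the completion case (the QA variant appears earlier in this post), assuming placeholder `embed`, `index`, and `complete` callables:

```python
# Retrieval-augmented code completion: embed the in-progress file,
# fetch similar chunks from the repository index, and prepend them to
# the completion prompt. `embed`, `index`, and `complete` are
# placeholders for your encoder, vector store, and completion model.

def complete_with_context(partial_code: str, embed, index, complete, k: int = 3) -> str:
    neighbors = index.search(embed(partial_code), top_k=k)   # similar code from elsewhere in the repo
    context = "\n".join(f"# From {n.path}\n{n.text}" for n in neighbors)
    prompt = f"{context}\n\n# Current file (continue this code):\n{partial_code}"
    return complete(prompt)
```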
As large language models become a core part of software development workflows, organizations and open-source communities face a fundamental question: how do you give the model useful access to a codebase that far exceeds its context window?
The answer lies in retrieval-based techniques (RAG, hierarchical searching, embedding-based or keyword-based indexes), intelligent chunking, and possibly repository-level summaries. These methods let the model “see” only what it needs and drastically reduce hallucinations while boosting accuracy. Real-world solutions like Sourcegraph Cody, Pieces Copilot, and Amazon’s CodeWhisperer demonstrate the effectiveness of combining fast search or embeddings with LLM inference.
There is no universal solution. The right approach depends on codebase size, language diversity, performance requirements, and developer workflow preferences. Yet the principles are consistent: index, retrieve, and present the LLM with just enough relevant context. This synergy of search and generation will remain central as codebases continue to grow, and as context windows—though expanding—still rarely match the full scale of enterprise repositories.
Key Papers & Research:
Tools & Frameworks: