qa-over-large-code – Techniques and frameworks for retrieval-based question answering and code completion across large software repositories that exceed typical LLM context windows.
Modern large language models (LLMs) are limited by fixed context windows (from 4K to ~100K tokens). Real-world codebases, however, can span tens or hundreds of thousands of lines. This blog post covers how to overcome these limitations using retrieval-augmented generation (RAG), hierarchical retrieval, repository-level summarization, indexing and chunking strategies, code-aware embeddings, and lessons from real-world systems.
Throughout, we highlight trade-offs in coverage, precision, language support, and performance.
LLMs cannot simply “read” an entire repository into a single prompt because of token limits. As codebases scale, even a 100K-token model might not suffice. Hence, we rely on retrieval or summarization to selectively provide only the relevant parts. This concept is sometimes called RAG (Retrieval-Augmented Generation) when the LLM consults external knowledge sources in real time.
RAG frameworks augment an LLM by retrieving relevant documents (code snippets, docs, comments) and injecting them into the prompt. For code tasks, this means pulling in the functions, documentation, or comments most relevant to the user's question or to the code being completed.
By grounding the answer or completion in actual repository data, the model reduces hallucinations and provides more accurate results. Research like REDCODER has shown substantial gains in code generation and summarization through RAG.
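To make the pattern concrete, here is a minimal sketch of that loop for code QA. The `embed` function, the vector index, and the `call_llm` client are placeholders for whatever encoder, store, and model you use; only the retrieve-then-prompt structure matters.

```python
# Minimal RAG loop for code QA. `embed`, `vector_index`, and `call_llm`
# are placeholders; retrieved hits are assumed to expose .path,
# .start_line, .end_line, and .text.

def answer_question(question: str, embed, vector_index, call_llm, k: int = 5) -> str:
    query_vec = embed(question)                      # embed the natural-language question
    hits = vector_index.search(query_vec, top_k=k)   # retrieve the k most similar code chunks
    context = "\n\n".join(
        f"# {hit.path}:{hit.start_line}-{hit.end_line}\n{hit.text}" for hit in hits
    )
    prompt = (
        "Answer the question using only the repository snippets below.\n"
        f"--- SNIPPETS ---\n{context}\n--- QUESTION ---\n{question}\n"
        "Cite the file paths you relied on."
    )
    return call_llm(prompt)
```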
Single-step retrieval from a massive codebase may still produce large or off-target chunks. Hierarchical retrieval addresses this by searching in layers: first narrowing the scope to likely modules or files, then drilling down to the specific functions or lines that answer the query.
Multi-hop questions—those requiring references from multiple parts of the code—can be answered by iteratively retrieving new context based on partial answers (similar to multi-hop QA in NLP). Tools like Sourcegraph Cody often combine multiple search methods (keyword and semantic) to ensure both precision and recall.
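A coarse-to-fine retrieval step might look like the sketch below, assuming one index over file-level summaries and a per-file index over function-level chunks; both are hypothetical stores whose results expose `.path` and `.score`.

```python
# Two-stage (coarse-to-fine) retrieval sketch. `file_index` holds one
# embedding per file summary; `chunk_index[path]` holds embeddings for
# that file's function-level chunks. Both are assumed vector stores.

def hierarchical_retrieve(query_vec, file_index, chunk_index,
                          n_files: int = 3, n_chunks: int = 4):
    # Stage 1: narrow the search space to a few likely files.
    candidate_files = file_index.search(query_vec, top_k=n_files)

    # Stage 2: rank fine-grained chunks only within those files.
    results = []
    for f in candidate_files:
        results.extend(chunk_index[f.path].search(query_vec, top_k=n_chunks))
    return sorted(results, key=lambda r: r.score, reverse=True)[:n_chunks]
```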
Another way to squeeze repository knowledge into an LLM’s context is summarization: condensing functions, files, and modules into short descriptions that occupy far fewer tokens than the raw code.
A hierarchical summarization pipeline can bubble up details from lower-level units to produce a repository-wide overview. This helps the model quickly locate relevant parts without retrieving raw code. However, summaries risk omitting important details (e.g., exact regex patterns). Tools must balance high-level context with lower-level code retrieval when the user’s query requires specifics.
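One way to build such a pipeline is sketched below, assuming a generic `summarize` LLM call and a simple `{path: [function bodies]}` view of the repository; the prompt wording is illustrative only.

```python
# Bottom-up summarization sketch: function -> file -> repository.
# `summarize` stands in for any LLM call that condenses text.

def summarize_repository(repo: dict[str, list[str]], summarize) -> str:
    """repo maps file paths to lists of function bodies."""
    file_summaries = {}
    for path, functions in repo.items():
        func_summaries = [summarize(f"Summarize this function:\n{src}") for src in functions]
        file_summaries[path] = summarize(
            f"Summarize file {path} from its function summaries:\n" + "\n".join(func_summaries)
        )
    # Roll file-level summaries up into one repository-wide overview.
    return summarize(
        "Give a high-level overview of this repository:\n"
        + "\n".join(f"{p}: {s}" for p, s in file_summaries.items())
    )
```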
To implement RAG or summarization effectively, you need an index over the repository, whether a classic keyword (lexical) index, a vector store of embeddings, or a combination of both.
Many real-world solutions use a hybrid approach (keyword + embeddings) for the best coverage. Sourcegraph’s early attempts at code embeddings revealed scaling issues for very large monorepos, while Amazon CodeWhisperer’s Repoformer sometimes prefers a simple token-overlap measure (Jaccard) for retrieval. Ultimately, the best approach depends on codebase size, language diversity, and the environment’s performance constraints.
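A hybrid ranker can be as simple as a weighted sum of an identifier-overlap score and embedding cosine similarity. The sketch below assumes a generic `embed` function returning NumPy vectors and an arbitrary 50/50 weighting that you would tune for your codebase.

```python
# Hybrid retrieval sketch: lexical (shared identifiers) + semantic
# (embedding cosine) scoring. `embed` is any code/text encoder that
# returns fixed-size 1-D NumPy vectors.
import re
import numpy as np

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))

def hybrid_rank(query: str, chunks: list[str], embed, top_k: int = 5) -> list[str]:
    q_vec = embed(query)
    c_vecs = np.stack([embed(c) for c in chunks])
    cosine = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)

    q_toks = tokens(query)
    lexical = np.array([len(q_toks & tokens(c)) / (len(q_toks) or 1) for c in chunks])

    score = 0.5 * cosine + 0.5 * lexical          # arbitrary weighting; tune per codebase
    return [chunks[i] for i in np.argsort(-score)[:top_k]]
```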
Code-aware embedding models (CodeBERT, GraphCodeBERT, CodeT5, UniXCoder) go beyond plain-text embeddings by learning code syntax and semantics. They can recognize function calls, dataflow, or cross-language similarities, so retrieval matches on meaning rather than on surface tokens alone.
The downside is increased complexity (and compute overhead) plus the need to keep embeddings up to date as code changes. Some industrial teams find simpler solutions (like token overlap) surprisingly competitive.
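For reference, embedding a chunk with one of these encoders takes only a few lines with the Hugging Face `transformers` library. The sketch below uses the `microsoft/codebert-base` checkpoint and mean-pools the final hidden states, which is one common pooling choice rather than the only one.

```python
# Embedding a code chunk with a code-aware encoder (CodeBERT via the
# Hugging Face `transformers` library); any similar encoder would work.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed_code(snippet: str) -> torch.Tensor:
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)      # mean-pooled vector, shape (1, 768)
```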
Sourcegraph Cody: Integrates with Sourcegraph’s fast code search (keyword-based) to fetch relevant snippets, then feeds them to an LLM. They experimented with embeddings but found them costly at large scale. Cody exemplifies multi-stage retrieval combining lexical and (sometimes) semantic signals.
Pieces Copilot: Uses a purely local approach for privacy: segments code heuristically, embeds each segment, stores vectors, and retrieves top matches for QA or completions. Summaries of chat history allow multi-turn Q&A.
Amazon CodeWhisperer (Repoformer): Shows “selective retrieval” for code completion. The model decides if it even needs external context, thereby optimizing for latency. Simple token-overlap matching often outperformed advanced embedding methods in their tests; a quick sketch of the idea follows this list of systems.
Open-Source Frameworks: The same pattern can be assembled from open-source building blocks (e.g., LangChain or LlamaIndex for chunking, embedding, and retrieval orchestration, plus a vector store such as FAISS) wired to the LLM of your choice.
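To make the token-overlap idea concrete, the sketch below computes Jaccard similarity between the last few lines of the file being edited and candidate cross-file chunks, and skips retrieval entirely when nothing overlaps enough. The fixed threshold is only a stand-in for Repoformer's learned decision about when to retrieve.

```python
# Jaccard token-overlap retrieval for code completion, with a simple
# threshold gate standing in for "selective retrieval".
import re

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / (len(a | b) or 1)

def maybe_retrieve(prefix: str, chunks: list[str], threshold: float = 0.2):
    # Use the last 20 lines of the in-progress file as the query.
    query = set(re.findall(r"\w+", "\n".join(prefix.splitlines()[-20:])))
    scored = sorted(((jaccard(query, set(re.findall(r"\w+", c))), c) for c in chunks),
                    reverse=True)
    best_score, best_chunk = scored[0] if scored else (0.0, None)
    # Skip cross-file context entirely when nothing overlaps enough.
    return best_chunk if best_score >= threshold else None
```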
Granularity vs. Token Budget: Function-level or class-level chunking is a common sweet spot. Overly large chunks risk blowing up the context window; tiny chunks may lose context (a minimal AST-based chunker is sketched after these trade-offs).
Summaries vs. Raw Code: Summaries accelerate broad Q&A but might omit details crucial for debugging or code generation. A system should be able to fall back on actual code if the question demands specifics.
Language-Agnostic vs. Language-Aware: Heuristic splitting and generic text embeddings let you handle any language. AST-based chunking and specialized code embeddings yield deeper understanding but require per-language tooling.
Preprocessing Overhead: Indexing and embedding large repos can take hours; once built, queries run faster. If code changes frequently, you need to re-index or re-embed incrementally.
Long Context Models vs. Retrieval: 100K- to 1M-token contexts help, but still do not scale to monstrous codebases. Retrieval remains key to focusing the model on only the relevant parts.
Confidence & Evaluation: An LLM that lacks correct context may hallucinate. Systems should show the retrieved snippets or references so developers can verify. Enterprise teams often build labeled internal datasets to evaluate retrieval and correctness.
Putting these pieces together, a practical workflow looks like this:

1. Choose a chunking strategy:
   - Function-level, class-level, or file-level segmentation.
   - Consider AST-based or heuristic (e.g., indentation) chunk boundaries.
2. Build an index (keyword or vector, or both):
   - Use a tool like Sourcegraph for lexical indexing or a vector DB (FAISS, Pinecone) for embeddings.
   - If using code-aware embeddings (CodeBERT, CodeT5), ensure your languages are supported.
3. Retrieval logic:
   - Single-step retrieval if queries typically map to one code snippet.
   - Multi-stage (hierarchical) retrieval for large or multi-hop queries.
   - Summaries can route queries to specific subdirectories, then fetch raw code.
4. Integrate with an LLM:
   - RAG prompt structure: "User question + retrieved snippets" → LLM answers.
   - For code completion, embed the partial code → retrieve similar code → feed into the completion model (a minimal sketch follows this checklist).
5. (Optional) Summarize:
   - Precompute multi-level summaries.
   - If the user asks a broad question, consult summaries first, then refine retrieval if needed.
6. Handle updates & increments:
   - Periodically re-embed or re-index changed code.
   - Use incremental indexing if available (e.g., watch your VCS for diffs).
7. Evaluate:
   - Keep a reference set of QA or completion queries.
   - Check retrieval coverage, LLM answer accuracy, and developer feedback.
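As a sketch of step 4 for the completion case (the QA variant appears earlier in this post), assuming placeholder `embed`, `index`, and `complete` callables:

```python
# Retrieval-augmented code completion: embed the in-progress file,
# fetch similar chunks from the repository index, and prepend them to
# the completion prompt. `embed`, `index`, and `complete` are
# placeholders for your encoder, vector store, and completion model.

def complete_with_context(partial_code: str, embed, index, complete, k: int = 3) -> str:
    neighbors = index.search(embed(partial_code), top_k=k)   # similar code from elsewhere in the repo
    context = "\n".join(f"# From {n.path}\n{n.text}" for n in neighbors)
    prompt = f"{context}\n\n# Current file (continue this code):\n{partial_code}"
    return complete(prompt)
```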
As large language models become a core part of software development workflows, organizations and open-source communities face a fundamental question: how do you give the model useful access to a codebase that far exceeds its context window?
The answer lies in retrieval-based techniques (RAG, hierarchical searching, embedding-based or keyword-based indexes), intelligent chunking, and possibly repository-level summaries. These methods let the model “see” only what it needs and drastically reduce hallucinations while boosting accuracy. Real-world solutions like Sourcegraph Cody, Pieces Copilot, and Amazon’s CodeWhisperer demonstrate the effectiveness of combining fast search or embeddings with LLM inference.
There is no universal solution. The right approach depends on codebase size, language diversity, performance requirements, and developer workflow preferences. Yet the principles are consistent: index, retrieve, and present the LLM with just enough relevant context. This synergy of search and generation will remain central as codebases continue to grow, and as context windows—though expanding—still rarely match the full scale of enterprise repositories.
Key Papers & Research:
Tools & Frameworks: