Component-Wise RAG Prioritization for Software Engineering Tasks

For software engineering RAG, retriever choice influences quality more than generator choice, and BM25 is the robust default on identifier-heavy code corpora.

The counter-intuitive result

When a RAG pipeline underperforms, the reflex is to upgrade the generator. A component-wise study of 21+ models across the RAG pipeline reports the opposite priority for software engineering workloads: retriever choice shapes performance more than generator choice across code generation, summarization, and program repair (Ke et al., 2026).

The same study reports that classical lexical BM25 is "exceptionally robust" across the three SE tasks — competitive with or better than dense and hybrid retrievers (Ke et al., 2026).

The four component axes

The study splits an SE-task RAG pipeline into four independent axes and isolates each one against the others (Ke et al., 2026):

graph LR
    Q[Query] --> QP[Query Processing<br/>4 techniques]
    QP --> R[Retriever<br/>7 models: sparse/dense/hybrid]
    R --> CR[Context Refinement<br/>4 methods]
    CR --> G[Generator<br/>6 models]
    G --> O[Code Gen / Summary / Repair]

Axis	Variants	Relative effect
Retriever	7 models: sparse (BM25), dense, hybrid	Largest effect
Query processing	4 techniques (including rewriting, expansion, decomposition)	Conditional on retriever
Context refinement	4 methods (including re-ranking, filtering, compression)	Conditional
Generator	6 models across capability tiers	Smaller than retriever

Component effects are not independent: query processing helps some retrievers more than others (Ke et al., 2026).

Why retrievers dominate

The generator can only reason about the tokens it sees. If the retriever misses the relevant function definition, error type, or test fixture, the generator cannot recover it — retrieval recall bounds generator accuracy from above. A stronger generator does nothing for documents that were never retrieved.
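The ceiling argument can be made concrete with a small sketch. It assumes a hypothetical evaluation set in which each query has a known set of relevant document IDs, and it assumes the generator cannot answer from parametric knowledge alone, so a query counts as answerable only if at least one relevant document reaches the top-k. The function names and the toy document IDs are illustrative and do not come from the cited study.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def accuracy_ceiling(runs: Dict[str, List[str]],
                     gold: Dict[str, Set[str]], k: int) -> float:
    """Share of queries with at least one relevant document in the top-k.

    Under the stated assumption (no answer without retrieved evidence),
    no generator can score above this value on the same retrieval run.
    """
    answerable = sum(
        1 for qid, docs in runs.items()
        if gold[qid].intersection(docs[:k])
    )
    return answerable / len(runs)


# Hypothetical toy data: IDs are made up for illustration only.
runs = {
    "q1": ["utils.py::parse", "io.py::read", "db.py::connect"],
    "q2": ["io.py::read", "cli.py::main", "db.py::connect"],
}
gold = {
    "q1": {"utils.py::parse"},
    "q2": {"auth.py::login"},  # never retrieved, so q2 cannot be answered
}

ceiling = accuracy_ceiling(runs, gold, k=3)
print(f"accuracy ceiling at k=3: {ceiling:.2f}")
```

Swapping in a stronger generator leaves `ceiling` unchanged, because the value depends only on the retrieval run.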

The companion chunking study shows the same pattern: extending cross-file context length from 2k to 8k tokens (a fourfold increase) delivers more accuracy gain than picking among non-dominated chunking strategies (Wu et al., 2026). Retrieval-side levers dominate.

Why BM25 holds up on SE tasks

SE retrieval queries — function names, identifiers, error messages, API symbols — share heavy lexical overlap with target documents. BM25's term-frequency signal is strong precisely when that overlap is high. Dense embeddings add value when surface forms diverge (natural-language queries against code, semantic paraphrases), but in SE-task retrieval the corpus structure often closes the term gap already (Ke et al., 2026).
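A minimal, dependency-free sketch of BM25 scoring shows how the lexical signal plays out on identifier-style text. It uses the common non-negative IDF variant and an identifier-aware tokenizer that splits snake_case and camelCase; both are choices made for this illustration, not details taken from Ke et al. The toy documents and the default `k1` and `b` values are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercased tokens; splits snake_case, camelCase, and acronyms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", text)
    return [p.lower() for p in parts]


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi-style BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalization, k1 controls tf saturation.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical three-document corpus.
docs = [
    "def parse_config(path): return load_yaml(path)",
    "def send_request(url): return http_get(url)",
    "class ConfigError(Exception): pass",
]
scores = bm25_scores("where is parse_config defined", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The query shares the tokens `parse` and `config` with the first document, so that document ranks first, while the document with no shared tokens scores zero. This is the high-overlap regime in which the term-frequency signal is strong.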

Independent agentic-search evidence agrees: a tuned BM25 paired with a frontier agent reaches 83.1% answer accuracy on BrowseComp-Plus, outperforming dense-retriever search agents (Hsu, Yang, Lin, 2026).

Prioritization order

The findings translate to a concrete investment order for SE-task RAG:

graph TD
    A[RAG underperforms] --> B{Retriever tuned?}
    B -->|No| C[Tune retriever first<br/>BM25 baseline + b/k1 tuning]
    B -->|Yes| D{Context length<br/>and chunking set?}
    D -->|No| E[Extend context length<br/>fix chunking strategy]
    D -->|Yes| F{Query processing<br/>and refinement tested?}
    F -->|No| G[Test query rewriting<br/>and re-ranking]
    F -->|Yes| H[Then consider<br/>generator upgrade]

1. Retriever first: tune BM25's b and k1 against representative SE queries before assuming more sophisticated retrievers help. Ke et al. reframe BM25 as a default, not a fallback (Ke et al., 2026).
2. Context length and chunking: pick a Pareto-optimal chunker (Sliding Window or cAST) and extend cross-file context within the model's effective range (Wu et al., 2026).
3. Query processing and refinement: gains depend on the retriever in use, so test against the tuned retriever, not a stock one (Ke et al., 2026).
4. Generator: the smallest lever in the study. Upgrade only after the retrieval side is exhausted.
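The first step, tuning b and k1, can be sketched as a plain grid search. The sketch assumes a caller-supplied `evaluate(b, k1)` function that rebuilds or re-scores the BM25 index and returns a quality metric such as recall@k on held-out SE queries. That function, the grid values, and the placeholder evaluator below are all hypothetical; the placeholder is a synthetic stand-in so the loop runs, and its output is not a measurement.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def tune_bm25(evaluate: Callable[[float, float], float],
              b_grid: Iterable[float],
              k1_grid: Iterable[float]) -> Tuple[float, float, float]:
    """Exhaustive grid search; returns (best_b, best_k1, best_score)."""
    best = None
    for b, k1 in product(b_grid, k1_grid):
        score = evaluate(b, k1)
        if best is None or score > best[2]:
            best = (b, k1, score)
    if best is None:
        raise ValueError("empty parameter grid")
    return best


def placeholder_evaluate(b: float, k1: float) -> float:
    """Synthetic stand-in for a real evaluation run (NOT real data).

    Replace with a function that scores held-out queries against the
    index built with the given b and k1.
    """
    return -((b - 0.6) ** 2 + (k1 - 1.4) ** 2)


best_b, best_k1, best_score = tune_bm25(
    placeholder_evaluate,
    b_grid=[0.3, 0.45, 0.6, 0.75, 0.9],
    k1_grid=[0.8, 1.1, 1.4, 1.7, 2.0],
)
print(best_b, best_k1)
```

Keeping the evaluator injectable means the same loop can later score query processing and refinement variants against the tuned retriever rather than a stock one.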

When the default inverts

The Ke et al. findings are conditional. The opposite prioritization — generator-first, dense-retriever default — wins under these conditions:

Sub-frontier generators: weaker generators cannot compensate for retrieval-rank noise. On BrowseComp-Plus, Search-R1 + BM25 reaches 3.86% accuracy while GPT-5 + Qwen3-Embedding-8B reaches 70.1% (Chen et al., 2025). With a constrained generator budget, dense retrieval shifts the precision burden off the agent.
Natural-language-to-code retrieval: when queries are user-typed descriptions ("deduplicate a list while preserving order") and target documents are identifiers (unique_ordered), the lexical signal collapses. Dense retrieval bridges the term gap. BM25 does not.
Drifting corpora: BM25's tuned b and k1 parameters age with the codebase. Post-refactor, post-rename, or in fast-moving repos, dense embeddings re-index without parameter retuning.
Open-domain or non-SE workloads: the Ke et al. result is specific to SE corpora with high identifier-query overlap. It does not transfer to general QA, multilingual retrieval, or RAG over prose.

Outside these conditions, the retriever-first, BM25-default ordering is the higher-yield choice.
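One way to check which regime a workload falls into is to measure lexical overlap between sample queries and the documents they should retrieve. The sketch below is a rough diagnostic under stated assumptions: the tokenizer is deliberately simple (it splits snake_case but not camelCase), and the 0.3 threshold in the hypothetical `suggest_retriever` helper is an arbitrary illustration that would need calibration on real query logs.

```python
import re
from typing import List, Sequence, Tuple


def tokens(text: str) -> List[str]:
    """Lowercased alphanumeric tokens; underscores act as separators."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_overlap(query: str, target: str) -> float:
    """Fraction of distinct query tokens that also occur in the target."""
    q = set(tokens(query))
    if not q:
        return 0.0
    return len(q & set(tokens(target))) / len(q)


def suggest_retriever(pairs: Sequence[Tuple[str, str]],
                      threshold: float = 0.3) -> str:
    """Suggest a default from mean overlap; the threshold is hypothetical."""
    mean = sum(query_overlap(q, t) for q, t in pairs) / len(pairs)
    return "bm25" if mean >= threshold else "dense"


target = "def unique_ordered(items): return tuple(dict.fromkeys(items))"
nl_query = "deduplicate a list while preserving order"
id_query = "unique_ordered returns tuple"

print(query_overlap(nl_query, target))   # no shared tokens
print(query_overlap(id_query, target))   # three of four tokens shared
print(suggest_retriever([(nl_query, target)]))
print(suggest_retriever([(id_query, target)]))
```

The natural-language query shares no token with the target (`order` does not match `ordered` without stemming), which is the collapse of lexical signal described above; the identifier-style query overlaps heavily.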

Example

A team runs a RAG-augmented code repair assistant over a 200k-file Python monorepo. The current pipeline uses a managed dense embedding service, a reranker, and an upgrade plan to swap the generator from a mid-tier model to a frontier one.

Before — generator-first investment, dense + reranker default:

retriever:
  primary: dense
  embedder: text-embedding-3-large
  reranker: bge-reranker-v2-m3
  top_k: 10
generator:
  current: mid-tier
  upgrade_target: frontier-tier
infrastructure:
  vector_db: managed
  monthly_cost_usd: 4200

After — retriever-first, BM25 baseline, generator upgrade deferred:

retriever:
  primary: bm25
  tuning:
    b: 0.65       # calibrated against held-out SE queries
    k1: 1.4
  top_k: 30        # higher recall, agent filters
  fallback: dense  # used only when BM25 recall is low
generator:
  current: mid-tier
  upgrade_target: deferred-until-retriever-exhausted
infrastructure:
  bm25_index: self-hosted
  monthly_cost_usd: 400

The retriever-first ordering cuts the dense-stack cost, defers the generator upgrade until the retrieval side is exhausted, and matches the prioritization the Ke et al. component-wise study supports (Ke et al., 2026). Re-evaluate on held-out repair tasks before committing: the direction of the effect is supported, but absolute deltas depend on how much the corpus's identifiers overlap with the queries.
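The `fallback: dense` entry in the "After" config could be wired up roughly as follows. Recall cannot be observed at query time, so this sketch substitutes two proxies, too few BM25 hits or a weak top score, and both thresholds (`min_hits`, `min_top_score`) are hypothetical values that would need calibration. The retrievers are passed in as plain callables, and the stubs at the bottom exist only to exercise the routing.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]                      # (document id, score)
Retriever = Callable[[str, int], List[Hit]]  # (query, top_k) -> ranked hits


def retrieve_with_fallback(query: str,
                           bm25: Retriever,
                           dense: Retriever,
                           top_k: int = 30,
                           min_hits: int = 3,
                           min_top_score: float = 1.0) -> List[Hit]:
    """Use BM25 by default; call the dense retriever only on weak results."""
    hits = bm25(query, top_k)
    weak = (
        not hits
        or len(hits) < min_hits
        or hits[0][1] < min_top_score
    )
    return dense(query, top_k) if weak else hits


# Stub retrievers with made-up IDs and scores, for illustration only.
def stub_bm25(query: str, top_k: int) -> List[Hit]:
    if "parse_config" in query:
        return [("cfg/loader.py", 7.2), ("cfg/errors.py", 4.1),
                ("tests/test_cfg.py", 3.3)][:top_k]
    return []


def stub_dense(query: str, top_k: int) -> List[Hit]:
    return [("util/collections.py", 0.81)][:top_k]


lexical = retrieve_with_fallback("KeyError in parse_config", stub_bm25, stub_dense)
semantic = retrieve_with_fallback("deduplicate a list", stub_bm25, stub_dense)
print(lexical[0][0], semantic[0][0])
```

Routing this way keeps the dense service off the hot path for identifier-heavy queries while still covering natural-language queries where BM25 returns little.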

Key Takeaways

A 21+ model component-wise empirical study reports retriever choice exerts more influence on SE-task RAG quality than generator choice across code generation, summarization, and program repair (Ke et al., 2026).
BM25 is "exceptionally robust" across the three SE tasks despite being older and cheaper than dense or hybrid alternatives — a default, not a fallback, for SE corpora (Ke et al., 2026).
Investment order: tune retriever first, then context length and chunking, then query processing and refinement, then generator. Generator upgrades are the smallest lever in the study.
The result inverts when the generator is sub-frontier, queries are natural-language against code, the corpus drifts faster than BM25 parameters retune, or the workload leaves the SE-task setting.
The retriever-dominance mechanism is consistent across SE-task RAG (Ke et al., 2026), code-completion chunking (Wu et al., 2026), and frontier-agent deep-research search (Hsu, Yang, Lin, 2026) — retrieval-side levers dominate generator-side levers when the generator is strong enough to use what it sees.

Related

Chunking Strategy for RAG-Based Code Completion — companion empirical study; chunk strategy is a smaller lever than context length, both retrieval-side
Repository-Level Retrieval for Code Generation — RACG survey covering the retrieval-strategy hierarchy (lexical, semantic, graph, hybrid)
Lexical-First Retrieval for Agentic Search — independent evidence that BM25 + frontier agent matches or beats dense retrieval in deep-research loops
Retrieval-Augmented Agent Workflows — JIT-context strategy that pairs with whichever retriever you select
Repository Map Pattern — AST + PageRank symbol selection; structural retrieval complementary to BM25 or dense
Semantic Context Loading — LSP-based retrieval bypassing chunk-based RAG entirely for symbol-level navigation
Codebase-Derived Pattern Libraries as Agent Context — narrows the retrieval corpus to vetted in-house code rather than tuning the retriever over a fixed corpus