LLM-Driven Logical Retrieval: Boolean Queries over an Inverted Index
A frontier LLM emits AND/OR/NOT logical queries against an inverted index, matching hybrid retrieval accuracy at large scale with 41× faster index construction.
When This Pattern Applies
The pattern holds only when all four conditions are met:
- Frontier-capable agent LLM, able to plan multi-hop questions and author well-formed Boolean expressions. Weaker generators collapse: Search-R1 paired with BM25 reaches 3.86% accuracy on BrowseComp-Plus, whereas a frontier agent using the same retriever reaches 83.1% (Chen et al., 2025; Hsu, Yang, Lin, 2026).
- Lexical-overlap-rich corpus — multi-hop QA over Wikipedia-style text, code, technical documentation, log lines. Queries and target documents share surface forms. The pattern weakens when the same concept has many surface forms with no shared tokens.
- Construction cost matters: the index is rebuilt often, or the indexing budget is constrained. If the index is static and built once, hybrid's build cost amortises toward zero and this advantage disappears.
- Hallucination on unanswerable queries is a tracked metric: the Boolean "no match" gives a sharper unanswerable signal than a low cosine score.
Outside these conditions, an agentic hybrid baseline retains a small accuracy edge and is the more conservative default.
The Architecture
LogicalRAG (Zeng et al., 2026) delegates retrieval intent to the LLM and shrinks the backend to a faithful executor of that intent:
graph LR
Q[User Question] --> A[Agent LLM]
A -->|Boolean query| L[Logical Layer<br/>AND / OR / NOT<br/>title:entity<br/>quoted phrases]
L --> I[Inverted Index]
I -->|matched set| B[BM25 Rank]
B -->|top-k docs| A
A -->|next query or answer| O[Answer or Refine]
Two execution phases: Boolean logic determines the eligible document set, then BM25 ranks within that set. The interface exposes AND, OR, NOT, quoted phrases for exact matching, and field-targeting like title:entity_name (Zeng et al., 2026).
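The two phases can be sketched in a few lines of Python. This is an illustrative toy, not the LogicalRAG implementation: the corpus, the tuple-based query tree (used here instead of a query-string parser), and the function names `build_index`, `evaluate`, and `bm25_rank` are all assumptions made for the example. Field targeting and phrase matching are omitted for brevity.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; document ids and texts are invented for illustration.
DOCS = {
    1: "rate limit errors return 429 too many requests",
    2: "deprecated rate limit header returns 429",
    3: "asyncio event loop basics",
}

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def evaluate(node, index, universe):
    """Phase 1: evaluate a Boolean query tree to the eligible document set."""
    op = node[0]
    if op == "TERM":
        return set(index.get(node[1], set()))
    if op == "AND":
        return evaluate(node[1], index, universe) & evaluate(node[2], index, universe)
    if op == "OR":
        return evaluate(node[1], index, universe) | evaluate(node[2], index, universe)
    if op == "NOT":
        return universe - evaluate(node[1], index, universe)
    raise ValueError(f"unknown operator: {op}")

def bm25_rank(terms, eligible, docs, index, k1=1.2, b=0.75):
    """Phase 2: score only the eligible documents with BM25, best first."""
    n = len(docs)
    lengths = {d: len(tokenize(t)) for d, t in docs.items()}
    avgdl = sum(lengths.values()) / n
    scores = {}
    for doc_id in eligible:
        tf = Counter(tokenize(docs[doc_id]))
        score = 0.0
        for term in terms:
            df = len(index.get(term, ()))
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * lengths[doc_id] / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

index = build_index(DOCS)
# Equivalent to: rate AND 429 NOT deprecated
query = ("AND", ("AND", ("TERM", "rate"), ("TERM", "429")), ("NOT", ("TERM", "deprecated")))
eligible = evaluate(query, index, set(DOCS))
ranked = bm25_rank(["rate", "429"], eligible, DOCS, index)
```

The point of the split is that BM25 never sees documents the Boolean phase excluded: document 2 matches both ranking terms but is removed by the `NOT` clause before any scoring happens.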
The agent iterates: read intermediate results, refine the logical query, re-issue. The retrieval backend has no notion of semantic similarity — it only executes what the LLM authors.
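A minimal sketch of that refinement loop follows, under loud assumptions: `stub_agent` is a hypothetical stand-in for the LLM call (a real agent would read the intermediate results and author the next query itself), `search` is a simplified backend that supports only required and excluded terms, and the corpus is invented.

```python
# Toy corpus; ids and texts are invented for illustration.
DOCS = {
    1: "asyncio event loop policy",
    2: "twisted reactor event loop",
}

def search(required, excluded):
    """Stand-in backend: documents with every required term and no excluded term."""
    hits = []
    for doc_id, text in DOCS.items():
        words = set(text.split())
        if set(required) <= words and not (set(excluded) & words):
            hits.append(doc_id)
    return hits

def stub_agent(question, history):
    """Hypothetical stand-in for the LLM: start strict, then drop the last required term."""
    if not history:
        return (["asyncio", "event", "scheduler"], ["twisted"])
    required, excluded = history[-1][0]
    return (required[:-1], excluded)

def retrieve(question, propose, max_turns=3):
    """Issue a query, read the result, and refine until something matches or turns run out."""
    history = []
    for _ in range(max_turns):
        query = propose(question, history)
        hits = search(*query)
        history.append((query, hits))
        if hits:
            break
    return history

trace = retrieve("How does asyncio schedule callbacks?", stub_agent)
```

In this toy run the first query is too strict and returns nothing; the second, relaxed query matches document 1. The backend stays a plain executor throughout, and all refinement logic lives on the agent side.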
Reported Results
| Metric | LogicalRAG | Agentic Hybrid | Source |
|---|---|---|---|
| Medium-scale accuracy (HotpotQA / 2WikiMultiHopQA / MuSiQue avg.) | 0.784 | 0.807 | Zeng et al., 2026 |
| KILT Wikipedia accuracy | 0.717 | 0.716 | Zeng et al., 2026 |
| KILT throughput (16 concurrent) | 152.5 QPS | 66.6 QPS | Zeng et al., 2026 |
| KILT mean latency | 74.9 ms | 230.5 ms | Zeng et al., 2026 |
| Index construction time | 1.27 h | 52.02 h | Zeng et al., 2026 |
| Hallucination rate (answer-unavailable) | 0.083 | 0.128 | Zeng et al., 2026 |
The headline "matches hybrid" claim holds for accuracy at KILT scale, where LogicalRAG is also faster to query and far cheaper to index; on medium-scale multi-hop QA the pattern trails hybrid by 2.3 accuracy points (0.784 vs. 0.807). The trade is worthwhile only when index-rebuild cost and unanswerable-query hallucination matter as much as raw accuracy.
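The ratios quoted in this article can be checked directly from the table. The snippet below only re-derives them from the reported figures; the dictionary keys are names chosen for this example.

```python
# Figures copied from the results table (Zeng et al., 2026).
logical = {"index_hours": 1.27, "latency_ms": 74.9, "qps": 152.5,
           "medium_acc": 0.784, "hallucination": 0.083}
hybrid = {"index_hours": 52.02, "latency_ms": 230.5, "qps": 66.6,
          "medium_acc": 0.807, "hallucination": 0.128}

# How many times faster index construction is (about 41x).
index_speedup = hybrid["index_hours"] / logical["index_hours"]

# How many times lower mean latency is (about 3.1x).
latency_ratio = hybrid["latency_ms"] / logical["latency_ms"]

# How many times higher throughput is at 16 concurrent clients (about 2.3x).
throughput_ratio = logical["qps"] / hybrid["qps"]

# Medium-scale accuracy gap in percentage points, in hybrid's favour (2.3).
accuracy_gap_points = (hybrid["medium_acc"] - logical["medium_acc"]) * 100

# Relative reduction in hallucination rate on answer-unavailable queries (about 35%).
hallucination_reduction = 1 - logical["hallucination"] / hybrid["hallucination"]
```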
Why It Works
The pattern moves retrieval precision from the index to the query author. Hybrid retrieval pays for precision twice — at indexing time (dense embeddings, HNSW graphs, sometimes graph construction) and at query time (vector similarity fused with BM25). LogicalRAG eliminates both: the frontier LLM that already plans multi-hop questions decomposes the question into Boolean predicates over fielded terms, and the inverted index looks up rather than guesses (Zeng et al., 2026).
Hallucination reduction follows the same mechanism. A Boolean empty set is a sharp not-found signal, whereas a low cosine score is ambiguous: it can mean either "no relevant document exists" or "a relevant document exists but was paraphrased." The Boolean result leaves no such ambiguity about whether anything matched.
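The difference between the two signals can be illustrated with a small sketch. Both function names and the 0.5 threshold are assumptions made for this example; neither comes from the paper.

```python
def boolean_signal(matched_ids):
    """An empty match set needs no tuning: nothing matched, so abstain."""
    return "unanswerable" if not matched_ids else "answerable"

def cosine_signal(top_score, threshold=0.5):
    """A dense score needs a cut-off; the 0.5 default here is an arbitrary assumption."""
    return "unanswerable" if top_score < threshold else "answerable"

# The Boolean signal is binary by construction.
empty_result = boolean_signal(set())
non_empty_result = boolean_signal({17})

# Two nearly identical dense scores land on opposite sides of the cut-off.
just_below = cosine_signal(0.49)
just_above = cosine_signal(0.51)
```

The Boolean check has no parameter to tune, while the dense check depends on a threshold that has to be calibrated per corpus and per embedder.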
The result fits a broader retrieval-side-dominance trend: retriever choice exerts more influence than generator choice for SE-task RAG with high identifier-query overlap (Ke et al., 2026), and tuned BM25 + frontier agent matches dense retrieval on deep-research benchmarks (Hsu, Yang, Lin, 2026). When the LLM in the loop is strong, retrieval precision migrates from the retriever to the query author.
When This Backfires
- Sub-frontier generator — weaker LLMs cannot plan Boolean decompositions. The same BM25 index that supports 83.1% accuracy under a frontier agent supports 3.86% under Search-R1 on BrowseComp-Plus (Hsu, Yang, Lin, 2026; Chen et al., 2025). The pattern is a precision-cost migration, not a free optimisation.
- Semantic-gap queries: natural-language paraphrases against identifier-heavy documents ("deduplicate while preserving order" → `unique_ordered`) have near-zero lexical overlap. Logical operators cannot bridge the gap without a thesaurus or expansion step.
- Synonym-heavy corpora: medical, legal, multilingual, and consumer-product corpora where the same concept has many surface forms. BM25's insensitivity to synonymy is well documented; agents author speculative `OR` chains to compensate.
- Static-index, query-rate-dominated workloads: when the index is built once and serves billions of queries, the 41× build-time win amortises to zero and the medium-scale 2.3-point accuracy gap dominates.
- LLM-time dominates retrieval-time — every logical query is an inference call. On tail-latency-sensitive workloads, dense retrieval with a single round-trip can beat multi-turn Boolean refinement.
- Medium-scale multi-hop QA — the regime where the pattern trails hybrid in the source paper (0.784 vs. 0.807). A viable, cheaper baseline at this scale, not a winning one (Zeng et al., 2026).
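The thesaurus or expansion step mentioned for semantic-gap and synonym-heavy cases could look roughly like the sketch below. The `SYNONYMS` table and the helper names are hypothetical; a real deployment would need a curated, domain-specific thesaurus, and the output syntax simply mirrors the query examples used in this article.

```python
# Hypothetical synonym table; entries are invented for illustration.
SYNONYMS = {
    "deduplicate": ["dedupe", "unique", "distinct"],
    "heart attack": ["myocardial infarction", "MI"],
}

def quote(term):
    """Quote multi-word terms so they are matched as exact phrases."""
    return f'"{term}"' if " " in term else term

def expand(term, synonyms=SYNONYMS):
    """Return the term alone, or a parenthesised OR chain over its known surface forms."""
    forms = [term] + synonyms.get(term, [])
    if len(forms) == 1:
        return quote(term)
    return "(" + " OR ".join(quote(f) for f in forms) + ")"

def build_query(required, excluded=()):
    """AND together the expanded required terms, then append NOT clauses."""
    parts = [" AND ".join(expand(t) for t in required)]
    parts += [f"NOT {quote(t)}" for t in excluded]
    return " ".join(parts)

query = build_query(["heart attack", "aspirin"], excluded=["pediatric"])
# query == '("heart attack" OR "myocardial infarction" OR MI) AND aspirin NOT pediatric'
```

Expansion widens recall only for surface forms the table already knows, so it mitigates rather than removes the synonym problem described above.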
Example
A team running an agentic RAG system over 10M technical-documentation pages, frontier LLM in the loop, index rebuilt nightly to track product churn.
Before — agentic hybrid with dense + BM25 fusion:
retrieval:
  type: agentic-hybrid
  dense:
    embedder: text-embedding-3-large
    vector_db: managed-hnsw
  sparse:
    backend: bm25
  fusion: reciprocal-rank
  rerank: bge-reranker-v2-m3
indexing:
  nightly_build_hours: 38
  monthly_infra_usd: 18000
agent:
  query_pattern: free-text
After — LLM-authored Boolean queries over inverted index:
retrieval:
  type: logical
  backend: inverted-index
  operators: [AND, OR, NOT, "quoted phrases", "field:value"]
  rank: bm25
indexing:
  nightly_build_hours: 0.9
  monthly_infra_usd: 1100
agent:
  query_pattern: boolean-logical
  examples:
    - 'title:"rate limit" AND (429 OR "too many requests") NOT deprecated'
    - '"event_loop" AND asyncio NOT "twisted"'
The "after" configuration trades about 2 accuracy points (at medium scale only; accuracy matches at large scale) for a 42× reduction in nightly build time (38 h to 0.9 h) and roughly 3× lower query-path latency. The frontier LLM is unchanged; the migration replaces the retrieval interface, not the agent. Re-evaluate hallucination rate on a held-out unanswerable-query set before committing: the 0.083 vs. 0.128 hallucination delta is the second load-bearing benefit beyond raw cost (Zeng et al., 2026).
Key Takeaways
- LogicalRAG moves retrieval precision from the index to the query author: a frontier LLM emits AND/OR/NOT/field-scoped queries against a plain inverted index (Zeng et al., 2026).
- The pattern matches an agentic hybrid baseline at KILT-scale Wikipedia (0.717 vs. 0.716) and trails it on medium-scale multi-hop QA (0.784 vs. 0.807); the win is cost (41× faster indexing) and hallucination rate (0.083 vs. 0.128), not raw accuracy (Zeng et al., 2026).
- The Boolean "no match" gives a sharper unanswerable signal than a low cosine score, which is why hallucination on answer-unavailable queries drops materially (Zeng et al., 2026).
- Weaker generators cannot author useful Boolean decompositions: Search-R1 + BM25 collapses to 3.86% on BrowseComp-Plus, while a frontier agent + BM25 reaches 83.1% (Hsu, Yang, Lin, 2026; Chen et al., 2025). The pattern is a precision-cost migration, not a free optimisation.
- The pattern composes with the broader retrieval-side-dominance trend: retriever choice exerts more influence than generator choice for SE-task RAG when corpora have high lexical overlap (Ke et al., 2026).
Related
- Component-Wise RAG Prioritization for Software Engineering Tasks — retriever-dominance mechanism with BM25 as the SE-task default
- Lexical-First Retrieval for Agentic Search — independent evidence that BM25 + frontier agent matches dense retrieval in deep-research loops
- Schema-Guided Graph Retrieval — alternative structured-retrieval interface that pushes precision onto a typed schema rather than logical operators
- Structured Domain Retrieval — knowledge-graph + case-based retrieval that captures hierarchical relationships flat vector search misses
- Retrieval-Augmented Agent Workflows — the JIT-context pattern this retrieval interface plugs into