Air-Gapped RAG: Overview and When to Use It
Air-gapped RAG keeps every component of the retrieval-augmented generation stack — documents, embeddings, vector store, and inference — within a network boundary you control, satisfying regulations and threat models that cloud alternatives cannot.
What "Air-Gapped" Actually Means
The term covers a spectrum of network isolation, not a single standard:
| Level | Description | Example use case |
|---|---|---|
| Fully offline | No network interfaces connected to any external network. Data moves only via physical media. | Classified, SCIFs, industrial control systems |
| Internal network | Connected to an internal LAN but no internet egress. Outbound traffic is blocked at the perimeter. | Regulated enterprise, air-gapped R&D labs |
| DMZ deployment | Segmented between internal and external networks by two firewall layers. Controlled inbound only. | Government contractors, healthcare portals |
Most enterprise air-gapped RAG deployments are the "internal network" variant — internet-blocked but internally reachable. Fully offline deployments require sneakernet updates and are rare outside classified environments. Per Wikipedia's definition, a true air gap requires physical isolation from any externally connected network; practitioners often use the term loosely for any on-premises deployment.
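One way to check that a host really sits in the "internal network" tier is to probe for outbound connectivity from inside the boundary. The sketch below is a minimal illustration, not a substitute for a firewall audit. The function names `egress_report` and `is_isolated`, the probe targets, and the timeout are all assumptions chosen for the example.

```python
import socket
from typing import Callable, Iterable

# Hypothetical probe targets (public resolvers); replace with endpoints
# that matter in your environment.
DEFAULT_TARGETS = [("1.1.1.1", 443), ("8.8.8.8", 53)]


def egress_report(
    targets: Iterable[tuple[str, int]] = DEFAULT_TARGETS,
    timeout: float = 2.0,
    connect: Callable = socket.create_connection,
) -> dict[tuple[str, int], bool]:
    """Return {target: reachable}. True means outbound traffic got through."""
    report = {}
    for target in targets:
        try:
            # create_connection raises OSError (including timeouts) on failure.
            conn = connect(target, timeout=timeout)
        except OSError:
            report[target] = False
        else:
            conn.close()
            report[target] = True
    return report


def is_isolated(report: dict) -> bool:
    """A host passes only if every probe failed."""
    return not any(report.values())
```

Calling `is_isolated(egress_report())` on a correctly configured host should return `True`. A passing result only shows that these specific TCP probes were blocked; it says nothing about other ports, protocols, or proxies, so treat it as a smoke test.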
When Air-Gapped Is Required
Air-gapped deployment is not a preference — in the following contexts it is a compliance or legal necessity:
Regulated industries with data residency mandates
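- HIPAA: Sending protected health information (PHI) to an external AI service endpoint is a disclosure under HIPAA and is permitted only with safeguards such as a business associate agreement (BAA) with the provider. On-premises deployment keeps PHI within the covered entity's infrastructure.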
- ITAR / FedRAMP: Organizations handling controlled unclassified information (CUI) or ITAR-regulated content are often explicitly prohibited from using commercial cloud AI services. Source: ITAR/FedRAMP requirements for CUI handling
- GDPR: Cross-border data transfers require Standard Contractual Clauses or adequacy decisions. Any query against a dataset containing PII that routes through a non-EU provider is a potential violation. Source: GDPR cross-border transfer rules
Classified and sensitive environments
- Classified environments (SECRET, TS/SCI) have no lawful cloud option — the data cannot leave the accredited enclave.
- Sensitive IP (unreleased patents, M&A due diligence, trade secrets) where even BAAs and DPAs leave residual risk.
Edge and disconnected deployments
- Devices operating without reliable internet: field equipment, vessels, remote sites, embedded systems.
- Real-time operational requirements where cloud latency (100–500ms per LLM call) is unacceptable.
When Air-Gapped Is Over-Engineering
Air-gapped RAG carries substantial setup and maintenance costs. Avoid it when:
- The documents contain no regulated data and your cloud provider's BAA/DPA covers applicable obligations.
- Privacy concerns are addressed by the cloud provider's contractual terms — minor sensitivity does not warrant full isolation.
- The primary motivation is control or cost, not a regulatory or threat model requirement — hybrid architectures (on-premises retrieval, private-cloud inference) often address these at lower operational cost.
- You are prototyping or in early evaluation — cloud RAG has a 1–2 day setup vs. weeks for a hardened on-premises stack.
The cost of getting this wrong runs in both directions: deploying cloud RAG when regulation requires local deployment creates legal liability; deploying air-gapped RAG when it is unnecessary creates operational overhead that compounds across every maintenance cycle.
Threat Model
Air-gapped RAG addresses a specific threat model. Understanding its scope prevents both under-investment and false confidence.
What it defends against
- Data exfiltration via the inference API: queries and retrieved content never leave your network boundary.
- Third-party model provider data retention: cloud providers typically retain request logs, with retention windows that depend on the provider's DPA and contract tier. Your queries are part of those logs unless a zero-retention agreement is in place.
- Supply chain risk from cloud model updates: a provider can silently modify model behavior; a pinned local model version does not change without your action.
- Internet-facing attack surface: a RAG system that is reachable only from the internal network cannot be queried from the public internet.
What remains in scope
- Insider threat: a user with legitimate access to the local system can exfiltrate documents or query results.
- Embedding leakage within the perimeter: if your retrieval layer is accessible to multiple internal services, embedding similarity queries can leak document structure. Source: Privacy-Aware RAG, arxiv 2503.15548
- Physical access: fully offline systems are vulnerable to physical media attacks.
- Prompt injection through ingested documents: malicious content in the document corpus can manipulate retrieval and generation — isolation from the internet does not neutralize this vector. See Prompt Injection Threat Model.
Cost and Quality Tradeoffs
Setup and maintenance burden
A cloud RAG stack (OpenAI embeddings + Pinecone + GPT-4) can be functional in a day. An equivalent air-gapped stack requires: hardware procurement or VM provisioning, OS hardening, container orchestration, local model download and serving configuration, vector store setup, document pipeline construction, and an operational runbook. The ICSA 2026 on-premises RAG blueprint estimates the architecture at six distinct service components with separate scaling and maintenance profiles. Source: On-Premises RAG Blueprint, arxiv 2604.01395
Model quality
The quality gap between local and cloud models has narrowed substantially:
- Embeddings: `nomic-embed-text-v1.5` (137M parameters, fully open-source) matches or exceeds OpenAI's `text-embedding-ada-002` and `text-embedding-3-small` on MTEB short and long-context benchmarks. Source: Nomic AI
- Generation: 7B–13B parameter models on consumer or workstation GPUs produce acceptable results for structured document Q&A. A 7B Q4-quantized model requires approximately 5GB VRAM at 4K context. Source: Local LLM guide
- Remaining gap: Frontier tasks — multi-step reasoning, cross-document synthesis, ambiguous queries — still show a measurable quality gap vs. GPT-4 class cloud models. For well-scoped enterprise document Q&A, this gap is often acceptable.
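The VRAM figure quoted above can be sanity-checked with back-of-the-envelope arithmetic: quantized weights plus the key-value cache. The sketch below rests on several assumptions that are not from the source: roughly 4.5 bits per weight for a 4-bit format that stores per-block scales, an fp16 KV cache, and a 7B-class shape of 32 layers with head dimension 128. Runtime overhead such as activations and scratch buffers is ignored.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for quantized weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # fp16 cache (assumption)
) -> int:
    """Memory for the KV cache; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def estimate_vram_gb(
    n_params: float = 7e9,
    bits_per_weight: float = 4.5,  # assumed average for a Q4-class format
    n_layers: int = 32,
    n_kv_heads: int = 32,
    head_dim: int = 128,
    context_len: int = 4096,
) -> float:
    """Weights plus KV cache in decimal gigabytes, excluding overhead."""
    total = weights_bytes(n_params, bits_per_weight) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, context_len
    )
    return total / 1e9
```

Under these assumptions, `estimate_vram_gb()` gives about 6.1 GB with 32 KV heads and `estimate_vram_gb(n_kv_heads=8)` gives about 4.5 GB for a grouped-query-attention variant, which brackets the approximate 5GB figure. Actual usage depends on the model architecture, quantization format, and serving runtime.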
Hardware cost
| Scale | Minimum hardware | Approximate cost |
|---|---|---|
| Single developer / prototype | 16GB RAM, 8GB VRAM GPU | ~$1,500 (GPU) |
| Small team (10 concurrent users) | 64GB RAM, 24GB VRAM GPU | ~$5,000–$8,000 |
| Production (100+ concurrent users) | Multi-GPU server, NVMe storage | Multi-GPU server class — confirm with vendor quotes before budgeting |
Hardware costs are one-time but maintenance, power, and operations are ongoing. Compare against cloud API costs at expected query volume before committing.
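The comparison against cloud API costs reduces to a break-even calculation. The sketch below is illustrative only: the function name `break_even_months` is hypothetical, and any hardware price, operating cost, per-query price, or query volume you plug in must come from your own quotes and usage data.

```python
from typing import Optional


def break_even_months(
    hardware_cost: float,
    local_monthly_opex: float,
    cloud_cost_per_query: float,
    queries_per_month: float,
) -> Optional[float]:
    """Months until one-time hardware spend is recovered by avoided cloud fees.

    Returns None when local operating cost meets or exceeds the cloud bill,
    meaning the hardware never pays for itself on cost alone.
    """
    cloud_monthly = cloud_cost_per_query * queries_per_month
    monthly_saving = cloud_monthly - local_monthly_opex
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving
```

With placeholder inputs of $6,000 in hardware, $300 per month in power and operations, $0.01 per cloud query, and 50,000 queries per month, the function returns roughly 30 months. At 10,000 queries per month with the same inputs it returns `None`, because the cloud bill is lower than the local operating cost. Regulatory requirements, where they apply, override this arithmetic.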
Pathway Overview
This module is the opening unit of a nine-module series. Each module is a 60–90 minute hands-on session covering one layer of the stack:
- Overview and When to Use It ← this module
- Architecture Fundamentals — components, data flow, deployment topology
- Document Ingestion and Parsing — PDF, Word, HTML at scale without cloud OCR
- Chunking Strategies — fixed, semantic, hierarchical, and their retrieval tradeoffs
- Local Embeddings and Vector Stores — model selection, ChromaDB, LanceDB, Milvus
- Retrieval and Re-Ranking — BM25, dense retrieval, hybrid, cross-encoders
- Local LLM Inference — Ollama, vLLM, llama.cpp, hardware sizing
- Grounding, Citations, and Evaluation — source attribution, faithfulness scoring, evals
- Deployment, Operations, and Compliance — logging, access control, audit trails
All modules use only locally-runnable tools. No cloud API calls appear anywhere in the series.
Example
A law firm stores client contracts and case documents. Counsel wants to query this corpus using natural language. The constraints: attorney-client privilege prohibits third-party processing; ABA Formal Opinion 512 and state-bar guidance treat sending client confidences to a third-party generative AI tool as a disclosure event that requires informed consent and adequate safeguards.
The deployment is a single Haystack pipeline, matching the series reference stack:
- Framework: `haystack-ai` 2.x. One `Pipeline` wires every stage and serializes to a YAML file that bar-association counsel can audit alongside the firm's other written policies
- Document ingestion: `PyPDFToDocument` and `DOCXToDocument` converters for the baseline, `docling` for complex filings, all running in a Hayhooks container
- Embeddings: `nomic-embed-text-v1.5` via `SentenceTransformersDocumentEmbedder`, weights pre-downloaded to the firm's on-premises server
- Vector store: Qdrant in local persistent mode with both dense and sparse vectors enabled (`QdrantDocumentStore`)
- LLM inference: `qwen2.5:7b` (Q4_K_M) served by Ollama, invoked via `OllamaGenerator`, with no internet access configured at the OS level
- Evaluation: `FaithfulnessEvaluator` backed by the same local LLM, run nightly against a golden query set of prior attorney-reviewed answers
A query like "Which contracts include arbitration clauses expiring before 2027?" flows through the Haystack pipeline entirely on the firm's network. The pipeline YAML is version-controlled in the firm's document-management system; each signed container release ships with a matching SBOM so the firm's IT auditor can map every dependency to an approved software list.
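The mapping from SBOM entries to an approved software list can be automated. The sketch below assumes the SBOM is CycloneDX JSON, where a top-level `components` array lists entries with `name` and `version` fields. The source does not say which SBOM format the firm uses, and the function names and the approved-list shape (a set of `(name, version)` pairs) are hypothetical.

```python
import json


def load_sbom(path: str) -> dict:
    """Read an SBOM JSON document from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def unapproved_components(
    sbom: dict, approved: set[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return sorted (name, version) pairs that are not on the approved list.

    Only top-level components are checked; nested components are ignored
    in this sketch.
    """
    found = []
    for component in sbom.get("components", []):
        key = (component.get("name", ""), component.get("version", ""))
        if key not in approved:
            found.append(key)
    return sorted(found)
```

An auditor could run `unapproved_components(load_sbom("release.sbom.json"), approved)` for each container release, where the file name is a placeholder, and block the release if the returned list is non-empty. Matching on exact version strings is a deliberately strict choice; a real policy might allow version ranges.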
Key Takeaways
- "Air-gapped" covers a spectrum: fully offline, internal-network-only, and DMZ deployments each have different operational profiles and appropriate use cases.
- Air-gapped is required when regulation explicitly prohibits third-party data processing (HIPAA PHI, ITAR CUI, classified environments) or when the threat model demands it — not as a default for privacy preferences.
- The quality gap between local and cloud models has narrowed: `nomic-embed-text-v1.5` matches OpenAI's older embedding models on MTEB; 7B parameter models are viable for structured document Q&A.
- Setup and maintenance costs are substantially higher than cloud RAG, and the operational overhead compounds. Quantify this before committing.
- Air-gapped isolation does not neutralize all vectors: insider threat, prompt injection through ingested documents, and embedding leakage within the perimeter remain in scope.