Air-Gapped RAG: Local Embeddings and Vector Stores
Choose, run, and persist embeddings entirely on your hardware — no cloud API calls, no vendor lock-in.
This is the "no cloud dependencies" stage of the air-gapped RAG pipeline. Every embedding is generated locally; every vector lives in local storage. The main decision points are which embedding model to run and which vector store to use. Both involve real trade-offs between retrieval quality, storage cost, compute, and operational complexity.
Runnable code wires embedders and document stores into a Haystack indexing pipeline. Every embedder is a SentenceTransformersDocumentEmbedder variant; every store is a Haystack DocumentStore via its dedicated integration package. Swapping any of them is a one-component edit.
Embedding Models
sentence-transformers is the standard Python library for local embedding inference. It wraps any Hugging Face model with a two-line API and handles tokenization, batching, and device placement. Haystack's SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder are thin Component wrappers around it — install both with pip install haystack-ai sentence-transformers.
Model Comparison
Five models cover most air-gapped deployments:
| Model | Dims | Context | Notes |
|---|---|---|---|
| bge-large-en-v1.5 | 1024 | 512 | Top MTEB for its size class; English-only |
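| nomic-embed-text-v1.5 | 768 | 8192 | Long context; Matryoshka support; Apache 2.0; expects search_document: / search_query: task prefixes |
| e5-large-v2 | 1024 | 512 | Strong retrieval; expects query: / passage: input prefixes |
| e5-mistral-7b-instruct | 4096 | 32768 | Among the highest MTEB scores at release; requires ~14GB VRAM in fp16; model card advises against inputs over 4096 tokens |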
| multilingual-e5-large | 1024 | 512 | 100+ languages; use when corpus is multilingual |
Sources: BAAI/bge-large-en-v1.5 paper (arXiv 2309.07597), E5 model variants (microsoft/unilm).
Decision rule: start with nomic-embed-text-v1.5 (the series reference stack) for English corpora on CPU or modest GPU. bge-large-en-v1.5 is a strong alternative when 1024 dimensions are acceptable and you do not need long context. Use e5-mistral-7b-instruct only when retrieval quality is the primary constraint and 14GB+ VRAM is available.
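The decision rule can be written down as a small function, which is handy when model selection has to be reviewed or pinned in configuration. This is a sketch only: `pick_embedding_model` and its parameters are hypothetical names, and the `intfloat/` Hub identifiers for the E5 models are not taken from the text above.

```python
def pick_embedding_model(
    *,
    multilingual: bool = False,
    quality_first: bool = False,
    vram_gb: float = 0.0,
    needs_long_context: bool = False,
    accepts_1024_dims: bool = False,
) -> list[str]:
    """Return candidate models, most preferred first (hypothetical helper)."""
    if multilingual:
        # Multilingual corpora get the multilingual E5 model.
        return ["intfloat/multilingual-e5-large"]

    candidates = []
    if quality_first and vram_gb >= 14:
        # Only worth it when quality dominates and the GPU can hold the 7B model.
        candidates.append("intfloat/e5-mistral-7b-instruct")

    # Series reference stack: the default for English corpora.
    candidates.append("nomic-ai/nomic-embed-text-v1.5")

    if accepts_1024_dims and not needs_long_context:
        # Alternative when 1024-d vectors are fine and 512 tokens is enough.
        candidates.append("BAAI/bge-large-en-v1.5")
    return candidates


print(pick_embedding_model(accepts_1024_dims=True))
print(pick_embedding_model(quality_first=True, vram_gb=24))
```

The function returns a list rather than a single name so that the alternative stays visible next to the default.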
Dimensionality Trade-offs
Higher dimensions store more semantic information and typically improve retrieval quality — but at a direct cost to storage and compute.
| Dims | Raw storage per 1M vectors | Compute | Use case |
|---|---|---|---|
| 384 | ~1.5 GB | Fastest | Prototyping, resource-constrained edge |
| 768 | ~3 GB | Fast | Standard production |
| 1024 | ~4 GB | Moderate | Quality-sensitive retrieval |
| 4096 | ~16 GB | Slow | Maximum quality, GPU required |
Storage figures are for float32 vectors. Using float16 (half precision) halves storage and delivers retrieval recall within a fraction of a percent of float32 on standard benchmarks — see SingleStore's float16 vector type benchmark reporting near-identical recall and the CoRECT evaluation framework (arXiv 2510.19340) for a broader compression comparison.
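The table values follow from vectors × dimensions × bytes per value. A minimal sketch of that arithmetic, assuming decimal gigabytes and counting only the raw vector payload (index structures and metadata add overhead that depends on the store); `raw_vector_storage_gb` is a hypothetical helper name:

```python
BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def raw_vector_storage_gb(num_vectors: int, dims: int, dtype: str = "float32") -> float:
    """Raw vector payload in decimal GB (1 GB = 10**9 bytes), excluding index overhead."""
    return num_vectors * dims * BYTES_PER_VALUE[dtype] / 1e9


for dims in (384, 768, 1024, 4096):
    fp32 = raw_vector_storage_gb(1_000_000, dims)
    fp16 = raw_vector_storage_gb(1_000_000, dims, "float16")
    print(f"{dims:>4} dims: {fp32:.2f} GB float32, {fp16:.2f} GB float16")
```

Note that the unit is vectors, not source documents: a corpus split into chunks stores one vector per chunk.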
Matryoshka Embeddings
Some models support Matryoshka Representation Learning (MRL), which encodes information at coarse-to-fine granularities within a single vector. You can truncate the embedding to a smaller dimension with minimal quality degradation, for example 256 dimensions instead of 768 for a 3x storage reduction. The same truncation must be applied at index time and at query time.
nomic-embed-text-v1.5 explicitly supports MRL. Haystack's embedders expose truncation directly through the truncate_dim parameter:
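from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.utils import ComponentDevice

embedder = SentenceTransformersDocumentEmbedder(
    model="nomic-ai/nomic-embed-text-v1.5",
    prefix="search_document: ",  # task prefix the nomic model card asks for on indexed text
    truncate_dim=256,  # reduce from 768 to 256 for storage savings
    trust_remote_code=True,  # the model ships custom modeling code
    device=ComponentDevice.from_str("cpu"),  # Haystack 2.x expects a ComponentDevice, not a plain string
)
embedder.warm_up()  # load the model weights before first use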
Matching the dimension on the QdrantDocumentStore side is mandatory — truncate to 256 at embed time, create the collection with embedding_dim=256. Changing the truncate_dim mid-pipeline invalidates the entire index.
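Conceptually, Matryoshka truncation keeps the leading components of the vector and re-normalizes what is left, so that cosine and dot-product scores stay comparable. The pure-Python sketch below illustrates that idea on a toy vector; it is not what sentence-transformers runs internally, and `truncate_and_normalize` is a hypothetical helper name.

```python
import math


def truncate_and_normalize(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError(f"dim must be in 1..{len(vector)}, got {dim}")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy 4-d unit vector standing in for a 768-d embedding
short = truncate_and_normalize(full, 2)
print(len(short), sum(x * x for x in short))
```

Document and query vectors both have to go through the same truncation, which is why the query-side embedder needs the same truncate_dim as the indexing side.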
Vector Stores
Five stores cover the local deployment spectrum from simple to production-ready:
| Store | Deployment | Persistence | Metadata Filter | Hybrid Search | Notes |
|---|---|---|---|---|---|
| Chroma | In-process | SQLite | Yes | No | Simplest dev experience |
| Qdrant | In-process or server | On-disk | Yes | Yes (sparse+dense) | Best filtering granularity; series reference |
| LanceDB | In-process | Lance columnar | Yes | Yes (FTS+vector) | Lowest disk footprint per vector |
| FAISS | Library only | Manual serialization | No | No | Fastest raw ANN; no built-in ops |
| Weaviate | Docker | On-disk | Yes | Yes (BM25+vector) | Heaviest operational footprint |
Qdrant (series reference stack)
Qdrant runs in-process via its Python client and is the vector store the rest of this series builds on. Hybrid retrieval (Module 6) and the deployment stack (Module 9) both assume a Qdrant collection. Haystack wraps it via qdrant-haystack — install with pip install qdrant-haystack:
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    path="/data/qdrant_db",  # local persistent mode; pass url="..." for server mode
    index="documents",
    embedding_dim=768,  # matches nomic-embed-text-v1.5
    use_sparse_embeddings=True,  # enables dense + sparse hybrid retrieval in Module 6
    recreate_index=False,  # never clobber an existing index by accident
)
Qdrant supports sparse vectors alongside dense vectors, enabling hybrid retrieval that combines learned sparse (SPLADE-style) lexical matching with semantic search, without a separate keyword index. Filtering runs on structured JSON payload fields with must, should, and must_not clauses, the most expressive filtering API in this group. Haystack exposes it through the retriever's filters parameter, which takes Haystack's own filter syntax and translates it into Qdrant clauses.
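As an illustration, a Haystack 2.x filter is a nested dictionary of logical operators and comparison clauses. The sketch below builds one with a small helper; `condition` is a hypothetical helper name and the metadata fields (`meta.department`, `meta.year`, `meta.status`) are made up for the example.

```python
def condition(field: str, operator: str, value) -> dict:
    """Build one comparison clause in Haystack 2.x filter syntax."""
    return {"field": field, "operator": operator, "value": value}


# Hypothetical metadata fields; replace them with the keys your documents carry.
filters = {
    "operator": "AND",
    "conditions": [
        condition("meta.department", "==", "legal"),
        condition("meta.year", ">=", 2023),
        condition("meta.status", "!=", "draft"),
    ],
}

# Typical use, assuming a Qdrant retriever has already been constructed:
#   retriever.run(query_embedding=query_vector, filters=filters)
print(filters)
```

Keeping filters as plain dictionaries means they can be stored next to the serialized pipeline and reviewed in the same audit.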
Best for: deployments that need production-level filtering, hybrid retrieval, or will eventually scale to a server deployment. Configure use_sparse_embeddings=True at QdrantDocumentStore construction time — adding sparse vectors later requires a collection migration outside Haystack.
Chroma
Chroma runs in-process with zero configuration. Haystack wraps it via chroma-haystack:
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

document_store = ChromaDocumentStore(
    persist_path="/data/chroma_db",
    collection_name="documents",
    distance_function="cosine",
)
Best for: prototyping and single-machine deployments where operational simplicity matters more than query throughput. Note: ChromaDocumentStore does not expose sparse vectors, so hybrid retrieval in Module 6 requires running a keyword retriever (e.g., InMemoryBM25Retriever over an InMemoryDocumentStore holding the same documents) alongside the Chroma store and fusing the two result lists with DocumentJoiner.
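To make the fusion step concrete, the sketch below implements reciprocal rank fusion, one of the join modes DocumentJoiner offers, in plain Python. It shows the standard formulation with the conventional constant k = 60 and is not Haystack's own implementation; the document ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]    # e.g. from the Chroma embedding retriever
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # e.g. from the BM25 retriever
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print([doc_id for doc_id, _ in fused])
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales.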
LanceDB
LanceDB stores vectors in the columnar Lance format, which enables zero-copy reads and efficient analytics queries over the embedding corpus. A community-maintained lancedb-haystack package is listed on the Haystack integrations catalog — it is not a first-party deepset integration, so track its release cadence before depending on it in production.
Best for: large corpora where storage efficiency matters, or when you need SQL-style filtering over metadata. For a Haystack-native alternative with similar strengths, use Qdrant's server mode with payload_fields_to_index configured for your hot metadata fields.
FAISS
FAISS is a library for approximate nearest-neighbor search, not a database. It provides the fastest raw vector search (IVF, HNSW, PQ indexes) but has no built-in metadata storage, only index-dependent support for removing vectors, and no persistence beyond manually serializing the index with write_index and read_index.
Haystack does not wrap FAISS as a first-class DocumentStore in 2.x — the older FAISSDocumentStore from Haystack 1.x is not available in 2.x. Use InMemoryDocumentStore for small corpora, or Qdrant for anything larger; FAISS is only worth the wrapping cost when raw search throughput is the dominant constraint.
Example: Full Indexing Pipeline
The reference stack's indexing pipeline, end-to-end. This is the code you run once per corpus refresh — parse, clean, chunk, embed (dense + sparse), and write to Qdrant:
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersSparseDocumentEmbedder,
)
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import ComponentDevice
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Haystack 2.x expects a ComponentDevice, not a plain string
cpu = ComponentDevice.from_str("cpu")

document_store = QdrantDocumentStore(
    path="/data/qdrant_db",
    index="documents",
    embedding_dim=768,  # nomic-embed-text-v1.5
    use_sparse_embeddings=True,
    recreate_index=False,
)

indexing = Pipeline()
indexing.add_component("converter", PyPDFToDocument())
indexing.add_component("cleaner", DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
))
indexing.add_component("splitter", DocumentSplitter(
    split_by="sentence", split_length=5, split_overlap=1,
))
indexing.add_component("dense_embedder", SentenceTransformersDocumentEmbedder(
    model="nomic-ai/nomic-embed-text-v1.5",
    prefix="search_document: ",  # task prefix the nomic model card asks for on indexed text
    device=cpu,
    trust_remote_code=True,  # the model ships custom modeling code
))
indexing.add_component("sparse_embedder", SentenceTransformersSparseDocumentEmbedder(
    model="prithivida/Splade_PP_en_v1",
    device=cpu,
))
indexing.add_component("writer", DocumentWriter(document_store=document_store))

# Linear flow: each stage passes its documents to the next
indexing.connect("converter.documents", "cleaner.documents")
indexing.connect("cleaner.documents", "splitter.documents")
indexing.connect("splitter.documents", "dense_embedder.documents")
indexing.connect("dense_embedder.documents", "sparse_embedder.documents")
indexing.connect("sparse_embedder.documents", "writer.documents")

# Run against a directory of PDFs
pdfs = [str(p) for p in Path("corpus").glob("*.pdf")]
indexing.run({"converter": {"sources": pdfs}})

# Serialize the pipeline for audit
Path("pipelines").mkdir(exist_ok=True)
with open("pipelines/indexing.yaml", "w") as f:
    f.write(indexing.dumps())
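Dimension 768 matches nomic-embed-text-v1.5. If you swap to bge-large-en-v1.5, change embedding_dim on the QdrantDocumentStore to 1024, re-create the index, and re-embed the corpus. Embedding dimension is the single tightest coupling in the pipeline: both values must move together, or Qdrant rejects the vectors with a dimension error. The failure that does stay silent is swapping to a different model with the same dimension without re-indexing, which returns nonsense rankings.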
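A cheap guard is to check the coupling before any documents are written. The sketch below is a stand-alone pre-flight check under the assumption that you know the model's native dimension and any truncate_dim setting; `effective_dim` and `assert_dims_match` are hypothetical helper names, not Haystack APIs.

```python
from typing import Optional


def effective_dim(model_dim: int, truncate_dim: Optional[int] = None) -> int:
    """Dimension the embedder actually emits after optional Matryoshka truncation."""
    if truncate_dim is None:
        return model_dim
    if not 0 < truncate_dim <= model_dim:
        raise ValueError(f"truncate_dim must be in 1..{model_dim}, got {truncate_dim}")
    return truncate_dim


def assert_dims_match(model_dim: int, store_dim: int, truncate_dim: Optional[int] = None) -> None:
    """Fail fast when the embedder output and the store's embedding_dim disagree."""
    produced = effective_dim(model_dim, truncate_dim)
    if produced != store_dim:
        raise ValueError(
            f"embedder produces {produced}-d vectors but the store expects {store_dim}-d"
        )


assert_dims_match(768, 768)                    # nomic-embed-text-v1.5, untruncated
assert_dims_match(768, 256, truncate_dim=256)  # Matryoshka-truncated variant
```

Running a check like this at pipeline start-up turns a misconfiguration into an immediate error instead of a failed indexing run partway through the corpus.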
Key Takeaways
- Haystack's SentenceTransformersDocumentEmbedder + SentenceTransformersTextEmbedder pair wraps all major open embedding models with no cloud dependencies; the split lets you add query-side instruction prefixes without touching the indexing side
- Start with nomic-embed-text-v1.5 for balanced quality/compute and long context; bge-large-en-v1.5 is a strong alternative for English-only at 1024 dimensions; reserve e5-mistral-7b-instruct for GPU-rich deployments where retrieval quality dominates
- Matryoshka-capable models (nomic-embed-text-v1.5) let you trade a little retrieval quality for storage and speed via truncate_dim, but the dimension choice is permanent for a given index: retruncating invalidates it
- QdrantDocumentStore with use_sparse_embeddings=True is the reference stack; ChromaDocumentStore is simpler but drops sparse support; Weaviate, Elasticsearch, and PGVector are available as first-party Haystack integrations when the deployment constraints demand them
- FAISS is not a Haystack 2.x first-class store; InMemoryDocumentStore is the lightweight path for prototyping, Qdrant for anything production-shaped
- Haystack pipelines serialize to YAML; the full indexing pipeline above becomes ~40 lines of pipelines/indexing.yaml that a security reviewer can audit end-to-end