Semantic Cache — How It Really Works
A semantic cache stores the result of a query and returns it when a semantically similar (not necessarily identical) query comes in. Unlike a key-value cache, where the key must match exactly, a semantic cache uses vector similarity to find near-matches.
Core Mechanism
# Pseudocode: semantic cache lookup
def query_with_cache(user_query):
    q_emb = embed(user_query)
    # search the cache (it's a small vector store)
    hit = cache_index.search(q_emb, top_k=1)
    if hit is not None and hit.score >= 0.95:  # similarity threshold
        return hit.cached_response  # no LLM call; the lookup itself is ~5ms
    # cache miss: run the full pipeline
    response = full_rag_pipeline(user_query, q_emb)
    cache_index.upsert(q_emb, response)  # store for next time
    return response
0.95+ = very strict, only near-identical queries match. 0.85 = broader match, risk of wrong cached answer. For factual RAG: stay at 0.92–0.95. For conversational bots: 0.88–0.92 is often fine.
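To make the threshold behaviour concrete, here is a minimal in-memory sketch of the lookup. It is an illustration under stated assumptions, not a production design: `SemanticCache` and `cosine` are hypothetical names, the vectors are hand-written stand-ins for real embeddings, and the linear scan stands in for the ANN search a real vector store would perform.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    threshold: float = 0.95
    entries: list = field(default_factory=list)  # (embedding, response) pairs

    def lookup(self, q_emb):
        """Return (response, score) if the best entry clears the threshold, else (None, score)."""
        best_score, best_response = 0.0, None
        for emb, response in self.entries:  # linear scan; a real store would use ANN
            score = cosine(q_emb, emb)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response, best_score
        return None, best_score

    def store(self, q_emb, response):
        self.entries.append((list(q_emb), response))


cache = SemanticCache(threshold=0.95)
cache.store([1.0, 0.0, 0.0], "Refunds are accepted within 30 days.")

hit, score = cache.lookup([0.99, 0.05, 0.0])       # almost the same direction
miss, miss_score = cache.lookup([0.6, 0.8, 0.0])   # clearly different direction
```

Returning the score alongside the response makes it easy to log every hit with its similarity, which is what you need when tuning the threshold later.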
The Big Question: Does Chat History Break Semantic Cache?
Yes — if you're naive about it. If you cache raw user messages without context, you'll return wrong cached answers. The same message "What is the refund period?" means different things depending on chat history.
There are two strategies:
Strategy A — Cache Condensed Standalone Queries
Convert the multi-turn history into a single standalone query first (LLM rewrite), then cache that. The cache key is the standalone query embedding, not the raw user message.
standalone = llm("""Given chat history:
User: "What products do you sell?"
Bot: "We sell SaaS subscriptions..."
User: "What is the refund period?"
Rewrite the last question as a fully standalone question.""")
# → "What is the refund period for your SaaS subscriptions?"
response = query_with_cache(standalone)
Strategy B — Cache Only User-Independent Queries
Don't cache personal, session-specific, or history-dependent queries at all. Only cache queries that are truly universal: "What is the company's vacation policy?" is cacheable. "What is MY leave balance?" is not.
Where Semantic Cache Actually Helps — Real Scenarios
FAQ Repetition
"How do I reset my password?" vs "forgot my password steps" vs "change password guide" — all hit the same cached answer. Massive win for support bots.
Policy Questions
"What is the PTO policy?" / "How many vacation days do employees get?" → same cache hit. 40–60% of enterprise knowledge base queries are variations of the same few dozen questions.
Product Discovery
"Cheapest laptop under 50k" vs "budget laptops below 50000 rupees" → same answer. Cache works beautifully here.
Definition Queries
"What is hypertension?" vs "define high blood pressure" → safe to cache. But "Should I take metoprolol?" is user-specific — never cache.
Common Patterns
"How to read a file in Python?" asked by 1000 developers. Cache once, serve all. Cost savings are enormous.
Report Queries
"Show last month's revenue" — if it was expensive to compute, cache for 30 minutes. Time-bounded cache (TTL) + semantic similarity.
Cache Invalidation — When to Bust
| Trigger | Action | Example |
|---|---|---|
| New document indexed | Invalidate related cache entries | New policy doc → bust policy-related cache |
| Document updated | Tag-based invalidation | Price changes → bust product Q&A cache |
| TTL expiry | Time-based expiry | News/events: cache for 1 hour only |
| User feedback "wrong answer" | Delete specific cache entry | User flags incorrect answer |
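The triggers in the table map onto three operations: expire by time, drop by tag, and delete one entry. A minimal sketch, assuming an in-memory store keyed by an arbitrary cache key; `TaggedCache` and its method names are hypothetical, and the injectable clock exists only so the expiry logic can be tested without sleeping.

```python
import time


class TaggedCache:
    """In-memory cache with per-entry TTL and tag-based invalidation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (response, expires_at, tags)

    def put(self, key, response, ttl_seconds, tags=()):
        self._entries[key] = (response, self._clock() + ttl_seconds, frozenset(tags))

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        response, expires_at, _ = entry
        if self._clock() >= expires_at:  # TTL expiry
            del self._entries[key]
            return None
        return response

    def invalidate_tag(self, tag):
        """Drop every entry carrying the tag; returns how many were removed."""
        stale = [k for k, (_, _, tags) in self._entries.items() if tag in tags]
        for k in stale:
            del self._entries[k]
        return len(stale)

    def delete(self, key):
        """Remove one entry, e.g. after a user flags the answer as wrong."""
        return self._entries.pop(key, None) is not None
```

When a price changes, the ingestion job would call `invalidate_tag("product")`; a "wrong answer" flag would call `delete(key)` for that one entry.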
Semantic Cache vs Exact Cache
Exact Cache (Redis KV)
- Key = exact query string
- Hit only on identical queries
- O(1) lookup, near-zero overhead
- Useless for natural language variation
- Great for structured queries (API calls, SQL)
Semantic Cache
- Key = query embedding vector
- Hit on semantically similar queries
- ANN lookup, ~5–15ms overhead
- Handles natural language well
- Risk: false positive hits at low thresholds
In production: Use both. Exact cache (Redis) as first check (~0ms), semantic cache as second (~10ms), then full pipeline as fallback.
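The three-tier order can be sketched as a single function. This is a rough illustration, assuming the exact tier is a plain dict (standing in for Redis) and that `semantic_lookup` and `run_pipeline` are callables you supply; all three names are hypothetical.

```python
def tiered_lookup(query, exact_cache, semantic_lookup, run_pipeline):
    """Try exact match, then semantic match, then the full pipeline.

    exact_cache: dict mapping a normalized query string to a response
    semantic_lookup: callable(query) -> response or None
    run_pipeline: callable(query) -> response
    Returns (response, tier) so the serving tier can be logged.
    """
    key = " ".join(query.lower().split())  # cheap normalization for the exact tier
    if key in exact_cache:
        return exact_cache[key], "exact"

    response = semantic_lookup(query)
    if response is not None:
        exact_cache[key] = response  # promote: the next identical query skips the ANN lookup
        return response, "semantic"

    response = run_pipeline(query)
    exact_cache[key] = response  # storing into the semantic cache is left to run_pipeline
    return response, "pipeline"
```

Logging the returned tier per request gives you the hit-rate breakdown needed to judge whether each layer is earning its keep.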
"Semantic cache returned a wrong answer — how do you debug?" → Check the threshold. Print the similarity score of the hit. If 0.87 matched and gave wrong answer, raise threshold to 0.93. Also add a staleness TTL. Log all cache hits with query pairs for audit.
Other Caching Methods in RAG
Semantic cache solves the "same question phrased differently" problem. But there are several other caching layers in a RAG system — each targeting a different bottleneck.
1. Prefix / Prompt Caching (KV Cache)
Modern LLM APIs (Anthropic, OpenAI) support prompt prefix caching. If the beginning of your prompt (system prompt + retrieved docs) is identical across requests, the KV cache in the attention layers is reused. You only pay compute for the new part (the user question).
# Same system prompt + context repeated → cached by provider
system = "You are a helpful assistant. Use the following docs: [1000 tokens of context]"
# First call: full computation
response_1 = llm(system + "Question: What is the return policy?")
# Second call with same prefix: KV cache reused
# ~50-80% cheaper if prefix is long and identical
response_2 = llm(system + "Question: How long for refund?")
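Works best when you have a fixed large system prompt OR when you can serve multiple queries against the same retrieved document set (batch mode). Anthropic bills cache reads at about 10% of the normal input-token price (90% less); cache writes cost somewhat more than uncached input, so the saving depends on the prefix actually being reused.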
2. Embedding Cache
Embedding the same text twice is wasteful. Cache (text → vector) pairs in Redis or a local dict. Critical for the indexing pipeline where the same chunk might be re-processed multiple times.
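# Embedding cache with TTL
import hashlib
import json
import redis

r = redis.Redis()  # assumes a reachable Redis on the default host/port

def cached_embed(text):
    # hash the text so keys stay short and fixed-length
    key = "emb:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    vec = openai_embed(text)  # your embedding call; must return a list of floats
    r.setex(key, 86400, json.dumps(vec))  # 24h TTL
    return vec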
3. Retrieval Result Cache
Cache the vector search results (chunk IDs + content) for a query embedding, not just the final LLM answer. This is useful when you want fresh LLM generation but don't want to repeat expensive retrieval.
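# Cache retrieval results separately from the LLM response
import hashlib, json
# hash(q_emb) raises TypeError on a list, so derive a stable key from the vector
# (redis_client is a redis.Redis() connection; call .tolist() first on NumPy arrays)
emb_digest = hashlib.sha256(json.dumps(q_emb).encode("utf-8")).hexdigest()
chunks_key = f"retrieval:{emb_digest}"
cached = redis_client.get(chunks_key)
if cached is not None:
    chunks = deserialize(cached)  # inverse of serialize()
else:
    chunks = vector_search(q_emb)
    redis_client.setex(chunks_key, 300, serialize(chunks))  # 5 min TTL
# LLM call happens every time (freshness), but retrieval is cached
response = llm.generate(query, chunks)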
4. Query Normalization Before Caching
Before any cache lookup, normalize the query to reduce surface variation. This raises the hit rate, most of all for the exact-match layer, where any difference in case or punctuation is a miss.
| Technique | Example: Before → After | Effect |
|---|---|---|
| Lowercase | "What is RAG?" → "what is rag?" | Removes case variation |
| Punctuation strip | "How does RAG work?!" → "how does rag work" | Cleans noise |
| Stopword removal | "Can you tell me what RAG is" → "tell RAG" | Reduces semantic dilution |
| Spelling correction | "How does retreival work?" → "How does retrieval work?" | Unifies typos |
| LLM rewrite | Any phrasing → canonical form | Best quality, adds ~100ms |
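The first two rows of the table (lowercasing and punctuation stripping) need nothing beyond the standard library. A minimal sketch; `normalize_query` is a hypothetical helper name, and it handles ASCII punctuation only.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_query(query):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    lowered = query.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


normalized = normalize_query("How does RAG work?!")
```

One caveat worth stating: stripping punctuation also rewrites identifiers such as `INV-2024-9812`, so queries that carry IDs or version numbers are usually better served by lowercasing and whitespace collapsing alone.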
5. Document-Level Generation Cache
If you generate summaries, metadata, or structured extractions from documents during ingestion, cache them. Re-ingesting the same doc should not re-run expensive LLM passes.
# During ingestion pipeline
import hashlib

def process_document(doc_content):
    doc_hash = hashlib.sha256(doc_content.encode("utf-8")).hexdigest()
    key = f"doc:{doc_hash}"
    if db.exists(key):
        return db.get(key)  # already processed
    summary = llm(f"Summarize: {doc_content}")  # expensive
    entities = llm(f"Extract entities: {doc_content}")
    result = {"summary": summary, "entities": entities}  # a dict, not a set literal
    db.set(key, result)
    return result
Cache Layer Stack — Full Picture
"Same question written 5 different ways — how do you cache efficiently?" → Normalize first (lowercase, strip punctuation). Then semantic cache with 0.93 threshold. Optionally: LLM-rewrite to canonical form before cache lookup. Store all query variants that led to the same answer, so future hits improve.
Chunking Methods — Deep Dive with Examples
Chunking is splitting your documents into pieces before embedding. The chunk is the unit of retrieval — what you embed, what gets returned, what goes into the LLM context. Bad chunking breaks everything downstream, regardless of how good your model is.
One chunk should contain one coherent idea. It should be self-contained enough that someone reading just that chunk can understand the point — without needing the surrounding text.
Method 1 — Fixed-Size / Token-Based Chunking
Split every N tokens, regardless of content. Simple, predictable.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured in characters by default, not tokens
    chunk_overlap=64,    # overlap between consecutive chunks (same unit)
    separators=["\n\n", "\n", ". ", " "]
)
# For token-based sizes, build it with RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)
chunks = splitter.split_text(document)
Document: "The refund policy allows returns within 30 days. Products must be unused. | To start a return, visit our portal. Click 'Return Item'. You will receive a label within 2 hours."
With chunk_size=20 tokens, overlap=5:
Chunk 1: "The refund policy allows returns within 30 days. Products must be unused."
Chunk 2: "Products must be unused. To start a return, visit our portal."
← The overlap ("Products must be unused") ensures boundary context isn't lost.
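The windowing itself is a few lines of arithmetic. A minimal sketch, assuming whitespace-separated words as a stand-in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer); `fixed_size_chunks` is a hypothetical name.

```python
def fixed_size_chunks(text, chunk_size=20, overlap=5):
    """Split text into windows of chunk_size whitespace tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


example = fixed_size_chunks("a b c d e f g h i j", chunk_size=4, overlap=1)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk are repeated at the start of the next.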
| ✓ Pros | ✗ Cons |
|---|---|
| Dead simple to implement | Cuts sentences mid-thought |
| Predictable chunk sizes | Chunks may lack coherence |
| Fast | Chunk boundary = information loss |
Method 2 — Sentence-Based Chunking
Use NLP (spaCy/NLTK) to detect sentence boundaries. Group N sentences per chunk. Never cuts mid-sentence.
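import spacy
nlp = spacy.load("en_core_web_sm")

def sentence_chunks(text, sentences_per_chunk=5, overlap=1):
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be smaller than sentences_per_chunk")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    chunks = []
    step = sentences_per_chunk - overlap
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i : i + sentences_per_chunk]))
        if i + sentences_per_chunk >= len(sentences):
            break  # last window reached the end; avoids an overlap-only tail chunk
    return chunks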
Method 3 — Semantic / Embedding-Based Chunking
Split based on meaning shifts. Embed each sentence, measure cosine similarity between adjacent sentences. When similarity drops sharply → that's a natural topic boundary → split there.
def semantic_chunk(sentences, threshold=0.85):
    # threshold is model-dependent; tune it on your own data
    if not sentences:
        return []
    embeddings = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i-1], embeddings[i])
        if sim < threshold:  # topic changed → new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
Sentences: S1="The product ships in 3 days." S2="Free shipping on orders over 500." S3="Our headquarters is in Mumbai." S4="We have offices in Delhi too."
Similarity(S2, S3) drops below threshold (shipping → geography = topic shift).
Result: Chunk 1 = [S1, S2] (shipping/delivery), Chunk 2 = [S3, S4] (location). Clean topic separation!
Method 4 — Hierarchical / Parent-Child Chunking
Store chunks at two resolutions. Small chunks (128 tokens) for retrieval — precise, high-signal. Large parent chunks (512–1024 tokens) returned to the LLM for context.
# LlamaIndex implementation
# (llama-index >= 0.10 moved this import to llama_index.core.node_parser)
from llama_index.node_parser import HierarchicalNodeParser
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
# Query retrieves 128-token leaf nodes → fetch the 512-token parent → LLM gets 512
Parent-child is the best default for production RAG. Small chunks give precise retrieval (higher hit rate). Large parent gives the LLM enough context to answer well. Win-win.
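Independent of any framework, "retrieve narrow, return wide" comes down to keeping a child-to-parent map. A minimal sketch, assuming whitespace tokens and non-overlapping windows; `build_parent_child` and `expand_to_parents` are hypothetical names, and the child ids passed to the second function stand in for vector-search results.

```python
def build_parent_child(tokens, parent_size=512, child_size=128):
    """Split tokens into parent windows, then split each parent into child windows."""
    parents, children, child_to_parent = {}, {}, {}
    for p_idx, p_start in enumerate(range(0, len(tokens), parent_size)):
        parent_tokens = tokens[p_start:p_start + parent_size]
        parent_id = f"p{p_idx}"
        parents[parent_id] = " ".join(parent_tokens)
        for c_idx, c_start in enumerate(range(0, len(parent_tokens), child_size)):
            child_id = f"{parent_id}-c{c_idx}"
            children[child_id] = " ".join(parent_tokens[c_start:c_start + child_size])
            child_to_parent[child_id] = parent_id
    return parents, children, child_to_parent


def expand_to_parents(child_hits, child_to_parent, parents):
    """Map retrieved child ids to unique parent texts, preserving rank order."""
    seen, expanded = set(), []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # two children of one parent yield the parent once
            seen.add(parent_id)
            expanded.append(parents[parent_id])
    return expanded
```

Only the child texts are embedded and searched; the parents live in a plain document store and are fetched by id after retrieval.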
Method 5 — Document-Structure Aware Chunking
Respect document structure. Split on headings, section breaks, code blocks, table boundaries. This keeps semantic units intact.
# Markdown-aware chunking
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers = [("#", "title"), ("##", "section"), ("###", "subsection")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)
# Each chunk carries metadata: {"title": "...", "section": "..."}
Method 6 — Proposition Chunking
Extract atomic facts ("propositions") from text using an LLM. Each chunk = one factual statement. Highest quality, highest cost. Used in research / high-stakes RAG.
prompt = """Extract all atomic factual propositions from the text below.
Each proposition should be a single, self-contained factual claim.
Return as JSON list.
Text: "Apple was founded in 1976 by Steve Jobs and Steve Wozniak.
The first product was the Apple I computer."
Output: [
"Apple was founded in 1976.",
"Apple was founded by Steve Jobs.",
"Apple was founded by Steve Wozniak.",
"The Apple I was Apple's first product.",
"The Apple I is a computer."
]"""
Chunking Strategy Decision Guide
| Scenario | Best Strategy | Why |
|---|---|---|
| Quick prototype | Fixed-size (512, overlap 64) | Fast, works well enough |
| Production, mixed docs | Parent-child hierarchical | Best recall + context |
| PDFs with structure | Structure-aware + parent-child | Preserves document logic |
| High-quality knowledge base | Semantic chunking | Topic-coherent chunks |
| Legal / medical (precision) | Proposition chunking | Atomic facts = no ambiguity |
| Code repositories | AST-based (function/class level) | Code = structure matters |
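For the code-repository row, function- and class-level chunking can be sketched with Python's built-in `ast` module. This is a stand-in that only handles Python source (a multi-language pipeline would need a parser such as Tree-sitter); `python_ast_chunks` is a hypothetical name.

```python
import ast


def python_ast_chunks(source):
    """Return one chunk per top-level function or class, with its name and line span."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # exact source text of the definition, docstring included
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Top-level imports and module-level statements are skipped here; in practice you might attach them to every chunk as shared context, since a function body often makes little sense without its imports.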
"Your RAG gives partial answers — it has SOME information but misses details." → Classic chunk boundary issue. The relevant detail is in the second half of a chunk that got cut. Fix: increase overlap, use sentence-based chunking, or switch to parent-child (retrieve narrow, return wide).
Document Ingestion Pipeline at Scale
Ingestion is everything that happens before a query. It's your data preparation layer. A poor ingestion pipeline cannot be fixed by a good retrieval model.
Full Pipeline Overview
Document Acquisition & Deduplication
Receive documents (upload, crawl, S3 event, webhook). Check hash (SHA256 of content) against registry — skip if already processed. Store original to S3/GCS immediately.
Format Detection & Parsing
Detect format (PDF, DOCX, HTML, CSV, PPTX). Route to appropriate parser. Extract raw text + preserve structure signals (headings, tables, page numbers).
Cleaning & Normalization
Strip headers/footers/page numbers. Remove boilerplate (nav bars, legal disclaimers if not relevant). Fix encoding issues. Normalize whitespace. Detect and remove duplicated content within the document.
Metadata Extraction
Extract or infer: title, author, date, doc_type, source_url, language, tags. Optionally run LLM to extract richer metadata: summary, key topics, entities. This metadata powers filtering later.
Chunking
Apply strategy appropriate for document type. Attach metadata to each chunk: {doc_id, chunk_index, page, section, parent_id}.
Embedding (Batch GPU)
Embed all chunks in batches (batch_size=64–256). Batched GPU embedding can be 50–100× faster than one-by-one API calls, depending on model and hardware, so measure on your own setup. Use vLLM or the SentenceTransformer batch API. Cache embedding results.
Upsert to Vector DB
Bulk upsert chunks + embeddings into vector DB. Update document registry in PostgreSQL: {doc_id, status=indexed, indexed_at, chunk_count}.
Post-Indexing Enrichment (Optional)
Generate cross-document summaries. Build knowledge graph. Update search indexes (Elasticsearch BM25). Trigger notifications.
Parsing Strategies by Format
| Format | Tool | Special Handling |
|---|---|---|
| PDF (text-based) | PyMuPDF, pdfplumber | Multi-column layout detection, preserve reading order |
| PDF (scanned/image) | AWS Textract, Tesseract + layout | OCR required; table extraction mode |
| DOCX | python-docx | Extract heading hierarchy for structure-aware chunking |
| PPTX | python-pptx | Slide title + bullet text; slide = natural chunk boundary |
| HTML/Web | Trafilatura, BeautifulSoup | Remove nav/ads; preserve article structure |
| Spreadsheet (XLSX) | openpyxl + LLM | Convert rows to natural language: "Product X has price Y" |
| Code | Tree-sitter AST | Chunk at function/class boundaries, preserve docstrings |
| JSON/CSV | pandas + template | Schema-aware → natural language conversion |
Handling Tables
Tables are tricky for embedding. Option A: Convert to Markdown table (preserves structure, embeds OK). Option B: Convert each row to a natural language sentence. Option B generally retrieves better.
# Table row → natural language
row = {"Product": "Laptop X", "Price": 45000, "RAM": "16GB"}
text = f"{row['Product']} costs ₹{row['Price']} and has {row['RAM']} RAM."
# This usually embeds much better than raw JSON or CSV
Scale: Async Queue-Based Architecture
# Event-driven ingestion for scale
S3 upload event
→ SQS / Kafka message: {doc_id, s3_path, tenant_id}
→ Celery worker picks up job
→ Worker: parse → clean → chunk → batch embed
→ Upsert to Qdrant in batches of 100
→ Update PostgreSQL registry: status=indexed
→ Emit indexing_complete event
1. Scanned PDF with no OCR → silent empty embeddings. Always validate chunk text length (skip <20 tokens).
2. Missing metadata → can't filter later. Enforce metadata schema at ingestion time.
3. No dedup → same doc indexed 3× → retrieval returns duplicates constantly.
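All three failure modes can be caught by one gate placed just before embedding. A minimal sketch under stated assumptions: `REQUIRED_METADATA` is a hypothetical schema, tokens are approximated by whitespace-separated words, and deduplication is done per chunk by content hash (the document-level registry check would sit earlier in the pipeline).

```python
import hashlib

REQUIRED_METADATA = ("doc_id", "chunk_index", "source_url")  # hypothetical schema


def validate_chunk(text, metadata, seen_hashes, min_tokens=20):
    """Return a list of problems; an empty list means the chunk can be indexed."""
    problems = []
    if len(text.split()) < min_tokens:
        problems.append("too_short")  # typical symptom of a scanned page with no OCR
    missing = [f for f in REQUIRED_METADATA if metadata.get(f) in (None, "")]
    if missing:
        problems.append("missing_metadata:" + ",".join(missing))
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        problems.append("duplicate")
    else:
        seen_hashes.add(digest)
    return problems
```

Returning the reasons rather than a boolean lets the worker record why a chunk was rejected, which is what turns a silent failure into a visible one.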
"How do you handle a 500-page legal PDF with scanned pages, embedded tables, and complex formatting?" → Textract for OCR + table detection. Convert tables to markdown. Structure-aware chunking on sections. Parent-child: sections as parents, paragraphs as children. Store page number in metadata for citation.
Keeping the Vector DB Updated
Documents change. New ones arrive. Old ones become outdated. Your vector DB must reflect the real world — and no, you don't re-embed everything every time.
The Core Update Operations
Additive Insert
For a new document: parse → chunk → embed → upsert new vectors. HNSW handles incremental inserts without a full rebuild, at roughly O(log n) per insertion.
Delete + Re-insert
For an updated document: delete all chunks with doc_id=X from the vector DB, then re-run the full ingestion pipeline on the new version. Store the version in metadata.
Soft or Hard Delete
For a removed document: hard delete by doc_id filter (Qdrant/Pinecone support this), or soft delete by marking the chunks as deleted in metadata and filtering them out at query time.
Full Re-embed
For an embedding model change there is no way around it: you must re-embed ALL chunks. Run the job in the background and swap the index atomically. Never mix two embedding models in one index.
Do I Re-Chunk When Updating a Document?
When a document is updated, delete its old chunks and re-run the full pipeline (parse → chunk → embed → upsert) on the new version. Other documents are untouched. There is no cascading effect between documents in a vector DB.
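The delete-then-insert flow can be sketched against a toy store. Everything here is illustrative: `InMemoryVectorStore` is a hypothetical stand-in for a vector DB that supports delete-by-filter, and `embed` is whatever embedding function you pass in.

```python
class InMemoryVectorStore:
    """Toy stand-in for a vector DB that supports delete-by-filter on doc_id."""

    def __init__(self):
        self.points = {}  # point_id -> payload dict

    def delete_by_doc_id(self, doc_id):
        stale = [pid for pid, p in self.points.items() if p["doc_id"] == doc_id]
        for pid in stale:
            del self.points[pid]
        return len(stale)

    def upsert(self, point_id, payload):
        self.points[point_id] = payload


def replace_document(store, doc_id, version, chunks, embed):
    """Delete every chunk of doc_id, then insert the chunks of the new version."""
    removed = store.delete_by_doc_id(doc_id)
    for i, text in enumerate(chunks):
        store.upsert(f"{doc_id}:{i}", {
            "doc_id": doc_id, "version": version, "text": text, "vector": embed(text),
        })
    return removed
```

Deleting first matters because the new version may produce fewer chunks than the old one; overwriting by id alone would leave the surplus old chunks behind as orphans.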
You only re-chunk or re-embed EVERYTHING when the embedding model or the chunk strategy changes:
| Scenario | Must Re-embed All? | Reason |
|---|---|---|
| New document added | No | Incremental upsert |
| Document updated | That doc only | Delete old chunks, insert new |
| Embedding model changed | Yes, all | Vector space is different |
| Chunk strategy changed | Yes, all | Old and new chunk sizes incompatible |
| Metadata schema changed | Maybe | If filters break; else just migrate metadata |
Incremental Indexing Pipeline
-- Track document versions in PostgreSQL
CREATE TABLE doc_registry (
    doc_id TEXT PRIMARY KEY,
    content_hash TEXT, -- SHA256 of content
    indexed_at TIMESTAMP,
    embedding_model TEXT, -- 'text-embedding-3-small-v1'
    chunk_strategy TEXT, -- 'hierarchical-512-128'
    status TEXT -- 'indexed' | 'pending' | 'failed'
);
# Python: decide whether a document needs re-indexing
import hashlib

def should_reindex(doc_id, new_content):
    record = db.get(doc_id)
    if not record:
        return True  # new doc
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if record.content_hash != new_hash:
        return True  # changed
    return False  # unchanged
Handling Embedding Model Migration
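Switching from text-embedding-ada-002 to text-embedding-3-large means ALL your existing vectors are in the wrong space. Migration strategy: build a new index in parallel, re-embed in the background while the old index keeps serving, evaluate the new index, then swap atomically and keep the old index around for rollback.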
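One common way to carry out such a migration is a blue/green swap behind an alias. The sketch below shows only the control flow, under stated assumptions: `IndexAlias` and `migrate` are hypothetical names, an "index" is a plain dict, and `passes_eval` is whatever evaluation gate you define.

```python
class IndexAlias:
    """Maps a stable alias to whichever physical index currently serves queries."""

    def __init__(self, live_index):
        self.live = live_index
        self.previous = None

    def swap(self, new_index):
        """Point the alias at new_index and remember the old one for rollback."""
        self.previous, self.live = self.live, new_index

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("no previous index to roll back to")
        self.live, self.previous = self.previous, None


def migrate(alias, chunks, new_embed, new_index_name, passes_eval):
    """Build a second index with the new model, then swap only if evaluation passes."""
    new_index = {"name": new_index_name, "vectors": {}}
    for chunk_id, text in chunks.items():
        # the old index keeps serving queries while this loop runs
        new_index["vectors"][chunk_id] = new_embed(text)
    if passes_eval(new_index):
        alias.swap(new_index)
        return True
    return False
```

Queries always go through `alias.live`, so the two models never share an index and the cutover is a single pointer assignment.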
Freshness SLA — How Fresh is Fresh Enough?
| Use Case | Acceptable Staleness | Strategy |
|---|---|---|
| Internal HR policy docs | 1 week | Weekly batch re-index of changed files |
| Product catalog | 1 hour | Webhook on product update → queue indexing |
| News / blog posts | Real-time | Event-driven: publish → auto-index |
| Legal contracts | Immediately on upload | Synchronous indexing on upload (small doc) |
"10,000 PDFs in your DB and you want to upgrade embedding model. How?" → Build new index in parallel. Re-embed in background using batch processing (GPU, 1000 docs/hour). Keep serving from old index. Once new index is complete and eval shows improvement, do atomic pointer swap. Zero downtime. Old index kept as rollback for 48h.
BM25 & Hybrid Search Explained
What is BM25?
BM25 (Best Match 25) is a sparse retrieval algorithm — the modern standard for keyword-based search. It's an evolution of TF-IDF that adds document length normalization and term saturation.
IDF(t): Inverse Document Frequency — rare terms score higher.
tf(t,d): Term frequency in the document.
|d|/avgdl: Normalization for document length.
k₁ (1.2–2.0): Term frequency saturation — prevents a term appearing 100× from dominating.
b (0–1, default 0.75): Length normalization strength.
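The terms above combine into the standard Okapi BM25 score. This is the textbook form, written with the same symbols as the definitions; individual engines differ slightly in how they define IDF.

```latex
\mathrm{score}(q, d) \;=\; \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}
       {\mathrm{tf}(t, d) + k_1 \left( 1 - b + b \cdot \dfrac{|d|}{\mathrm{avgdl}} \right)}
```

As tf(t,d) grows, the fraction approaches k₁ + 1, which is the saturation effect: the hundredth occurrence of a term adds almost nothing. With b = 0 the length term vanishes and long documents are not penalized at all.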
Query: "Qwen3-0.6B LoRA training RunPod"
Dense (embedding) search might return generic ML training docs — because the embedding captures general ML semantics.
BM25 will find the exact doc that mentions "Qwen3", "0.6B", "LoRA", "RunPod" — because it looks for exact term matches. Product codes, model names, version numbers, IDs → BM25 wins.
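To see why exact terms dominate, here is a small self-contained BM25 scorer. It is a sketch rather than a replacement for a search engine: it assumes pre-tokenized, lowercased documents and a non-empty corpus, uses the Lucene-style IDF variant (always positive), and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))  # docs containing each term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "qwen3 0.6b lora training on runpod".split(),
    "general guide to training machine learning models".split(),
    "lora adapters explained".split(),
]
scores = bm25_scores("qwen3 0.6b lora training runpod".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The first document wins because it is the only one containing the rare terms; the generic training guide matches only "training", which also appears elsewhere in the corpus and therefore carries a low IDF.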
Dense vs Sparse — When Each Wins
Dense Search (Vector)
- Semantic / conceptual queries
- "Something about shipping delays" → finds "delivery postponed"
- Paraphrases, synonyms, multilingual
- Questions, long natural language queries
- When exact terms don't matter
Sparse Search (BM25)
- Exact keyword / entity queries
- "Invoice #INV-2024-9812" → exact match
- Model names, product codes, IDs, names
- Legal/medical jargon (dense may miss)
- Short, keyword-style queries
Hybrid Search — Combining Both
Run both BM25 and dense retrieval independently. Merge their ranked result lists. This is the gold standard for production RAG.
Reciprocal Rank Fusion (RRF) — The Standard Merge Method
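RRF scores each document as the sum, over all ranked lists, of 1 / (k + rank), where rank is the document's 1-based position in that list. The constant k=60 dampens the advantage of top positions so that no single list dominates. A document ranked 1st in one list and 10th in another (1/61 + 1/70 ≈ 0.031) still scores well above one ranked 50th in both (2/110 ≈ 0.018).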
def reciprocal_rank_fusion(rankings: list[list], k=60):
    """rankings: list of ranked doc-id lists"""
    scores = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Dense results ranked list (only the top 4 shown)
dense_ids = ["doc3", "doc1", "doc7", "doc2"]
# BM25 results ranked list (only the top 4 shown)
sparse_ids = ["doc1", "doc5", "doc3", "doc9"]
fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
Weighted Score Fusion (Alternative)
# Normalize both scores to [0,1], then weight
alpha = 0.7 # weight for dense
hybrid_score = alpha * dense_score + (1 - alpha) * bm25_score
RRF is preferred because it uses only ranks: it needs no score normalization, and its single constant k rarely needs tuning. Weighted fusion requires tuning alpha and careful normalization.
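For completeness, here is what the normalization step of weighted fusion can look like. A minimal sketch, assuming min-max scaling per result list and a score of 0 for documents missing from one list; both choices are assumptions, and `min_max` and `weighted_fusion` are hypothetical names.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; all-equal scores map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(dense, sparse, alpha=0.7):
    """Combine normalized dense and BM25 scores; a doc missing from one list gets 0 there."""
    d, s = min_max(dense), min_max(sparse)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in set(d) | set(s)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


ranked = weighted_fusion(
    dense={"doc1": 0.9, "doc2": 0.5, "doc3": 0.7},   # cosine similarities
    sparse={"doc1": 2.0, "doc3": 12.0},              # raw BM25 scores
)
```

Min-max scaling is sensitive to outliers: one unusually high BM25 score compresses every other document toward 0, which is part of why the rank-based approach is easier to operate.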
BGE-M3 — Unified Dense + Sparse Model
BGE-M3 is a single model that produces BOTH dense embeddings AND sparse BM25-like weights in one forward pass. No need to run two separate systems.
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel("BAAI/bge-m3")
output = model.encode(texts, return_dense=True, return_sparse=True)
dense_vecs = output["dense_vecs"] # for ANN search
sparse_vecs = output["lexical_weights"] # for BM25-like scoring
Where to Run BM25 in Your Stack
| Option | Tool | When to Use |
|---|---|---|
| Elasticsearch / OpenSearch | Built-in BM25 | Already have ES in stack, large scale |
| Weaviate | Native hybrid search | Single system for both dense + sparse |
| rank_bm25 (Python) | In-memory library | Small scale, no extra infra |
| BGE-M3 | Model output | Single model, no separate BM25 service |
"User searches for 'ISO 27001 clause 6.1.2' and gets irrelevant results." → This is the classic symptom of a pipeline with no sparse retrieval. Dense embeddings semantically match "security compliance" but miss the exact clause number. Add BM25 to the pipeline. Exact code/ID/number queries always need sparse retrieval.
Detecting When LLM Ignores Context
The model generates a plausible-sounding answer from its training data instead of the retrieved context. This is the most dangerous and hardest-to-detect failure in RAG.
Why This Happens
Weak System Prompt
"Use the context below" is easy for the model to ignore. The LLM's prior knowledge is very strong and wins when instructions are soft.
Context Not Relevant
Retrieved chunks don't contain the answer. The model has two choices: say "I don't know" or hallucinate. Without an explicit instruction to abstain, models often hallucinate.
Lost-in-the-Middle
Answer is in a chunk buried in the middle of a long context. LLMs attend more to beginning and end. Middle content gets "ignored."
Conflicting Information
Retrieved chunk says X, LLM's prior knowledge says Y. Model defaults to training knowledge rather than trusting the retrieved doc.
Detection Methods
Method 1: Faithfulness Check via LLM-as-Judge
After generation, run a second LLM call to verify each claim in the answer exists in the context.
eval_prompt = f"""
You are a faithfulness evaluator.
Retrieved Context:
{context}
Generated Answer:
{answer}
Task: For each claim in the answer, check if it is SUPPORTED,
CONTRADICTED, or NOT_FOUND in the context.
Return JSON:
{{
"verdict": "faithful" | "hallucinated",
"score": 0.0-1.0,
"unsupported_claims": ["claim1", ...]
}}
"""
result = llm.generate(eval_prompt) # judge model
Method 2: Token-Level Attribution
Check if key tokens/phrases in the answer appear verbatim or near-verbatim in the retrieved context. Simple, fast, deterministic.
def attribution_check(answer, context_chunks):
    answer_sentences = split_sentences(answer)
    unattributed = []
    for sent in answer_sentences:
        ngrams = extract_ngrams(sent, n=4)
        found = any(ng in chunk for ng in ngrams
                    for chunk in context_chunks)
        if not found:
            unattributed.append(sent)
    return unattributed  # empty = fully attributed
Method 3: RAGAS Faithfulness Metric (Automated)
RAGAS decomposes the answer into claims, then checks each claim against the context. Score = (claims supported by context) / (total claims in answer).
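# ragas 0.1-style API; later releases rename the dataset columns
from ragas import evaluate
from ragas.metrics import faithfulness
from datasets import Dataset
data = Dataset.from_dict({
    "question": [query],
    "answer": [answer],
    "contexts": [retrieved_chunks],  # a list of strings per question
})
result = evaluate(data, metrics=[faithfulness])
score = result["faithfulness"]  # 0.0 – 1.0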
Method 4: Self-Consistency Check
Ask the model: "Quote the exact sentence from the context that supports your answer." If it cannot produce a quote or produces a fake one, the answer was hallucinated.
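Catching a fake quote is a plain string check once the model has replied. A minimal sketch, assuming the reply is either the quote itself (optionally wrapped in double quotes) or the sentinel `NOT_IN_CONTEXT`; `quote_is_grounded` is a hypothetical helper, and the matching ignores case and whitespace differences only. The prompt that elicits the quote follows.

```python
import re


def _squash(text):
    """Lowercase and collapse whitespace so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_grounded(model_reply, context):
    """True only if the reply is a quote that occurs verbatim in the context."""
    reply = model_reply.strip().strip('"')
    if not reply or reply == "NOT_IN_CONTEXT":
        return False
    return _squash(reply) in _squash(context)
```

A `False` result covers both failure cases named above: the model admitted there is no supporting sentence, or it produced a sentence that does not exist in the context.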
verification_prompt = f"""
Context provided:
{context}
Your answer was: "{answer}"
Now quote the EXACT sentence from the context (word for word)
that supports this answer. If no such sentence exists, say:
"NOT_IN_CONTEXT"
"""
Prevention (Better than Detection)
| Technique | How | Effectiveness |
|---|---|---|
| Strong system prompt | "Answer ONLY using the context. If the answer is not in the context, say 'I don't have this information.'" | Medium-High |
| Spotlighting | Wrap context in special tags: <grounding>...</grounding>. Reference them explicitly. | Medium |
| Citation enforcement | Force structured output: answer must include [Source N] inline citation for every claim. | High |
| Better retrieval | If context actually contains the answer, model is less likely to drift. Fix retrieval first. | High |
| Temperature = 0 | Deterministic output follows context more faithfully than creative/high-temp output. | Medium |
| Model choice | Claude and GPT-4o follow "only use context" instructions more reliably than smaller models. | High |
"Faithfulness score is 0.65. How do you diagnose the root cause?" → Step 1: Look at which specific claims were unsupported. Step 2: Check if those facts exist in retrieved context — if yes, lost-in-middle or prompt issue. If no, retrieval failure — improve recall. Step 3: Check if it's a model issue by testing with explicit citation instruction. Each root cause has a different fix.
Domain-Specific vs Generic Embedding Models
Modern generic models (text-embedding-3-large, bge-m3) have gotten very good at domain understanding. But there are still cases where domain-specific models win — and cases where they don't.
The Honest Comparison (2024–25)
| Dimension | Generic (text-emb-3, bge-m3) | Domain-Specific |
|---|---|---|
| Out-of-box quality | Excellent on standard text | Excellent on domain text |
| Specialized jargon | OK — trained on diverse text | Best — trained on domain corpora |
| Rare domain terms | May embed weakly | Strong, seen in training |
| Abbreviations | "STEMI" ≈ general, not medical | "STEMI" = ST-elevation MI, precisely |
| Maintenance cost | Zero — provider maintains | You own it, you maintain it |
| MTEB benchmark | Top 5 overall | Top on domain-specific benchmarks |
When Generic Models Are Enough (Most Cases)
1. Your documents use standard English vocabulary, even if technical.
2. You're doing general enterprise RAG (HR, finance, operations, product docs).
3. You've evaluated on your data and generic performs well (>0.85 Hit Rate@5).
4. You can't afford the ops burden of maintaining a custom model.
5. Newer generic models (bge-m3, text-emb-3-large) are post-2023 and trained on domain text too.
When Domain-Specific Models Still Win
1. Highly specialized abbreviations: Medical (STEMI, CABG, eGFR), Legal (res ipsa loquitur, mens rea), Financial (EBITDA, CAGR, repo rate in context).
2. Cross-lingual domain: Indian legal documents mix English and regional language. Generic multilingual models lose nuance.
3. Code embedding: Generic models are mediocre on code. Use CodeBERT, GraphCodeBERT, or fine-tuned models.
4. You've measured a gap: If you benchmark and find Hit Rate@5 is 0.72 with generic vs 0.89 with domain — domain is worth the cost.
The Right Decision Process
Build an Eval Dataset First
Create 50–100 (question, relevant_doc) pairs from your actual documents. Use RAGAS synthetic generation or human annotators.
Benchmark 3–4 Candidate Models
Measure Hit Rate@5, MRR@10 on your eval set. Include: text-embedding-3-small, bge-large-en-v1.5, bge-m3, and domain model (if one exists).
If Gap Is Small (<5%) — Use Generic
The ops simplicity of a managed API embedding far outweighs a 3% recall improvement. Generic wins on total cost of ownership.
If Gap Is Large (>10%) — Fine-Tune Generic
Before adopting a niche domain model, try fine-tuning bge-large-en-v1.5 on your domain pairs. Often achieves domain-model quality with better maintainability.
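The two metrics named in the benchmarking step are simple to compute yourself. A minimal sketch, assuming exactly one relevant document per question, as in the (question, relevant_doc) pairs described above; the function names are hypothetical.

```python
def hit_rate_at_k(ranked_results, relevant, k=5):
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for ranked, rel in zip(ranked_results, relevant) if rel in ranked[:k])
    return hits / len(relevant)


def mrr_at_k(ranked_results, relevant, k=10):
    """Mean of 1/rank of the relevant doc, counting 0 when it is outside the top k."""
    total = 0.0
    for ranked, rel in zip(ranked_results, relevant):
        top = ranked[:k]
        if rel in top:
            total += 1.0 / (top.index(rel) + 1)
    return total / len(relevant)


# One ranked list of doc ids per eval question, plus the known relevant doc for each
ranked_results = [["d1", "d2", "d3"], ["d9", "d4", "d5"], ["d7", "d8", "d6"]]
relevant = ["d1", "d4", "dX"]

hit_rate = hit_rate_at_k(ranked_results, relevant, k=5)
mrr = mrr_at_k(ranked_results, relevant, k=10)
```

Run the same two functions over each candidate model's retrieval output and compare the numbers directly; the gap thresholds in the steps above then become a mechanical decision.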
Fine-Tuning Your Own Embedding (When Needed)
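# sentence-transformers contrastive fine-tuning
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

train_examples = [
    InputExample(texts=["What is STEMI?",
                        "ST-elevation myocardial infarction (STEMI) is..."],
                 label=1.0),  # positive pair
    InputExample(texts=["What is STEMI?",
                        "Annual leave policy is 15 days..."],
                 label=0.0),  # negative pair (an easy one, shown only for illustration)
]
# Train with CosineSimilarityLoss or MultipleNegativesRankingLoss
# Hard negatives are key; easy negatives teach nothing
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)  # fits the 1.0 / 0.0 labels above
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)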
"Should we use a medical embedding model for our hospital RAG system or is text-embedding-3-large fine?" → Benchmark first, don't assume. If your queries are natural language ("What are side effects of metformin?"), text-embedding-3-large may be 95% as good. If your queries are abbreviation-heavy ("What labs for AKI in CKD3?"), medical model or fine-tuned generic likely wins. The data decides, not intuition.
What Is an Index? When/How to Refresh It
The word "index" gets overloaded in RAG. Let's be precise about what it means in different contexts.
Three Meanings of "Index" in RAG
Vector Index (ANN Index)
The data structure inside the vector DB (HNSW graph, IVF clusters) that enables fast similarity search. This is what makes search sublinear (roughly O(log n) for HNSW) instead of a brute-force O(n) scan.
Search Index (BM25)
An inverted index: maps each term → list of documents containing that term + positions. What Elasticsearch / Lucene maintains. Powers keyword search.
LlamaIndex / LangChain Index
A high-level abstraction in RAG frameworks representing "your indexed knowledge base" — the combination of embeddings + vector store + retriever config.
Deep Dive: HNSW Vector Index
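HNSW (Hierarchical Navigable Small World) is a graph-based ANN index. Think of it as a multi-layer map: sparse upper layers carry a few long-range links, and the bottom layer (layer 0) contains every vector.
In a three-layer graph, search starts at layer 2, greedily moves toward the query vector, drops to layer 1, refines, then drops to layer 0 and finds the approximate nearest neighbors. Like navigation: highway → local road → street.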
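The layer-by-layer descent can be sketched on a toy graph. This is a simplified illustration, not how any particular vector DB implements it: the three hand-built layers, the 8 points on a line, and the function names are assumptions, and real HNSW keeps a candidate list of size ef at layer 0 where this sketch keeps a single node.

```python
import math

def greedy_search(graph, vectors, query, entry):
    """Hop to the closest neighbor until no neighbor is closer than the current node."""
    current = entry
    while True:
        closest = min(
            graph.get(current, []),
            key=lambda n: math.dist(vectors[n], query),
            default=current,
        )
        if math.dist(vectors[closest], query) >= math.dist(vectors[current], query):
            return current
        current = closest

def layered_search(layers, vectors, query, entry):
    """layers: adjacency dicts ordered top (sparsest) to bottom (layer 0)."""
    for graph in layers:
        entry = greedy_search(graph, vectors, query, entry)  # result seeds the next layer
    return entry

# Toy data: 8 points on a line, ids 0..7
vectors = {i: (float(i), 0.0) for i in range(8)}
layer2 = {0: [4], 4: [0]}                                  # "highway"
layer1 = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]}            # "local road"
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}  # "street"

nearest = layered_search([layer2, layer1, layer0], vectors, (5.2, 0.0), entry=0)
print(nearest)  # 5
```

The search visits nodes 0 → 4 → 6 → 5: each upper layer covers distance cheaply, and only the last hop happens on the dense bottom layer.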
What "New Docs Not Indexed" Actually Means
When you add a document to the system but don't upsert its embeddings into the vector DB, the document ends up in this state across the stack:
| Layer | State | Effect |
|---|---|---|
| S3 / file storage | ✓ Stored | File is safe |
| PostgreSQL doc registry | status='pending' | Tracked but not queryable |
| Vector DB (HNSW index) | ✗ Missing | ANN search won't find it |
| BM25 / Elasticsearch | ✗ Missing | Keyword search won't find it |
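The table suggests a simple reconciliation check: compare the document registry against the ids actually present in each index. Here is a minimal sketch, assuming you can export the set of document ids from the vector DB and from the keyword index; `find_unindexed` and the sample ids are hypothetical.

```python
def find_unindexed(registry, vector_doc_ids, keyword_doc_ids):
    """registry: dict doc_id -> status. Returns doc_id -> list of problems found."""
    report = {}
    for doc_id, status in registry.items():
        problems = []
        if status != "indexed":
            problems.append(f"registry status is {status!r}")
        if doc_id not in vector_doc_ids:
            problems.append("missing from vector index")
        if doc_id not in keyword_doc_ids:
            problems.append("missing from keyword index")
        if problems:
            report[doc_id] = problems
    return report

registry = {"doc-1": "indexed", "doc-2": "pending", "doc-3": "indexed"}
report = find_unindexed(
    registry,
    vector_doc_ids={"doc-1"},             # doc-3 was marked indexed but never upserted
    keyword_doc_ids={"doc-1", "doc-3"},
)
for doc_id, problems in report.items():
    print(doc_id, problems)
```

The `doc-3` case is the one worth alerting on: the registry claims success, but the vectors are not there.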
"New docs added but not indexed" = the ingestion pipeline stalled before the upsert step. The file exists but its embeddings were never inserted into the vector DB's HNSW graph.
Does HNSW Need to Be "Rebuilt"?
HNSW — Incremental (No Rebuild)
- New vectors inserted incrementally
- Each insert: O(log n) graph update
- No full rebuild needed for additions
- Qdrant, Weaviate, Chroma all support this
- Quality degrades very slightly over millions of inserts
IVF Flat — Requires Rebuild
- Centroids computed at index-build time (k-means)
- New vectors assigned to nearest centroid
- If corpus grows a lot, centroid quality degrades
- Recommend rebuilding after roughly 5–10× data growth
- FAISS IVF users must plan for periodic rebuilds
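Why stale centroids hurt IVF can be shown with a toy example. The sketch below is a pure-Python illustration of the IVF idea (assign to nearest centroid, probe only the closest lists), not FAISS code; the two centroids, the points, and the function names are assumptions chosen to make the effect visible.

```python
import math

def nearest_centroid(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], vec))

def ivf_add(lists, centroids, vec_id, vec):
    """Assign a new vector to its nearest existing centroid (no retraining)."""
    lists[nearest_centroid(centroids, vec)].append((vec_id, vec))

def ivf_search(lists, centroids, query, nprobe=1):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], query))
    candidates = [item for i in order[:nprobe] for item in lists[i]]
    if not candidates:
        return None
    return min(candidates, key=lambda item: math.dist(item[1], query))[0]

centroids = [(0.0, 0.0), (10.0, 10.0)]  # fixed at build time (k-means in practice)
lists = [[] for _ in centroids]
ivf_add(lists, centroids, "a", (1.0, 0.0))
ivf_add(lists, centroids, "b", (9.0, 10.0))
# The corpus drifts: new vectors land midway, far from both old centroids
ivf_add(lists, centroids, "c", (5.0, 4.0))  # assigned to list 0
ivf_add(lists, centroids, "d", (5.5, 5.0))  # assigned to list 1

query = (5.2, 4.6)  # true nearest neighbor is "d"
print(ivf_search(lists, centroids, query, nprobe=1))  # c
print(ivf_search(lists, centroids, query, nprobe=2))  # d
```

With one probe the search misses the true nearest neighbor because "c" and "d" were split across lists by centroids that no longer match the data. Raising nprobe recovers it at the cost of scanning more, and retraining the centroids fixes the underlying cause.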
When to Refresh / Rebuild the Index
| Scenario | Action | Urgency |
|---|---|---|
| New documents added | Incremental upsert (no rebuild) | Continuous |
| Embedding model changed | Full rebuild of new index | Planned migration |
| Chunk strategy changed | Full rebuild | Planned migration |
| IVF index, corpus grew 5–10× | Rebuild with new centroids | Periodic |
| HNSW degraded recall (measure it) | Raise ef_search first; if still low, rebuild with tuned M / ef_construction | Rare |
| Many deletions (>20% of corpus) | Compact / rebuild to reclaim space | Periodic |
Monitoring Index Health
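# Track these metrics
# (qdrant is a QdrantClient; db, prometheus, and run_eval_set stand in for your own
# helpers; the "docs" collection name and the histogram metric name are assumptions)
index_metrics = {
    "total_vectors": qdrant.get_collection("docs").points_count,
    "pending_docs": db.count("SELECT COUNT(*) FROM docs WHERE status='pending'"),
    "search_latency_p95": prometheus.query(
        "histogram_quantile(0.95, sum(rate(vector_search_ms_bucket[5m])) by (le))"),
    "recall_at_5": run_eval_set(test_queries),  # weekly eval
}
# Alert if pending_docs > 0 for more than 15 minutes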
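A raw pending count cannot tell a document uploaded ten seconds ago from one stuck for an hour, so the 15-minute rule needs upload timestamps. Below is a minimal sketch of that check, assuming each registry row exposes `id`, `status`, and a timezone-aware `uploaded_at`; the function name and field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def stale_pending_docs(docs, now, max_age=timedelta(minutes=15)):
    """Return ids of docs that have been 'pending' for longer than max_age."""
    return [
        d["id"]
        for d in docs
        if d["status"] == "pending" and now - d["uploaded_at"] > max_age
    ]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
docs = [
    {"id": "a", "status": "indexed", "uploaded_at": now - timedelta(hours=2)},
    {"id": "b", "status": "pending", "uploaded_at": now - timedelta(minutes=5)},
    {"id": "c", "status": "pending", "uploaded_at": now - timedelta(minutes=40)},
]
stale = stale_pending_docs(docs, now)
print(stale)  # ['c']
```

Document "b" is pending but still within the grace period, so only "c" should page someone.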
"User says 'The system doesn't know about the document I uploaded 10 minutes ago.' How do you investigate?" → Check document registry: is status='indexed' or 'pending'? If pending, ingestion pipeline stalled — check Celery/SQS queue for errors. If indexed, check vector count in DB — was the upsert confirmed? Also verify the doc's chunks aren't being filtered out by metadata filters at query time.
How to Find the Right K Value
K is the number of chunks you retrieve from the vector DB. It's one of the most impactful hyperparameters in RAG — and almost everyone picks it arbitrarily.
Why K Matters So Much
K Too Small (e.g., K=2)
- Miss the relevant chunk
- Low recall → incomplete answers
- "I don't have information about that"
- Fails on multi-hop questions
- Upside: fast and cheap
K Too Large (e.g., K=100)
- Lots of irrelevant noise in context
- LLM gets confused, loses focus
- Exceeds context window
- Higher latency and cost
- Lost-in-the-middle effect worsens
The Measurement Approach — Hit Rate@K Curve
This is the correct, data-driven way to find K. Build an eval set, measure the hit rate (did at least one relevant chunk appear in the top K?) at each K value, and find the elbow.
def hit_rate_at_k(eval_set, k_values=(1, 3, 5, 10, 20, 50)):
    """eval_set: list of {query, relevant_chunk_ids}. Returns {k: hit rate}."""
    results = {}
    for k in k_values:
        hits = 0
        for item in eval_set:
            retrieved = vector_search(item["query"], top_k=k)
            retrieved_ids = {r.id for r in retrieved}
            # A hit means at least one relevant chunk appears in the top k
            if retrieved_ids & set(item["relevant_chunk_ids"]):
                hits += 1
        results[k] = hits / len(eval_set)
    return results
# Example output:
# K=1: 0.42 | K=3: 0.67 | K=5: 0.79 | K=10: 0.86 | K=20: 0.90 | K=50: 0.91
# Big jumps at small K (0.42 → 0.79 by K=5);
# diminishing returns past K=10 (0.86 → 0.91 by K=50)
# → Elbow is at K=5 to K=10. Choose K=10 for retrieval, rerank to top-5.
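Reading the elbow off the curve can also be automated. The sketch below uses one simple heuristic, picking the smallest K after which the next step adds less than a minimum gain; `pick_elbow` and the 0.05 threshold are assumptions, and the curve is the example output above.

```python
def pick_elbow(hit_rates, min_gain=0.05):
    """hit_rates: {k: hit rate}. Smallest K whose next step adds less than min_gain."""
    ks = sorted(hit_rates)
    for k, next_k in zip(ks, ks[1:]):
        if hit_rates[next_k] - hit_rates[k] < min_gain:
            return k
    return ks[-1]  # the curve never flattened within the measured range

curve = {1: 0.42, 3: 0.67, 5: 0.79, 10: 0.86, 20: 0.90, 50: 0.91}
print(pick_elbow(curve))                 # 10
print(pick_elbow(curve, min_gain=0.10))  # 5
```

The two thresholds reproduce both ends of the K=5 to K=10 elbow range. Note that the gains are per measured step, not per unit of K, so the result depends on which K values you sampled.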
The Two-Stage K Strategy
The modern best practice: use a LARGE K for retrieval (high recall), then rerank down to a SMALL K for the LLM context (high precision).
| K Type | Purpose | Typical Value | Constraint |
|---|---|---|---|
| K_retrieve | Ensure recall — get enough candidates | 20–100 | Reranker capacity |
| K_context | Precision — only best go to LLM | 3–10 | Context window budget |
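The two-stage flow is short to express in code. This is a minimal sketch under stated assumptions: `retrieve` and `rerank_score` are callables you supply (a vector search and a cross-encoder in practice), and the toy word-overlap versions here exist only so the example runs on its own.

```python
def retrieve_then_rerank(query, retrieve, rerank_score, k_retrieve=50, k_context=5):
    """Stage 1: wide, cheap retrieval. Stage 2: rerank and keep only the best few."""
    candidates = retrieve(query, k_retrieve)
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:k_context]

corpus = [
    "refund period for annual plans",
    "annual leave policy",
    "refund policy and refund period details",
    "office opening hours",
]

def toy_retrieve(query, k):
    """Stand-in for vector search: any chunk sharing a word with the query."""
    words = set(query.split())
    return [c for c in corpus if words & set(c.split())][:k]

def toy_rerank_score(query, chunk):
    """Stand-in for a reranker: number of distinct query words in the chunk."""
    return len(set(query.split()) & set(chunk.split()))

context = retrieve_then_rerank(
    "annual refund period", toy_retrieve, toy_rerank_score, k_retrieve=50, k_context=2
)
print(context)
```

Stage 1 lets the loosely related leave-policy chunk through, and stage 2 drops it, which is the recall-then-precision split the table describes.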
Factors That Change the Right K
| Factor | Effect on K |
|---|---|
| Small chunks (128 tokens) | Need higher K — answer spans multiple small chunks |
| Large chunks (1024 tokens) | Can use lower K — each chunk is more complete |
| High document diversity | Need higher K — relevant info spread across many docs |
| Narrow, focused corpus | Lower K sufficient — less noise |
| Multi-hop questions | Need higher K — must retrieve multiple reasoning steps |
| Single-fact lookup | K=3 is often enough |
| Using reranker | Retrieve more (K=50), final context still small |
| No reranker | Smaller K — every chunk goes to LLM, precision matters |
Context Window Math
# Context budget calculation
model_context = 8192 # tokens (an 8k-context model)
system_prompt = 400
user_query = 100
output_budget = 800
chunk_size = 512
available_for_context = model_context - system_prompt - user_query - output_budget
# = 8192 - 400 - 100 - 800 = 6892 tokens
max_k = available_for_context // chunk_size
# = 6892 // 512 = 13 chunks maximum
# Practical K: 8–10 (leave buffer)
# With 128k context: K can be 50–100
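The calculation above assumes every chunk is exactly 512 tokens. When chunk sizes vary, a greedy packer over the ranked list gives the same answer without that assumption. The following is a sketch, assuming token counts are already known per chunk; `pack_chunks` and the chunk ids are hypothetical.

```python
def pack_chunks(ranked_chunks, model_context, reserved_tokens):
    """ranked_chunks: list of (chunk_id, token_count), best first.

    Adds chunks in rank order until the next one would exceed the budget.
    """
    budget = model_context - reserved_tokens
    selected, used = [], 0
    for chunk_id, tokens in ranked_chunks:
        if used + tokens > budget:
            break  # stop at the first overflow to preserve rank order
        selected.append(chunk_id)
        used += tokens
    return selected, used

reserved = 400 + 100 + 800  # system prompt + user query + output budget
ranked = [(f"chunk-{i}", 512) for i in range(20)]
selected, used = pack_chunks(ranked, model_context=8192, reserved_tokens=reserved)
print(len(selected), used)  # 13 6656
```

With uniform 512-token chunks this reproduces the 13-chunk ceiling computed above. In practice you would still cap the count below that ceiling to leave a buffer.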
Dynamic K — Adapting at Query Time
Don't use a fixed K for all queries. Use confidence-based dynamic K: retrieve more if early results have low similarity scores.
from statistics import mean

def dynamic_k_retrieve(query, base_k=5, max_k=20, min_avg_score=0.75):
    results = vector_search(query, top_k=base_k)
    # Treat an empty result set as zero confidence
    avg_score = mean(r.score for r in results) if results else 0.0
    if avg_score < min_avg_score:  # low confidence, get more
        results = vector_search(query, top_k=max_k)
    return results
Small corpus (<1k chunks), 512-token chunks → K=5.
Medium corpus (1k–50k chunks), 512-token chunks → K=8–10.
Large corpus + small chunks (128 tokens) → K=15–20.
With a reranker: K_retrieve = 50 as a default, K_context = 5–8 after reranking. This gives you the best of both worlds: high recall in retrieval, high precision in context.
"We went from K=5 to K=20 and answer quality went DOWN. Why?" → More noise. With K=20 and no reranker, chunks 6–20 are probably not relevant and they're diluting the LLM's focus. The LLM is confused by conflicting/off-topic information. Fix: add a reranker. Or raise K for retrieval but only pass top 5 to the LLM.