RAG Deep Dive — Part 2

Topic 01

Semantic Cache — How It Really Works

A semantic cache stores the result of a query and retrieves it again when a semantically similar (not necessarily identical) query comes in. Unlike a key-value cache, where the key must match exactly, a semantic cache uses vector similarity to find near-matches.

Core Mechanism

User Query → Embed query → Search cache index → Similarity ≥ threshold? → Return cached response

Similarity < threshold → Full RAG pipeline → Store (embedding, response) in cache

# Pseudocode: semantic cache lookup
def query_with_cache(user_query, threshold=0.95):
    q_emb = embed(user_query)

    # search the cache (it's a small vector store); hit is None when the cache is empty
    hit = cache_index.search(q_emb, top_k=1)
    if hit is not None and hit.score >= threshold:
        return hit.cached_response       # no LLM cost, ~5ms

    # cache miss: run full pipeline
    response = full_rag_pipeline(user_query, q_emb)
    cache_index.upsert(q_emb, response)  # store for next time
    return response

⚠ Threshold Matters Enormously

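0.95+ = very strict, only near-identical queries match. 0.85 = broader match, risk of wrong cached answer. For factual RAG: stay at 0.92–0.95. For conversational bots: 0.88–0.92 is often fine. These are cosine-similarity values and they shift with the embedding model, so calibrate the threshold on logged query pairs from your own traffic.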
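
The lookup and threshold logic above can be sketched as a small in-memory cache. This is a minimal sketch, assuming unit-length embeddings and a brute-force scan instead of an ANN index. `SemanticCache` and `toy_embed` are hypothetical names, and `toy_embed` is only a bag-of-words stand-in so the example runs without a model; it is not a real embedding.

```python
import math
from collections import Counter
from typing import Optional

def toy_embed(text: str) -> dict:
    """Hypothetical stand-in for an embedding model: unit-length bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}

def cosine(a: dict, b: dict) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(word, 0.0) for word, v in a.items())

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def lookup(self, query: str) -> Optional[str]:
        q = toy_embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:  # brute force; use an ANN index at scale
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))

cache = SemanticCache(threshold=0.95)
cache.store("what is the refund period", "cached answer about refunds")
print(cache.lookup("what is the refund period"))   # identical query: hit
print(cache.lookup("how do i reset my password"))  # unrelated query: None
```

With a real embedding model, paraphrases would land between these two extremes, which is exactly where the threshold choice decides hit or miss.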

The Big Question: Does Chat History Break Semantic Cache?

Yes — if you're naive about it. If you cache raw user messages without context, you'll return wrong cached answers. The same message "What is the refund period?" means different things depending on chat history.

There are two strategies:

Strategy A — Cache Condensed Standalone Queries

Convert the multi-turn history into a single standalone query first (LLM rewrite), then cache that. The cache key is the standalone query embedding, not the raw user message.

standalone = llm("""Given chat history:
User: "What products do you sell?"
Bot: "We sell SaaS subscriptions..."
User: "What is the refund period?"

Rewrite the last question as a fully standalone question.""")
# → "What is the refund period for your SaaS subscriptions?"
response = query_with_cache(standalone)

Strategy B — Cache Only User-Independent Queries

Don't cache personal, session-specific, or history-dependent queries at all. Only cache queries that are truly universal: "What is the company's vacation policy?" is cacheable. "What is MY leave balance?" is not.

Where Semantic Cache Actually Helps — Real Scenarios

Customer Support Bot

FAQ Repetition

"How do I reset my password?" vs "forgot my password steps" vs "change password guide" — all hit the same cached answer. Massive win for support bots.

Internal Knowledge Base

Policy Questions

"What is the PTO policy?" / "How many vacation days do employees get?" → same cache hit. 40–60% of enterprise knowledge base queries are variations of the same few dozen questions.

Product Search / Catalog

Product Discovery

"Cheapest laptop under 50k" vs "budget laptops below 50000 rupees" → same answer. Cache works beautifully here.

Medical / Legal RAG

Definition Queries

"What is hypertension?" vs "define high blood pressure" → safe to cache. But "Should I take metoprolol?" is user-specific — never cache.

Code Assistant

Common Patterns

"How to read a file in Python?" asked by 1000 developers. Cache once, serve all. Cost savings are enormous.

Analytics Chatbot

Report Queries

"Show last month's revenue" — if it was expensive to compute, cache for 30 minutes. Time-bounded cache (TTL) + semantic similarity.

Cache Invalidation — When to Bust

Trigger	Action	Example
New document indexed	Invalidate related cache entries	New policy doc → bust policy-related cache
Document updated	Tag-based invalidation	Price changes → bust product Q&A cache
TTL expiry	Time-based expiry	News/events: cache for 1 hour only
User feedback "wrong answer"	Delete specific cache entry	User flags incorrect answer
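
The triggers in the table map onto three operations: expire by TTL, invalidate by tag, and delete one entry. A minimal sketch follows, assuming an in-process dictionary rather than Redis; `TaggedCache` and its method names are hypothetical.

```python
import time

class TaggedCache:
    """Hypothetical cache store: each entry has an expiry time and a set of tags."""

    def __init__(self):
        self._entries = {}  # key -> (value, expires_at, tags)

    def put(self, key, value, ttl_seconds, tags=()):
        self._entries[key] = (value, time.monotonic() + ttl_seconds, frozenset(tags))

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at, _ = entry
        if time.monotonic() >= expires_at:  # TTL expiry
            del self._entries[key]
            return None
        return value

    def invalidate_tag(self, tag):
        """Drop every entry carrying `tag`; returns how many were removed."""
        stale = [k for k, (_, _, tags) in self._entries.items() if tag in tags]
        for k in stale:
            del self._entries[k]
        return len(stale)

    def delete(self, key):
        """Remove one entry, e.g. after a user flags the answer as wrong."""
        return self._entries.pop(key, None) is not None

cache = TaggedCache()
cache.put("pto policy", "cached policy answer", ttl_seconds=3600, tags={"policy"})
cache.put("laptop x price", "cached price answer", ttl_seconds=3600, tags={"product"})
cache.invalidate_tag("policy")  # a new policy document was indexed
```

Tagging entries at write time (for example with the source doc_id or document type) is what makes targeted invalidation possible later.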

Semantic Cache vs Exact Cache

Exact Cache (Redis KV)

Key = exact query string
Hit only on identical queries
O(1) lookup, near-zero overhead
Useless for natural language variation
Great for structured queries (API calls, SQL)

Semantic Cache

Key = query embedding vector
Hit on semantically similar queries
ANN lookup, ~5–15ms overhead
Handles natural language well
Risk: false positive hits at low thresholds

In production: Use both. Exact cache (Redis) as first check (~0ms), semantic cache as second (~10ms), then full pipeline as fallback.
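
The layering described above can be sketched as a function that tries the cheapest layer first. This is an illustrative sketch: `tiered_answer` is a hypothetical name, and the three callables stand in for a Redis lookup, a semantic cache lookup, and the full RAG pipeline.

```python
from typing import Callable, Optional, Tuple

def tiered_answer(
    query: str,
    exact_get: Callable[[str], Optional[str]],
    semantic_get: Callable[[str], Optional[str]],
    run_pipeline: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (answer, source), trying the cheapest layer first."""
    answer = exact_get(query)
    if answer is not None:
        return answer, "exact"
    answer = semantic_get(query)
    if answer is not None:
        return answer, "semantic"
    return run_pipeline(query), "pipeline"

exact_store = {"what is rag": "cached RAG answer"}
result = tiered_answer(
    "what is rag",
    exact_get=exact_store.get,
    semantic_get=lambda q: None,  # stand-in: always a miss
    run_pipeline=lambda q: "generated answer",
)
print(result)  # ('cached RAG answer', 'exact')
```

Write-back to both caches after a pipeline run is left out to keep the sketch short.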

💬 Interview Q

"Semantic cache returned a wrong answer — how do you debug?" → Check the threshold. Print the similarity score of the hit. If 0.87 matched and gave wrong answer, raise threshold to 0.93. Also add a staleness TTL. Log all cache hits with query pairs for audit.

Topic 02

Other Caching Methods in RAG

Semantic cache solves the "same question phrased differently" problem. But there are several other caching layers in a RAG system — each targeting a different bottleneck.

1. Prefix / Prompt Caching (KV Cache)

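Modern LLM APIs (Anthropic, OpenAI) support prompt prefix caching. If the beginning of your prompt (system prompt + retrieved docs) is exactly identical across requests, the provider reuses the attention KV cache for that prefix. The cached prefix skips recomputation and is billed at a reduced rate, so you pay full price only for the new part (the user question). Providers enforce a minimum prefix length, so very short prompts are not cached.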

# Same system prompt + context repeated → cached by provider
system = "You are a helpful assistant. Use the following docs: [1000 tokens of context]"

# First call: full computation
response_1 = llm(system + "Question: What is the return policy?")

# Second call with same prefix: KV cache reused
# ~50-80% cheaper if prefix is long and identical
response_2 = llm(system + "Question: How long for refund?")

When This Helps

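Works best when you have a fixed large system prompt OR when you can serve multiple queries against the same retrieved document set (batch mode). Anthropic bills cache reads at roughly 10% of the normal input-token price (a 90% discount), though writing a prefix to the cache costs more than a normal input token.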

2. Embedding Cache

Embedding the same text twice is wasteful. Cache (text → vector) pairs in Redis or a local dict. Critical for the indexing pipeline where the same chunk might be re-processed multiple times.

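# Embedding cache with TTL
import hashlib
import json

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379
EMBED_MODEL = "text-embedding-3-small"  # part of the key, so a model change never reuses old vectors

def cached_embed(text):
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    key = f"emb:{EMBED_MODEL}:{digest}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    vec = openai_embed(text)  # placeholder for your embedding call
    r.setex(key, 86400, json.dumps(vec))  # 24h TTL
    return vec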

3. Retrieval Result Cache

Cache the vector search results (chunk IDs + content) for a query embedding, not just the final LLM answer. This is useful when you want fresh LLM generation but don't want to repeat expensive retrieval.

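# Cache retrieval results separately from LLM response
# (r is a redis.Redis client; serialize/deserialize are placeholders, e.g. json.dumps/json.loads)
# Assumes q_emb is a list of floats; the built-in hash() raises TypeError on a list
emb_digest = hashlib.sha256(json.dumps(q_emb).encode()).hexdigest()
chunks_key = f"retrieval:{emb_digest}"
cached = r.get(chunks_key)
if cached is not None:
    chunks = deserialize(cached)
else:
    chunks = vector_search(q_emb)
    r.setex(chunks_key, 300, serialize(chunks))  # 5 min TTL

# LLM call happens every time (freshness), but retrieval is cached
response = llm.generate(query, chunks)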

4. Query Normalization Before Caching

Before any cache lookup, normalize the query to reduce variation. This dramatically increases cache hit rate.

Technique	Example: Before → After	Effect
Lowercase	"What is RAG?" → "what is rag?"	Removes case variation
Punctuation strip	"How does RAG work?!" → "how does rag work"	Cleans noise
Stopword remove	"Can you tell me what RAG is" → "tell RAG"	Drops filler words (best for exact-cache keys; may hurt embedding quality)
Spelling correction	"How does retreival work?" → "How does retrieval work?"	Unifies typos
LLM rewrite	Any phrasing → canonical form	Best quality, adds ~100ms
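
The first two rows of the table can be sketched with the standard library alone. `normalize_query` is a hypothetical helper name, and the sketch assumes ASCII punctuation.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_query(query: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = query.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_query("How does RAG work?!"))  # how does rag work
print(normalize_query("  What is   RAG? "))    # what is rag
```

One caveat: blanket punctuation stripping also rewrites identifiers, so "clause 6.1.2" becomes "clause 612". If your queries contain version numbers or IDs, exempt those tokens before normalizing.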

5. Document-Level Generation Cache

If you generate summaries, metadata, or structured extractions from documents during ingestion, cache them. Re-ingesting the same doc should not re-run expensive LLM passes.

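# During ingestion pipeline
import hashlib

def process_document(doc_content):
    doc_hash = hashlib.sha256(doc_content.encode("utf-8")).hexdigest()
    key = f"doc:{doc_hash}"
    if db.exists(key):
        cached = db.get(key)  # already processed
        return cached["summary"], cached["entities"]

    summary = llm(f"Summarize: {doc_content}")   # expensive
    entities = llm(f"Extract entities: {doc_content}")
    db.set(key, {"summary": summary, "entities": entities})  # a dict, not a set literal
    return summary, entities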

Cache Layer Stack — Full Picture

Query → Normalize → Exact cache (Redis, ~0ms) → Semantic cache (~10ms) → Retrieval cache → LLM (prompt prefix cache)

💬 Interview Q

"Same question written 5 different ways — how do you cache efficiently?" → Normalize first (lowercase, strip punctuation). Then semantic cache with 0.93 threshold. Optionally: LLM-rewrite to canonical form before cache lookup. Store all query variants that led to the same answer, so future hits improve.

Topic 03

Chunking Methods — Deep Dive with Examples

Chunking is splitting your documents into pieces before embedding. The chunk is the unit of retrieval — what you embed, what gets returned, what goes into the LLM context. Bad chunking breaks everything downstream, regardless of how good your model is.

🎯 Golden Rule

One chunk should contain one coherent idea. It should be self-contained enough that someone reading just that chunk can understand the point — without needing the surrounding text.

Method 1 — Fixed-Size / Token-Based Chunking

Split every N tokens, regardless of content. Simple, predictable.

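from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # max characters per chunk (the default length function is len)
    chunk_overlap=64,    # characters shared by consecutive chunks
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(document)
# For token-based sizes, build the splitter with
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=512, chunk_overlap=64)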

📄 Example

Document: "The refund policy allows returns within 30 days. Products must be unused. | To start a return, visit our portal. Click 'Return Item'. You will receive a label within 2 hours."

With chunk_size=20 tokens, overlap=5:
Chunk 1: "The refund policy allows returns within 30 days. Products must be unused."
Chunk 2: "Products must be unused. To start a return, visit our portal."
← The overlap ("Products must be unused") ensures boundary context isn't lost.

✓ Pros	✗ Cons
Dead simple to implement	Cuts sentences mid-thought
Predictable chunk sizes	Chunks may lack coherence
Fast	Chunk boundary = information loss
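
The overlap behavior in the example above can be reproduced with a small sliding-window splitter. This is a sketch that treats whitespace-separated words as tokens (an assumption; real pipelines count model tokens), and `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list:
    """Split into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # this window reached the end
            break
    return chunks

chunks = fixed_size_chunks("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

Each window starts `chunk_size - overlap` words after the previous one, which is why the last word of one chunk reappears as the first word of the next.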

Method 2 — Sentence-Based Chunking

Use NLP (spaCy/NLTK) to detect sentence boundaries. Group N sentences per chunk. Never cuts mid-sentence.

import spacy
nlp = spacy.load("en_core_web_sm")

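def sentence_chunks(text, sentences_per_chunk=5, overlap=1):
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be smaller than sentences_per_chunk")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    step = sentences_per_chunk - overlap
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i : i + sentences_per_chunk]))
        if i + sentences_per_chunk >= len(sentences):
            break  # avoid a trailing chunk that only repeats the overlap
    return chunks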

Method 3 — Semantic / Embedding-Based Chunking

Split based on meaning shifts. Embed each sentence, measure cosine similarity between adjacent sentences. When similarity drops sharply → that's a natural topic boundary → split there.

def semantic_chunk(sentences, threshold=0.85):
    # threshold is model-dependent: tune it, or split at a low percentile of the observed similarities
    if not sentences:
        return []
    embeddings = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i-1], embeddings[i])
        if sim < threshold:   # topic changed → new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

📄 Example

Sentences: S1="The product ships in 3 days." S2="Free shipping on orders over 500." S3="Our headquarters is in Mumbai." S4="We have offices in Delhi too."

Similarity(S2, S3) drops below threshold (shipping → geography = topic shift).
Result: Chunk 1 = [S1, S2] (shipping/delivery), Chunk 2 = [S3, S4] (location). Clean topic separation!

Method 4 — Hierarchical / Parent-Child Chunking

Store chunks at two resolutions. Small chunks (128 tokens) for retrieval — precise, high-signal. Large parent chunks (512–1024 tokens) returned to the LLM for context.

Document → Parent (512 tok) → Children (128 tok) ← embedded

Query → Child match → Lookup parent_id → Return parent to LLM

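# LlamaIndex implementation
# (import path for llama-index >= 0.10; older releases use llama_index.node_parser)
from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]   # three levels, largest to smallest
)
# Query retrieves 128-token nodes → fetch 512-token parent → LLM gets 512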

✓ Best Practice in Production

Parent-child is a strong default for production RAG. Small chunks give precise retrieval matches, and the larger parent gives the LLM enough context to answer well.
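
The "retrieve narrow, return wide" lookup can be sketched without any framework. This is an illustrative sketch that splits on words rather than tokens; `Child`, `build_parent_child`, and `expand_to_parents` are hypothetical names, and the retrieval step is faked with a hand-picked hit list.

```python
from dataclasses import dataclass

@dataclass
class Child:
    child_id: str
    parent_id: str
    text: str

def build_parent_child(doc: str, parent_words: int = 8, child_words: int = 4):
    """Split into word-window parents, then split each parent into children."""
    words = doc.split()
    parents, children = {}, []
    for p_idx, p_start in enumerate(range(0, len(words), parent_words)):
        parent_id = f"p{p_idx}"
        p_words = words[p_start:p_start + parent_words]
        parents[parent_id] = " ".join(p_words)
        for c_idx, c_start in enumerate(range(0, len(p_words), child_words)):
            text = " ".join(p_words[c_start:c_start + child_words])
            children.append(Child(f"{parent_id}-c{c_idx}", parent_id, text))
    return parents, children

def expand_to_parents(hits, parents):
    """Map retrieved children to their parents, deduplicated, in hit order."""
    seen, out = set(), []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            out.append(parents[hit.parent_id])
    return out

parents, children = build_parent_child("w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12")
hits = [children[1], children[0], children[2]]  # pretend these came from vector search
print(expand_to_parents(hits, parents))
```

Deduplication matters here: two children of the same parent often match one query, and sending the parent twice wastes context window.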

Method 5 — Document-Structure Aware Chunking

Respect document structure. Split on headings, section breaks, code blocks, table boundaries. This keeps semantic units intact.

# Markdown-aware chunking
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [("#", "title"), ("##", "section"), ("###", "subsection")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)
# Each chunk carries metadata: {"title": "...", "section": "..."}

Method 6 — Proposition Chunking

Extract atomic facts ("propositions") from text using an LLM. Each chunk = one factual statement. Highest quality, highest cost. Used in research / high-stakes RAG.

prompt = """Extract all atomic factual propositions from the text below.
Each proposition should be a single, self-contained factual claim.
Return as JSON list.

Text: "Apple was founded in 1976 by Steve Jobs and Steve Wozniak. 
The first product was the Apple I computer."

Output: [
  "Apple was founded in 1976.",
  "Apple was founded by Steve Jobs.",
  "Apple was founded by Steve Wozniak.",
  "The Apple I was Apple's first product.",
  "The Apple I is a computer."
]"""

Chunking Strategy Decision Guide

Scenario	Best Strategy	Why
Quick prototype	Fixed-size (512, overlap 64)	Fast, works well enough
Production, mixed docs	Parent-child hierarchical	Best recall + context
PDFs with structure	Structure-aware + parent-child	Preserves document logic
High-quality knowledge base	Semantic chunking	Topic-coherent chunks
Legal / medical (precision)	Proposition chunking	Atomic facts = no ambiguity
Code repositories	AST-based (function/class level)	Code = structure matters
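
For the code-repository row, a function/class-level splitter can be sketched with Python's built-in `ast` module. This covers Python source only and is a stand-in for a multi-language parser; `python_ast_chunks` is a hypothetical helper name.

```python
import ast

def python_ast_chunks(source: str) -> list:
    """One chunk per top-level function or class, with name and line range."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks

sample = '''import os

def load(path):
    """Read a file."""
    return open(path).read()

class Store:
    def get(self, key):
        return key
'''
for chunk in python_ast_chunks(sample):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Because the chunk text is the full definition, the docstring stays attached to the code it describes.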

💬 Interview Q

"Your RAG gives partial answers — it has SOME information but misses details." → Classic chunk boundary issue. The relevant detail is in the second half of a chunk that got cut. Fix: increase overlap, use sentence-based chunking, or switch to parent-child (retrieve narrow, return wide).

Topic 04

Document Ingestion Pipeline at Scale

Ingestion is everything that happens before a query. It's your data preparation layer. A poor ingestion pipeline cannot be fixed by a good retrieval model.

Full Pipeline Overview

Document Acquisition & Deduplication

Receive documents (upload, crawl, S3 event, webhook). Check hash (SHA256 of content) against registry — skip if already processed. Store original to S3/GCS immediately.

Format Detection & Parsing

Detect format (PDF, DOCX, HTML, CSV, PPTX). Route to appropriate parser. Extract raw text + preserve structure signals (headings, tables, page numbers).

Cleaning & Normalization

Strip headers/footers/page numbers. Remove boilerplate (nav bars, legal disclaimers if not relevant). Fix encoding issues. Normalize whitespace. Detect and remove duplicated content within the document.

Metadata Extraction

Extract or infer: title, author, date, doc_type, source_url, language, tags. Optionally run LLM to extract richer metadata: summary, key topics, entities. This metadata powers filtering later.

Chunking

Apply strategy appropriate for document type. Attach metadata to each chunk: {doc_id, chunk_index, page, section, parent_id}.

Embedding (Batch GPU)

Embed all chunks in batches (batch_size=64–256). GPU batch embedding is typically far faster (often 50–100×) than one-by-one API calls. Use a batching interface such as SentenceTransformer's encode with a batch_size, or an embedding server such as vLLM. Cache embedding results.

Upsert to Vector DB

Bulk upsert chunks + embeddings into vector DB. Update document registry in PostgreSQL: {doc_id, status=indexed, indexed_at, chunk_count}.

Post-Indexing Enrichment (Optional)

Generate cross-document summaries. Build knowledge graph. Update search indexes (Elasticsearch BM25). Trigger notifications.

Parsing Strategies by Format

Format	Tool	Special Handling
PDF (text-based)	PyMuPDF, pdfplumber	Multi-column layout detection, preserve reading order
PDF (scanned/image)	AWS Textract, Tesseract + layout	OCR required; table extraction mode
DOCX	python-docx	Extract heading hierarchy for structure-aware chunking
PPTX	python-pptx	Slide title + bullet text; slide = natural chunk boundary
HTML/Web	Trafilatura, BeautifulSoup	Remove nav/ads; preserve article structure
Spreadsheet (XLSX)	openpyxl + LLM	Convert rows to natural language: "Product X has price Y"
Code	Tree-sitter AST	Chunk at function/class boundaries, preserve docstrings
JSON/CSV	pandas + template	Schema-aware → natural language conversion

Handling Tables

Tables are tricky for embedding. Option A: Convert to Markdown table (preserves structure, embeds OK). Option B: Convert each row to a natural language sentence. Option B generally retrieves better.

# Table row → natural language
row = {"Product": "Laptop X", "Price": 45000, "RAM": "16GB"}
text = f"{row['Product']} costs ₹{row['Price']} and has {row['RAM']} RAM."
# This embeds much better than raw JSON or CSV

Scale: Async Queue-Based Architecture

# Event-driven ingestion for scale
S3 upload event
  → SQS / Kafka message: {doc_id, s3_path, tenant_id}
  → Celery worker picks up job
  → Worker: parse → clean → chunk → batch embed
  → Upsert to Qdrant in batches of 100
  → Update PostgreSQL registry: status=indexed
  → Emit indexing_complete event

⚠ Most Common Ingestion Bugs

1. Scanned PDF with no OCR → silent empty embeddings. Always validate chunk text length (skip <20 tokens).
2. Missing metadata → can't filter later. Enforce metadata schema at ingestion time.
3. No dedup → same doc indexed 3× → retrieval returns duplicates constantly.
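
The first two checks can be enforced with a small gate in front of the embedding step. This is a sketch under stated assumptions: the required metadata keys are an assumed minimal schema, token counts are approximated by whitespace splitting, and `validate_chunk` is a hypothetical helper name.

```python
REQUIRED_METADATA = {"doc_id", "chunk_index"}  # assumed minimal schema
MIN_TOKENS = 20

def validate_chunk(chunk: dict) -> list:
    """Return a list of problems; an empty list means the chunk can be indexed."""
    problems = []
    text = chunk.get("text", "")
    if len(text.split()) < MIN_TOKENS:  # whitespace tokens as an approximation
        problems.append("text too short (possible OCR failure)")
    missing = REQUIRED_METADATA - set(chunk.get("metadata", {}))
    if missing:
        problems.append(f"missing metadata: {sorted(missing)}")
    return problems

good = {"text": " ".join(["word"] * 25), "metadata": {"doc_id": "d1", "chunk_index": 0}}
bad = {"text": "", "metadata": {"doc_id": "d1"}}
print(validate_chunk(good))  # []
print(validate_chunk(bad))
```

Returning the list of problems instead of a boolean makes it easy to log why a chunk was rejected.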

💬 Interview Q

"How do you handle a 500-page legal PDF with scanned pages, embedded tables, and complex formatting?" → Textract for OCR + table detection. Convert tables to markdown. Structure-aware chunking on sections. Parent-child: sections as parents, paragraphs as children. Store page number in metadata for citation.

Topic 05

Keeping the Vector DB Updated

Documents change. New ones arrive. Old ones become outdated. Your vector DB must reflect the real world — and no, you don't re-embed everything every time.

The Core Update Operations

New Document

Additive Insert

Parse → chunk → embed → upsert new vectors. HNSW handles incremental inserts without full rebuild. O(log n) per insertion.

Document Updated

Delete + Re-insert

Delete all chunks with doc_id=X from vector DB. Re-run full ingestion pipeline on new version. Store version in metadata.

Document Deleted

Soft or Hard Delete

Hard delete by doc_id filter (Qdrant/Pinecone support this). Or soft delete: mark as deleted in metadata, filter out at query time.

Embedding Model Changed

Full Re-embed

No way around this. Must re-embed ALL chunks. Run in background, swap index atomically. Never mix two embedding models in one index.

Do I Re-Chunk When Updating a Document?

Answer: Yes, for updated documents. No, for others.

When a document is updated, delete its old chunks and re-run the full pipeline (parse → chunk → embed → upsert) on the new version. Other documents are untouched. There is no cascading effect between documents in a vector DB.
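
The delete-then-reinsert update can be sketched against an in-memory stand-in for the vector DB. `InMemoryVectorStore` and `replace_document` are hypothetical names; a real store would expose delete-by-filter and bulk upsert through its own client.

```python
class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB that supports delete-by-filter."""

    def __init__(self):
        self.points = {}  # (doc_id, chunk_index) -> chunk payload

    def delete_by_doc_id(self, doc_id: str) -> int:
        keys = [k for k in self.points if k[0] == doc_id]
        for k in keys:
            del self.points[k]
        return len(keys)

    def upsert(self, doc_id: str, chunks: list) -> None:
        for i, chunk in enumerate(chunks):
            self.points[(doc_id, i)] = chunk

def replace_document(store, doc_id, new_chunks):
    """Delete every old chunk for doc_id, then insert the new version's chunks."""
    removed = store.delete_by_doc_id(doc_id)
    store.upsert(doc_id, new_chunks)
    return removed

store = InMemoryVectorStore()
store.upsert("A", [{"text": "a0"}, {"text": "a1"}, {"text": "a2"}])
store.upsert("B", [{"text": "b0"}])
removed = replace_document(store, "A", [{"text": "a0 v2"}])
print(removed, sorted(store.points))  # 3 [('A', 0), ('B', 0)]
```

Deleting first matters when the new version has fewer chunks: an upsert alone would overwrite chunk 0 but leave the old chunks 1 and 2 behind as orphans.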

What each kind of change forces you to re-chunk or re-embed:

Scenario	Must Re-embed All?	Reason
New document added	No	Incremental upsert
Document updated	That doc only	Delete old chunks, insert new
Embedding model changed	Yes, all	Vector space is different
Chunk strategy changed	Yes, all	Mixing old and new chunk sizes makes retrieval inconsistent
Metadata schema changed	Maybe	If filters break; else just migrate metadata

Incremental Indexing Pipeline

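-- Track document versions in PostgreSQL
CREATE TABLE doc_registry (
  doc_id TEXT PRIMARY KEY,
  content_hash TEXT,        -- SHA256 of content
  indexed_at TIMESTAMP,
  embedding_model TEXT,     -- 'text-embedding-3-small-v1'
  chunk_strategy TEXT,      -- 'hierarchical-512-128'
  status TEXT               -- 'indexed' | 'pending' | 'failed'
);

# Python: decide whether a document needs re-indexing
import hashlib

def should_reindex(doc_id, new_content, embedding_model, chunk_strategy):
    record = db.get(doc_id)
    if not record: return True         # new doc
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if record.content_hash != new_hash: return True  # changed
    if record.embedding_model != embedding_model: return True  # model changed
    if record.chunk_strategy != chunk_strategy: return True    # chunking changed
    return False                                    # unchanged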

Handling Embedding Model Migration

Switching from text-embedding-ada-002 to text-embedding-3-large means ALL your existing vectors are in the wrong space. Migration strategy:

Create new index (v2) → Re-embed all docs (background) → Run both indexes in parallel → A/B test quality → Atomic swap: v1→v2 → Delete v1

Freshness SLA — How Fresh is Fresh Enough?

Use Case	Acceptable Staleness	Strategy
Internal HR policy docs	1 week	Weekly batch re-index of changed files
Product catalog	1 hour	Webhook on product update → queue indexing
News / blog posts	Real-time	Event-driven: publish → auto-index
Legal contracts	Immediately on upload	Synchronous indexing on upload (small doc)

💬 Interview Q

"10,000 PDFs in your DB and you want to upgrade embedding model. How?" → Build new index in parallel. Re-embed in background using batch processing (GPU, 1000 docs/hour). Keep serving from old index. Once new index is complete and eval shows improvement, do atomic pointer swap. Zero downtime. Old index kept as rollback for 48h.

Topic 06

BM25 & Hybrid Search Explained

What is BM25?

BM25 (Best Match 25) is a sparse retrieval algorithm — the modern standard for keyword-based search. It's an evolution of TF-IDF that adds document length normalization and term saturation.

BM25(q, d) = Σ IDF(tᵢ) × [ tf(tᵢ,d) × (k₁+1) ] / [ tf(tᵢ,d) + k₁(1 - b + b·|d|/avgdl) ]

IDF(t): Inverse Document Frequency — rare terms score higher.
tf(t,d): Term frequency in the document.
|d|/avgdl: Normalization for document length.
k₁ (1.2–2.0): Term frequency saturation — prevents a term appearing 100× from dominating.
b (0–1, default 0.75): Length normalization strength.
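
The formula can be turned into a few lines of pure Python. This is a minimal sketch with whitespace tokenization; the IDF variant used, ln(1 + (N - n + 0.5) / (n + 0.5)), is an assumption because the text above does not pin one down, and `bm25_scores` is a hypothetical helper name.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score every document against the query with the BM25 formula."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Assumed IDF variant; the "+1" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "qwen3 lora training on runpod",
    "general machine learning training guide",
    "company vacation policy",
]
scores = bm25_scores("qwen3 lora runpod", docs)
print(scores)
```

Only the first document contains any query term, so it is the only one with a non-zero score.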

📄 Example — Why BM25 Matters

Query: "Qwen3-0.6B LoRA training RunPod"

Dense (embedding) search might return generic ML training docs — because the embedding captures general ML semantics.

BM25 will find the exact doc that mentions "Qwen3", "0.6B", "LoRA", "RunPod" — because it looks for exact term matches. Product codes, model names, version numbers, IDs → BM25 wins.

Dense vs Sparse — When Each Wins

Dense Search (Vector)

Semantic / conceptual queries
"Something about shipping delays" → finds "delivery postponed"
Paraphrases, synonyms, multilingual
Questions, long natural language queries
When exact terms don't matter

Sparse Search (BM25)

Exact keyword / entity queries
"Invoice #INV-2024-9812" → exact match
Model names, product codes, IDs, names
Legal/medical jargon (dense may miss)
Short, keyword-style queries

Hybrid Search — Combining Both

Run both BM25 and dense retrieval independently. Merge their ranked result lists. This is the gold standard for production RAG.

Reciprocal Rank Fusion (RRF) — The Standard Merge Method

RRF_score(d) = Σ over rankings: 1 / (k + rank(d)) [k = 60]

The constant k=60 flattens the gap between top ranks, so no single list dominates the fused order. A document ranked 1st in one list and 10th in another still scores much better than one ranked 50th in both.

def reciprocal_rank_fusion(rankings: list[list], k=60):
    """rankings: list of ranked doc-id lists"""
    scores = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Dense results ranked list
dense_ids   = ["doc3", "doc1", "doc7", "doc2", ...]
# BM25 results ranked list
sparse_ids  = ["doc1", "doc5", "doc3", "doc9", ...]

fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
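
To check the claim about 1st-and-10th versus 50th-in-both, the scores can be computed directly. `rrf` here is a small hypothetical helper that scores one document from its ranks, separate from the fusion function above.

```python
def rrf(ranks: list, k: int = 60) -> float:
    """RRF score for one document given its 1-based rank in each list."""
    return sum(1 / (k + r) for r in ranks)

first_and_tenth = rrf([1, 10])  # 1/61 + 1/70
fiftieth_twice = rrf([50, 50])  # 2/110
only_first = rrf([1])           # appears in one list only
print(round(first_and_tenth, 4), round(fiftieth_twice, 4), round(only_first, 4))
# 0.0307 0.0182 0.0164
```

The last value shows a side effect worth knowing: with k=60, a document ranked 1st in only one list scores slightly below a document ranked 50th in both, so RRF rewards agreement between retrievers.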

Weighted Score Fusion (Alternative)

# Normalize both scores to [0,1], then weight
alpha = 0.7   # weight for dense
hybrid_score = alpha * dense_score + (1 - alpha) * bm25_score

RRF is preferred because it uses ranks only, so it needs no score normalization, and its single constant (k=60) rarely needs tuning. Weighted fusion requires tuning alpha and careful normalization.
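
The normalization step that weighted fusion depends on can be sketched with min-max scaling. This is one possible normalization, not the only one; `min_max` and `weighted_fusion` are hypothetical names and the scores below are made-up sample values.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(dense: dict, sparse: dict, alpha: float = 0.7) -> list:
    """alpha weights the dense score; a doc missing from a list counts as 0 there."""
    d, s = min_max(dense), min_max(sparse)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in set(d) | set(s)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

dense = {"doc1": 0.82, "doc3": 0.90, "doc7": 0.75}  # cosine similarities (sample values)
sparse = {"doc1": 12.4, "doc5": 9.1, "doc3": 3.0}   # raw BM25 scores (sample values)
ranking = weighted_fusion(dense, sparse)
print([doc for doc, _ in ranking])  # ['doc3', 'doc1', 'doc5', 'doc7']
```

Without the rescaling, the raw BM25 scores (which are unbounded) would swamp the cosine similarities regardless of alpha.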

BGE-M3 — Unified Dense + Sparse Model

BGE-M3 is a single model that produces BOTH dense embeddings AND sparse BM25-like weights in one forward pass. No need to run two separate systems.

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")
output = model.encode(texts, return_dense=True, return_sparse=True)
dense_vecs  = output["dense_vecs"]    # for ANN search
sparse_vecs = output["lexical_weights"] # for BM25-like scoring

Where to Run BM25 in Your Stack

Option	Tool	When to Use
Elasticsearch / OpenSearch	Built-in BM25	Already have ES in stack, large scale
Weaviate	Native hybrid search	Single system for both dense + sparse
rank_bm25 (Python)	In-memory library	Small scale, no extra infra
BGE-M3	Model output	Single model, no separate BM25 service

💬 Interview Q

"User searches for 'ISO 27001 clause 6.1.2' and gets irrelevant results." → This is what happens when the pipeline has no sparse retriever. Dense embeddings semantically match "security compliance" but miss the exact clause number. Add BM25 to the pipeline. Exact code/ID/number queries always need sparse retrieval.

Topic 07

Detecting When LLM Ignores Context

The model generates a plausible-sounding answer from its training data instead of the retrieved context. This is the most dangerous and hardest-to-detect failure in RAG.

Why This Happens

Root Cause

Weak System Prompt

"Use the context below" is easy for the model to ignore. The LLM's prior knowledge is very strong and wins when instructions are soft.

Root Cause

Context Not Relevant

Retrieved chunks don't contain the answer. The model has two choices: say "I don't know" or hallucinate. Without an explicit instruction to abstain, models often hallucinate.

Root Cause

Lost-in-the-Middle

Answer is in a chunk buried in the middle of a long context. LLMs attend more to beginning and end. Middle content gets "ignored."

Root Cause

Conflicting Information

Retrieved chunk says X, LLM's prior knowledge says Y. Model defaults to training knowledge rather than trusting the retrieved doc.

Detection Methods

Method 1: Faithfulness Check via LLM-as-Judge

After generation, run a second LLM call to verify each claim in the answer exists in the context.

eval_prompt = f"""
You are a faithfulness evaluator.

Retrieved Context:
{context}

Generated Answer:
{answer}

Task: For each claim in the answer, check if it is SUPPORTED, 
CONTRADICTED, or NOT_FOUND in the context.

Return JSON:
{{
  "verdict": "faithful" | "hallucinated",
  "score": 0.0-1.0,
  "unsupported_claims": ["claim1", ...]
}}
"""
result = llm.generate(eval_prompt)  # judge model

Method 2: Token-Level Attribution

Check if key tokens/phrases in the answer appear verbatim or near-verbatim in the retrieved context. Simple, fast, deterministic.

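import re

def split_sentences(text):
    # naive splitter; swap in spaCy or NLTK for production text
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def extract_ngrams(sentence, n=4):
    words = re.findall(r"\w+", sentence.lower())
    return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))]

def attribution_check(answer, context_chunks):
    # compare lowercased word sequences so case and punctuation do not block a match
    contexts = [" ".join(re.findall(r"\w+", c.lower())) for c in context_chunks]
    unattributed = []
    for sent in split_sentences(answer):
        ngrams = extract_ngrams(sent, n=4)
        found = any(ng in ctx for ng in ngrams for ctx in contexts)
        if not found:
            unattributed.append(sent)
    return unattributed  # empty = fully attributed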

Method 3: RAGAS Faithfulness Metric (Automated)

RAGAS decomposes the answer into claims, then checks each claim against the context. Score = (claims supported by context) / (total claims in answer).

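from ragas import evaluate
from ragas.metrics import faithfulness
from datasets import Dataset

data = Dataset.from_dict({
    "question": [query],
    "answer": [answer],
    "contexts": [retrieved_chunks],   # one list of chunk strings per question
})
result = evaluate(data, metrics=[faithfulness])  # needs a judge LLM configured
score = result["faithfulness"]  # 0.0 – 1.0
# Column names and imports follow the ragas 0.1 API; newer releases rename them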

Method 4: Quote-Back Check (Self-Verification)

Ask the model: "Quote the exact sentence from the context that supports your answer." If it cannot produce a quote, or the quote it produces does not actually appear in the context, treat the answer as hallucinated.

verification_prompt = f"""
Context provided:
{context}

Your answer was: "{answer}"

Now quote the EXACT sentence from the context (word for word) 
that supports this answer. If no such sentence exists, say: 
"NOT_IN_CONTEXT"
"""

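Detecting a fake quote requires checking the model's quote against the context in code rather than trusting it. A minimal sketch, assuming whitespace and case differences should be ignored; `quote_is_grounded` is a hypothetical helper name.

```python
import re

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not cause mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_is_grounded(model_quote: str, context: str) -> bool:
    """True only if the quoted text appears in the context."""
    quote = _normalize(model_quote).strip('" ')
    if not quote or quote == "not_in_context":
        return False
    return quote in _normalize(context)

context = "Returns are accepted within 30 days.\nProducts must be unused."
print(quote_is_grounded('"Returns are accepted within 30 days."', context))  # True
print(quote_is_grounded("Returns are accepted within 60 days.", context))    # False
print(quote_is_grounded("NOT_IN_CONTEXT", context))                          # False
```

An exact substring test is strict by design: a paraphrased "quote" fails, which is the behavior you want for this check.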
Prevention (Better than Detection)

Technique	How	Effectiveness
Strong system prompt	"Answer ONLY using the context. If the answer is not in the context, say 'I don't have this information.'"	Medium-High
Spotlighting	Wrap context in special tags: `<grounding>...</grounding>`. Reference them explicitly.	Medium
Citation enforcement	Force structured output: answer must include [Source N] inline citation for every claim.	High
Better retrieval	If context actually contains the answer, model is less likely to drift. Fix retrieval first.	High
Temperature = 0	Near-deterministic output tends to follow context more closely than creative/high-temp output.	Medium
Model choice	Claude and GPT-4o follow "only use context" instructions more reliably than smaller models.	High
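
Citation enforcement only helps if the output is checked. The sketch below flags sentences with no `[Source N]` tag and tags that point at a source that was never retrieved. It assumes the tag sits before the sentence's final punctuation, and `check_citations` is a hypothetical helper name.

```python
import re

def check_citations(answer: str, num_sources: int) -> dict:
    """Flag sentences without a [Source N] tag and tags pointing at missing sources."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    uncited, invalid = [], []
    for sentence in sentences:
        ids = [int(n) for n in re.findall(r"\[Source (\d+)\]", sentence)]
        if not ids:
            uncited.append(sentence)
        invalid.extend(i for i in ids if not 1 <= i <= num_sources)
    return {"uncited": uncited, "invalid_ids": invalid}

answer = ("Returns are accepted within 30 days [Source 1]. "
          "Shipping is free [Source 4]. Items must be unused.")
print(check_citations(answer, num_sources=2))
```

With two retrieved sources, the report lists "Items must be unused." as uncited and 4 as an invalid source id.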

💬 Interview Q

"Faithfulness score is 0.65. How do you diagnose the root cause?" → Step 1: Look at which specific claims were unsupported. Step 2: Check if those facts exist in retrieved context — if yes, lost-in-middle or prompt issue. If no, retrieval failure — improve recall. Step 3: Check if it's a model issue by testing with explicit citation instruction. Each root cause has a different fix.

Topic 08

Domain-Specific vs Generic Embedding Models

Modern generic models (text-embedding-3-large, bge-m3) have gotten very good at domain understanding. But there are still cases where domain-specific models win — and cases where they don't.

The Honest Comparison (2024–25)

Dimension	Generic (text-emb-3, bge-m3)	Domain-Specific
Out-of-box quality	Excellent on standard text	Excellent on domain text
Specialized jargon	OK — trained on diverse text	Best — trained on domain corpora
Rare domain terms	May embed weakly	Strong, seen in training
Abbreviations	"STEMI" ≈ general, not medical	"STEMI" = ST-elevation MI, precisely
Maintenance cost	Zero — provider maintains	You own it, you maintain it
MTEB benchmark	Near the top overall (leaderboard shifts often)	Top on domain-specific benchmarks

When Generic Models Are Enough (Most Cases)

✓ Use Generic If

1. Your documents use standard English vocabulary, even if technical.
2. You're doing general enterprise RAG (HR, finance, operations, product docs).
3. You've evaluated on your data and generic performs well (>0.85 Hit Rate@5).
4. You can't afford the ops burden of maintaining a custom model.
5. You're using a recent generic model (bge-m3, text-emb-3-large), whose training data already includes a lot of domain text.

When Domain-Specific Models Still Win

Domain Model Still Wins When

1. Highly specialized abbreviations: Medical (STEMI, CABG, eGFR), Legal (res ipsa loquitur, mens rea), Financial (EBITDA, CAGR, repo rate in context).
2. Cross-lingual domain: Indian legal documents mix English and regional language. Generic multilingual models lose nuance.
3. Code embedding: Generic models are mediocre on code. Use CodeBERT, GraphCodeBERT, or fine-tuned models.
4. You've measured a gap: If you benchmark and find Hit Rate@5 is 0.72 with generic vs 0.89 with domain — domain is worth the cost.

The Right Decision Process

Build an Eval Dataset First

Create 50–100 (question, relevant_doc) pairs from your actual documents. Use RAGAS synthetic generation or human annotators.

Benchmark 3–4 Candidate Models

Measure Hit Rate@5, MRR@10 on your eval set. Include: text-embedding-3-small, bge-large-en-v1.5, bge-m3, and domain model (if one exists).

If Gap Is Small (<5%) — Use Generic

The ops simplicity of a managed API embedding far outweighs a 3% recall improvement. Generic wins on total cost of ownership.

If Gap Is Large (>10%) — Fine-Tune Generic

Before adopting a niche domain model, try fine-tuning bge-large-en-v1.5 on your domain pairs. Often achieves domain-model quality with better maintainability.
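
The two metrics used in this process can be computed in a few lines. This sketch assumes one relevant document per question, matching the (question, relevant_doc) pairs described above; `hit_rate_at_k` and `mrr_at_k` are hypothetical helper names and the ids are sample values.

```python
def hit_rate_at_k(results: list, relevant: list, k: int = 5) -> float:
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return hits / len(relevant)

def mrr_at_k(results: list, relevant: list, k: int = 10) -> float:
    """Mean of 1/rank of the relevant doc, counting 0 when it is outside the top k."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        top = ranked[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(relevant)

results = [["d1", "d2", "d3"], ["d9", "d4", "d7"], ["d5", "d6", "d8"]]
relevant = ["d1", "d4", "d0"]
print(hit_rate_at_k(results, relevant, k=5))  # 2 of 3 queries hit
print(mrr_at_k(results, relevant, k=10))      # (1 + 1/2 + 0) / 3 = 0.5
```

Run the same two functions over each candidate model's retrieval output to get the gap that the decision steps above refer to.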

Fine-Tuning Your Own Embedding (When Needed)

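# sentence-transformers contrastive fine-tuning (classic fit API)
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # base model to adapt

train_examples = [
    InputExample(texts=["What is STEMI?",
                        "ST-elevation myocardial infarction (STEMI) is..."],
                 label=1.0),   # positive pair
    InputExample(texts=["What is STEMI?",
                        "Annual leave policy is 15 days..."],
                 label=0.0),   # negative pair (an easy one, shown only for the format)
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)  # batch size is illustrative

# CosineSimilarityLoss consumes the labeled pairs above.
# MultipleNegativesRankingLoss instead takes positive pairs only (no labels)
# and treats the other items in the batch as negatives.
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=1)
# Hard negatives are key: easy negatives teach nothing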
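One common way to get hard negatives is to take the chunks that the current model ranks highest for a query but that are not labeled relevant. The sketch below shows that idea in plain Python. The function name, the chunk ids, and the 2-D toy vectors are all hypothetical. In practice the embeddings would come from your base model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mine_hard_negatives(query_emb, chunk_embs, positive_ids, n=2):
    """Return ids of the n most query-similar chunks that are NOT labeled relevant."""
    scored = [
        (cosine(query_emb, emb), chunk_id)
        for chunk_id, emb in chunk_embs.items()
        if chunk_id not in positive_ids
    ]
    scored.sort(reverse=True)              # most similar (hardest) first
    return [chunk_id for _, chunk_id in scored[:n]]


# Toy 2-D embeddings (made up for illustration).
query_emb = [1.0, 0.0]                     # "What is STEMI?"
chunk_embs = {
    "stemi_def":    [1.0, 0.1],            # the labeled positive
    "nstemi_def":   [0.9, 0.4],            # close in meaning -> hard negative
    "mi_overview":  [0.7, 0.7],            # somewhat related
    "leave_policy": [0.0, 1.0],            # unrelated -> easy negative
}
hard = mine_hard_negatives(query_emb, chunk_embs, positive_ids={"stemi_def"}, n=2)
```

Review mined negatives before training: a top-ranked unlabeled chunk is sometimes a relevant chunk that the eval set simply missed.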

💬 Interview Q

"Should we use a medical embedding model for our hospital RAG system or is text-embedding-3-large fine?" → Benchmark first, don't assume. If your queries are natural language ("What are side effects of metformin?"), text-embedding-3-large may be 95% as good. If your queries are abbreviation-heavy ("What labs for AKI in CKD3?"), medical model or fine-tuned generic likely wins. The data decides, not intuition.

Topic 09

What Is an Index? When/How to Refresh It

The word "index" gets overloaded in RAG. Let's be precise about what it means in different contexts.

Three Meanings of "Index" in RAG

Meaning 1

Vector Index (ANN Index)

The data structure inside the vector DB (HNSW graph, IVF clusters) that enables fast approximate similarity search. This is what makes search sublinear (roughly O(log n) for HNSW) instead of the O(n) of a brute-force scan.

Meaning 2

Search Index (BM25)

An inverted index: maps each term → list of documents containing that term + positions. What Elasticsearch / Lucene maintains. Powers keyword search.

Meaning 3

LlamaIndex / LangChain Index

A high-level abstraction in RAG frameworks representing "your indexed knowledge base" — the combination of embeddings + vector store + retriever config.
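To make Meaning 2 concrete, here is a minimal sketch of an inverted index that maps each term to the documents containing it, with positions. The tokenizer (lowercase, split on word characters) and the two toy documents are assumptions. Lucene-style engines add analyzers, compression, and scoring statistics on top of this basic shape.

```python
import re
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map term -> {doc_id: [positions]} for a small in-memory corpus."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = re.findall(r"\w+", text.lower())
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
    # Convert nested defaultdicts to plain dicts so lookups cannot create keys.
    return {term: dict(postings) for term, postings in index.items()}


docs = {
    "d1": "Refund period is 30 days",
    "d2": "The refund policy covers the refund period",
}
index = build_inverted_index(docs)
refund_postings = index["refund"]   # {"d1": [0], "d2": [1, 5]}
```

A keyword query then becomes a dictionary lookup per term followed by a merge of the posting lists, which is why keyword search never has to scan the documents themselves.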

Deep Dive: HNSW Vector Index

HNSW (Hierarchical Navigable Small World) is a graph-based ANN index. Think of it as a multi-layer map:

Layer 2 (sparse) — long-range "highways"

Layer 1 (medium density) — regional connections

Layer 0 (dense) — all nodes, local connections

Search starts at the top layer (layer 2 here), greedily moves toward the query vector, drops to layer 1, refines, drops to layer 0, and returns approximate nearest neighbors (not guaranteed to be exact). Like navigation: highway → local road → street.
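The descent described above can be sketched on a toy graph. This is a simplification for illustration only: real HNSW keeps a candidate list of size ef at layer 0 instead of a single current node, and it builds the layers probabilistically. The hand-built layers, the node ids, and the function name below are all assumptions.

```python
import math


def hnsw_greedy_search(query, entry, layers, vectors):
    """Greedy descent: in each layer, hop to the closest neighbor until stuck."""
    current = entry
    for neighbors in layers:                       # top layer first, layer 0 last
        while True:
            candidates = [current] + list(neighbors.get(current, []))
            best = min(candidates, key=lambda n: math.dist(vectors[n], query))
            if best == current:                    # no neighbor is closer: drop down
                break
            current = best
    return current


# Nine points on a line; node i sits at (i, 0).
vectors = {i: (float(i), 0.0) for i in range(9)}
layer2 = {0: [8], 8: [0]}                          # sparse "highways"
layer1 = {0: [4], 4: [0, 8], 8: [4]}               # regional connections
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}

nearest = hnsw_greedy_search((5.2, 0.0), entry=0, layers=[layer2, layer1, layer0],
                             vectors=vectors)
# Path taken: 0 -> 8 (layer 2), 8 -> 4 (layer 1), 4 -> 5 (layer 0)
```

The upper layers cover most of the distance in a few hops, so layer 0 only has to do local refinement.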

What "New Docs Not Indexed" Actually Means

When you add a document to the system but don't upsert its embeddings into the vector DB, those chunks are:

Layer	State	Effect
S3 / file storage	✓ Stored	File is safe
PostgreSQL doc registry	status='pending'	Tracked but not queryable
Vector DB (HNSW index)	✗ Missing	ANN search won't find it
BM25 / Elasticsearch	✗ Missing	Keyword search won't find it

"New docs added but not indexed" = the ingestion pipeline stalled before the upsert step. The file exists but its embeddings were never inserted into the vector DB's HNSW graph.

Does HNSW Need to Be "Rebuilt"?

HNSW — Incremental (No Rebuild)

New vectors inserted incrementally
Each insert: O(log n) graph update
No full rebuild needed for additions
Qdrant, Weaviate, Chroma all support this
Quality degrades very slightly over millions of inserts

IVF Flat — Requires Rebuild

Centroids computed at index-build time (k-means)
New vectors assigned to nearest centroid
If corpus grows a lot, centroid quality degrades
Rule of thumb: rebuild after roughly 5–10× data growth
FAISS IVF users must plan for periodic rebuilds
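A small sketch of why frozen centroids age badly: when new vectors drift away from the data the centroids were trained on, they pile into a few inverted lists, and every probe of those lists scans more vectors. The imbalance ratio below is an illustrative metric of my choosing, not a FAISS API, and the centroids and points are made up.

```python
import math
from collections import Counter


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec (this is the IVF assignment step)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def list_imbalance(vectors, centroids):
    """Largest inverted-list size divided by the ideal (perfectly even) size."""
    counts = Counter(nearest_centroid(v, centroids) for v in vectors)
    ideal = len(vectors) / len(centroids)
    return max(counts.values()) / ideal


# Centroids fixed at index-build time (hypothetical values).
centroids = [(0.0, 0.0), (10.0, 10.0)]

initial = [(0.0, 1.0), (1.0, 0.0), (10.0, 9.0), (9.0, 10.0)]
drifted = [(20.0, 20.0), (21.0, 19.0), (19.0, 21.0), (22.0, 22.0)]  # new topic area

before = list_imbalance(initial, centroids)            # 2 and 2 vectors per list
after = list_imbalance(initial + drifted, centroids)   # 2 and 6 vectors per list
```

Tracking a ratio like this over time gives a measurable trigger for the periodic rebuild, instead of rebuilding on a fixed schedule.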

When to Refresh / Rebuild the Index

Scenario	Action	Urgency
New documents added	Incremental upsert (no rebuild)	Continuous
Embedding model changed	Full rebuild of new index	Planned migration
Chunk strategy changed	Full rebuild	Planned migration
IVF index, corpus grew 5–10×	Rebuild with new centroids	Periodic
HNSW degraded recall (measure it)	Raise ef_search first; if that isn't enough, rebuild with higher M / ef_construction	Rare
Many deletions (>20% of corpus)	Compact / rebuild to reclaim space	Periodic

Monitoring Index Health

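# Track these metrics
# (qdrant is a qdrant_client.QdrantClient; the collection name "docs" and the
#  db / prometheus / run_eval_set helpers are placeholders for your own stack)
index_metrics = {
  "total_vectors": qdrant.get_collection("docs").points_count,  # one vector per point
  "pending_docs": db.count("SELECT COUNT(*) FROM docs WHERE status='pending'"),
  "search_latency_p95": prometheus.query(  # assumes a histogram named vector_search_ms
      "histogram_quantile(0.95, sum(rate(vector_search_ms_bucket[5m])) by (le))"),
  "recall_at_5": run_eval_set(test_queries),  # weekly eval
}
# Alert if pending_docs > 0 for more than 15 minutes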
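The alert rule in the last comment needs a little state, because a single sample of pending_docs > 0 is normal during ingestion. A minimal sketch follows, with a hypothetical class name and timestamps passed in explicitly so the logic is easy to test.

```python
from typing import Optional


class PendingDocsAlert:
    """Fires when pending_docs has stayed above zero for longer than the limit."""

    def __init__(self, max_pending_seconds: float = 15 * 60):
        self.max_pending_seconds = max_pending_seconds
        self._pending_since: Optional[float] = None

    def observe(self, pending_docs: int, now: float) -> bool:
        """Record one sample; return True if the alert should fire."""
        if pending_docs == 0:
            self._pending_since = None          # queue drained: reset the clock
            return False
        if self._pending_since is None:
            self._pending_since = now           # first non-zero sample
        return now - self._pending_since > self.max_pending_seconds


alert = PendingDocsAlert()
samples = [(3, 0), (2, 600), (1, 901), (0, 1000)]   # (pending_docs, seconds)
fired = [alert.observe(count, now) for count, now in samples]
```

This version only knows that the queue was never empty. Under continuous uploads a healthy pipeline can look the same, so alerting on the age of the oldest pending document is usually the more precise signal.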

💬 Interview Q

"User says 'The system doesn't know about the document I uploaded 10 minutes ago.' How do you investigate?" → Check document registry: is status='indexed' or 'pending'? If pending, ingestion pipeline stalled — check Celery/SQS queue for errors. If indexed, check vector count in DB — was the upsert confirmed? Also verify the doc's chunks aren't being filtered out by metadata filters at query time.

Topic 10

How to Find the Right K Value

K is the number of chunks you retrieve from the vector DB. It's one of the most impactful hyperparameters in RAG — and almost everyone picks it arbitrarily.

Why K Matters So Much

K Too Small (e.g., K=2)

Miss the relevant chunk
Low recall → incomplete answers
"I don't have information about that"
Fails on multi-hop questions
Fast, cheap

K Too Large (e.g., K=100)

Lots of irrelevant noise in context
LLM gets confused, loses focus
Exceeds context window
Higher latency and cost
Lost-in-the-middle effect worsens

The Measurement Approach — Hit Rate@K Curve

This is the correct, data-driven way to find K. Build an eval set, measure the hit rate at each K value, find the elbow.

def hit_rate_at_k(eval_set, k_values=(1, 3, 5, 10, 20, 50)):
    """eval_set: list of {query, relevant_chunk_ids}; vector_search is your retriever."""
    results = {}
    for k in k_values:
        hits = 0
        for item in eval_set:
            retrieved = vector_search(item["query"], top_k=k)
            retrieved_ids = {r.id for r in retrieved}
            if retrieved_ids & set(item["relevant_chunk_ids"]):
                hits += 1
        results[k] = hits / len(eval_set)
    return results

# Example output:
# K=1: 0.42 | K=3: 0.67 | K=5: 0.79 | K=10: 0.86 | K=20: 0.90 | K=50: 0.91
#                                 ↑                                         ↑
#                           big jump here                        diminishing return
# → Elbow is at K=5 to K=10. Choose K=10 for retrieval, rerank to top-5.
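Reading the elbow off the curve by eye works, but it can also be automated. The sketch below picks the smallest K after which each extra chunk adds less than a chosen amount of hit rate. The 0.01 threshold and the function name are assumptions to tune, not a standard.

```python
def pick_elbow_k(hit_rates: dict, min_gain_per_chunk: float = 0.01) -> int:
    """Smallest K after which one more retrieved chunk adds < min_gain_per_chunk."""
    if not hit_rates:
        raise ValueError("hit_rates is empty")
    ks = sorted(hit_rates)
    for k, next_k in zip(ks, ks[1:]):
        gain_per_chunk = (hit_rates[next_k] - hit_rates[k]) / (next_k - k)
        if gain_per_chunk < min_gain_per_chunk:
            return k
    return ks[-1]                      # curve never flattened within the range tested


# The example curve from above.
curve = {1: 0.42, 3: 0.67, 5: 0.79, 10: 0.86, 20: 0.90, 50: 0.91}
elbow = pick_elbow_k(curve)
# Gains per extra chunk: 0.125, 0.06, 0.014, 0.004, ... so the curve flattens after K=10.
```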

The Two-Stage K Strategy

The modern best practice: use a LARGE K for retrieval (high recall), then rerank down to a SMALL K for the LLM context (high precision).

K_retrieve = 50

→

Cross-encoder rerank

→

K_context = 5

→

LLM

K Type	Purpose	Typical Value	Constraint
K_retrieve	Ensure recall — get enough candidates	20–100	Reranker capacity
K_context	Precision — only best go to LLM	3–10	Context window budget
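A minimal sketch of the two-stage flow, with the retriever and the reranker passed in as plain callables so the structure is visible. The names and the toy word-overlap scorer are hypothetical. In production the `score` function would typically wrap a cross-encoder (for example the `CrossEncoder` class in sentence-transformers) scoring (query, passage) pairs.

```python
def retrieve_then_rerank(query, retrieve, score, k_retrieve=50, k_context=5):
    """Stage 1: wide retrieval for recall. Stage 2: rerank and keep the best few."""
    candidates = retrieve(query, k_retrieve)
    scored = [(score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k_context]]


corpus = [
    "refund period is 30 days",
    "shipping takes 5 days",
    "annual leave policy",
    "refund requests need a receipt",
]


def toy_retrieve(query, k):
    # Stand-in for vector search: returns candidates in a deliberately poor order.
    return list(reversed(corpus))[:k]


def toy_score(query, text):
    # Stand-in for a cross-encoder: number of words shared by query and passage.
    return len(set(query.lower().split()) & set(text.lower().split()))


context = retrieve_then_rerank("refund period", toy_retrieve, toy_score,
                               k_retrieve=4, k_context=2)
```

Because the reranker sees the query and passage together, it can afford to be slower and more accurate than the embedding search, as long as K_retrieve stays within its latency budget.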

Factors That Change the Right K

Factor	Effect on K
Small chunks (128 tokens)	Need higher K — answer spans multiple small chunks
Large chunks (1024 tokens)	Can use lower K — each chunk is more complete
High document diversity	Need higher K — relevant info spread across many docs
Narrow, focused corpus	Lower K sufficient — less noise
Multi-hop questions	Need higher K — must retrieve multiple reasoning steps
Single-fact lookup	K=3 is often enough
Using reranker	Retrieve more (K=50), final context still small
No reranker	Smaller K — every chunk goes to LLM, precision matters

Context Window Math

# Context budget calculation
model_context = 8192   # tokens (e.g., the original 8K gpt-4)
system_prompt = 400
user_query    = 100
output_budget = 800
chunk_size    = 512

available_for_context = model_context - system_prompt - user_query - output_budget
# = 8192 - 400 - 100 - 800 = 6892 tokens

max_k = available_for_context // chunk_size
# = 6892 // 512 = 13 chunks maximum

# Practical K: 8–10 (leave buffer)
# With 128k context (e.g., gpt-4o-mini): K can be 50–100

Dynamic K — Adapting at Query Time

Don't use a fixed K for all queries. Use confidence-based dynamic K: retrieve more if early results have low similarity scores.

from statistics import mean

def dynamic_k_retrieve(query, base_k=5, max_k=20):
    results = vector_search(query, top_k=base_k)
    # Low average similarity (or nothing back) signals low confidence: widen the net
    if not results or mean(r.score for r in results) < 0.75:
        results = vector_search(query, top_k=max_k)
    return results

✓ Practical K Defaults (No Reranker)

Small corpus (<1k chunks), 512-token chunks → K=5.
Medium corpus (1k–50k chunks), 512-token chunks → K=8–10.
Large corpus + small chunks (128 tokens) → K=15–20.

✓ Practical K Defaults (With Reranker)

K_retrieve = 50 (a good default within the 20–100 range), K_context = 5–8 after reranking. This gives you the best of both worlds: high recall in retrieval, high precision in context.

💬 Interview Q

"We went from K=5 to K=20 and answer quality went DOWN. Why?" → More noise. With K=20 and no reranker, chunks 6–20 are probably not relevant and they're diluting the LLM's focus. The LLM is confused by conflicting/off-topic information. Fix: add a reranker. Or raise K for retrieval but only pass top 5 to the LLM.