🤖 Building Production-Grade Multi-Agent AI Systems

A comprehensive guide based on the Smart Claims Processor case study — covering LangGraph orchestration, CrewAI sub-crews, memory systems, HITL patterns, and production best practices.

System Overview

The Smart Claims Processor is a production-grade multi-agent system that processes insurance claims end-to-end. It demonstrates key patterns for building reliable AI systems:

🎯 Core Technologies

LangGraph — State machine orchestration
CrewAI — Multi-agent sub-crews
ChromaDB — Vector memory
FastAPI — REST backend
React — Frontend dashboard

📊 System Stats

7 specialized agents
5 distinct pipeline paths
3 memory tiers
Multiple HITL checkpoints
Country-aware configuration

The 7 Agents

Agent	Role	LLM vs Rules	Confidence Threshold
Intake	Validate & mask PII	Mostly rules	0.55
Fraud Crew	3 specialists + manager	LLM-heavy	0.50
Damage	Assess severity, depreciate	LLM + deterministic math	0.60
Policy	Coverage & exclusions	DB lookup + LLM reasoning	0.60
Settlement	Calculate final payout	LLM validation + Python math	0.65
Evaluator	Grade pipeline output	LLM (separate model)	0.70 composite
Communication	Generate claimant message	LLM	N/A (always runs)

The 5 Pipeline Paths

Path A: Normal (Happy Path)

Intake → Fraud → Damage → Policy → Settlement → Evaluator → Communication

All 7 agents run sequentially. Triggers when fraud is low (< 0.45), confidence is high, and no exclusions hit.

Path B: HITL (Human Review)

Any Agent → [PAUSE] → Reviewer Decision → Resume → Communication

Triggers when the fraud score is at least 0.45 but below 0.90, an agent's confidence falls below its threshold, the evaluator score is below 0.70, or the claim is high-value.

Path C: Auto-Reject (Confirmed Fraud)

Intake → Fraud (≥ 0.90) → Communication

Bypasses all other agents. Used for clear fraud patterns.

Path D: Invalid (Intake Fail)

Intake [FAIL] → Communication

Missing fields, lapsed policy, or incident date out of range.

Path E: Fast Mode (< $500)

Intake → Fraud → Settlement → Evaluator → Communication

Skips Damage and Policy checks for low-value claims with clean history.

Key Design Principles

🎯 Principle 1: LLMs Reason, Code Computes

Never trust an LLM with arithmetic that has financial consequences. Use LLMs for judgment (severity assessment, exclusion reasoning) and Python for math (depreciation, settlement formulas).

🔗 Principle 2: Agents Chain Through State

Agents don't call each other directly. They all read/write from a shared ClaimState TypedDict. This prevents tight coupling and makes the system composable.
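
A minimal sketch of chaining through state, assuming a trimmed-down state type. MiniClaimState and the placeholder fraud score are hypothetical and exist only for illustration; the point is that each node takes the state and returns an updated copy, and no node calls another directly.

```python
from typing import TypedDict


class MiniClaimState(TypedDict, total=False):
    # Hypothetical, trimmed-down stand-in for the guide's ClaimState
    claim_amount: float
    pipeline_path: str
    fraud_score: float


def intake_node(state: MiniClaimState) -> MiniClaimState:
    # Writes the routing hint that later nodes read
    path = "fast" if state["claim_amount"] < 500 else "normal"
    return {**state, "pipeline_path": path}


def fraud_node(state: MiniClaimState) -> MiniClaimState:
    # Reads only from state; it never calls intake_node directly.
    # The score is a placeholder, not a real fraud model.
    return {**state, "fraud_score": 0.10}


state: MiniClaimState = {"claim_amount": 320.0}
for node in (intake_node, fraud_node):
    state = node(state)
```

Because the only contract between nodes is the state shape, a node can be reordered, skipped, or replaced without touching its neighbours.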

🚦 Principle 3: Confidence Gates Everywhere

Every agent outputs a confidence score. Multiple checkpoints can trigger HITL review. This creates a safety net where uncertainty triggers human oversight.

💾 Principle 4: Durable State via Checkpoints

LangGraph's SqliteSaver persists state to disk. The pipeline survives server restarts. HITL pauses can last hours or days — state never gets lost.

Agent 1: Intake Agent

Role: The gatekeeper. Validates required fields, checks policy status, and masks PII before any data reaches downstream agents.

What It Does

Validates required fields (name, policy number, incident date, claim amount)
Looks up policy in database, checks if active
Verifies incident date falls within policy coverage period
Confirms claim type matches policy coverage types
Masks PII (SSN, Aadhaar, phone numbers) before data flows downstream
Sets pipeline_path to "fast" if amount < $500

LLM vs Rules

This agent is mostly rule-based. LLMs are optional here, only used if you need to extract structured data from free-text claim descriptions.

⚠️ When to Use LLM in Intake

Only when claimant submits unstructured text like: "My Honda was rear-ended last Tuesday, bumper damage, shop quoted $4,200"

The LLM extracts: claim_type=auto, amount=4200, damage_type=rear_end_collision

If the frontend already collects structured data, skip the LLM entirely.

PII Masking (Critical)

This happens before any downstream agent sees the data. Use country-specific regex patterns:

PII_PATTERNS = {
    "us": [
        (r"\b\d{3}-\d{2}-\d{4}\b", "***-**-****"),   # SSN
        (r"\b[A-Z]\d{7}\b", "[DL-MASKED]"),          # Driver license
        (r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "***-***-****"),  # Phone
    ],
    "india": [
        (r"\b\d{4}\s\d{4}\s\d{4}\b", "**** **** ****"),  # Aadhaar
        (r"\b[A-Z]{5}\d{4}[A-Z]\b", "[PAN-MASKED]"),     # PAN
        (r"\b[6-9]\d{9}\b", "**********"),                # Mobile
    ],
}
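
One way to apply these patterns is sketched below. The helper name mask_pii is an assumption, and the table is a subset of PII_PATTERNS repeated so the example runs on its own.

```python
import re

# Subset of the PII_PATTERNS table above, repeated so the sketch is self-contained
PII_PATTERNS = {
    "us": [
        (r"\b\d{3}-\d{2}-\d{4}\b", "***-**-****"),             # SSN
        (r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "***-***-****"),  # Phone
    ],
    "india": [
        (r"\b\d{4}\s\d{4}\s\d{4}\b", "**** **** ****"),        # Aadhaar
        (r"\b[6-9]\d{9}\b", "**********"),                     # Mobile
    ],
}


def mask_pii(text: str, country: str) -> str:
    """Apply every pattern registered for the country, in order."""
    for pattern, replacement in PII_PATTERNS.get(country, []):
        text = re.sub(pattern, replacement, text)
    return text


masked = mask_pii("SSN 123-45-6789, call 555-867-5309", "us")
```

Note that this sketch returns the text unchanged for an unknown country. A production system would probably want to fail closed instead, for example by rejecting the claim at intake.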

Output Structure

{
    "intake_passed": true,
    "masked_data": {
        "claimant_name": "John Doe [PII-MASKED]",
        "description": "Car accident. My number is ***-***-****",
        "claim_amount": 4200.0
    },
    "pipeline_path": "normal",  # or "fast" if amount < $500
    "confidence_scores": {
        "intake": 0.95
    }
}

✅ Best Practice: Fail Fast, Fail Clear

If intake fails, immediately route to Communication Agent with a specific denial reason:

"Missing required field: incident_date"
"Policy POL-001 is not active"
"Incident date 2024-06-15 is outside policy coverage period"

Never send a vague "your claim cannot be processed" message.
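
The checks could be sketched as a single function that returns the first specific reason it finds. The policy keys used here (active, start_date, end_date) are hypothetical; the real policy schema is not shown in this guide.

```python
from datetime import date
from typing import Optional

REQUIRED_FIELDS = ("claimant_name", "policy_number", "incident_date", "claim_amount")


def validate_intake(claim: dict, policy: dict) -> Optional[str]:
    """Return a specific denial reason, or None if intake passes."""
    for field in REQUIRED_FIELDS:
        if claim.get(field) in (None, ""):
            return f"Missing required field: {field}"
    # "active", "start_date" and "end_date" are assumed policy keys
    if not policy.get("active", False):
        return f"Policy {claim['policy_number']} is not active"
    incident = date.fromisoformat(claim["incident_date"])
    if not (policy["start_date"] <= incident <= policy["end_date"]):
        return f"Incident date {incident.isoformat()} is outside policy coverage period"
    return None
```

The returned string can be stored as intake_denial_reason so the Communication Agent can quote it verbatim.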

Agent 2: Fraud Detection Crew

Role: Multi-agent fraud analysis using CrewAI. Three specialist agents investigate the claim, and a manager synthesizes their findings into one fraud score.

The Crew Structure

🔍 Pattern Analyst

Tool: search_fraud_patterns()

Checks claim against known fraud schemes: staged accidents, phantom passengers, inflated invoices, pre-existing damage.

📊 Anomaly Detector

Tool: check_statistical_outlier()

Computes z-score vs benchmark. Flags if repair cost is 2+ standard deviations from the mean for this damage type.

🕵️ Social Validator

Tool: check_claimant_history()

Cross-references claimant's prior claims. Flags multiple claims in short periods or claims across providers.

👔 Manager Agent

No tools — reads all 3 reports

Synthesizes findings into final fraud score (0.0–1.0) with reasoning and flags.

CrewAI Configuration

from crewai import Crew, Process

crew = Crew(
    agents=[pattern_analyst, anomaly_detector, 
            social_validator, manager_agent],
    tasks=[pattern_task, anomaly_task, social_task, synthesis_task],
    process=Process.sequential,  # each feeds into next
    verbose=True
)

# synthesis_task has context=[pattern_task, anomaly_task, social_task]
# This is how the manager reads all three reports

Routing Logic

Fraud Score	Action	Next Node
< 0.45	Low risk, continue pipeline	Damage Assessor (or Settlement if fast mode)
0.45 – 0.89	Suspicious, pause for review	HITL Checkpoint
≥ 0.90	Confirmed fraud, auto-reject	Communication Agent

Why CrewAI for Fraud?

Fraud detection benefits from multiple perspectives. A single LLM call might miss patterns that a team of specialists catches. CrewAI's crew abstraction with a manager maps perfectly to this — three analysts submit reports, the manager synthesizes.

You could do this with three separate LangGraph nodes, but CrewAI's crew and task abstractions express the same analyst-to-manager handoff with less code. Whether this also produces better fraud scores is something to verify on your own data.

Tool Design Pattern

Notice the tools return structured JSON, not prose. The LLM reads the JSON and incorporates it into its reasoning:

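import json

@tool("check_statistical_outlier")
def check_statistical_outlier(claim_type: str, amount: float) -> str:
    """Compare a claim amount against the benchmark for its claim type."""
    benchmarks = {"auto": {"mean": 4000, "std": 1500}}
    b = benchmarks.get(claim_type)
    if b is None:
        # Unknown claim type: report it instead of raising a TypeError
        return json.dumps({"error": f"no benchmark for claim type '{claim_type}'"})

    z_score = (amount - b["mean"]) / b["std"]

    return json.dumps({
        "z_score": round(z_score, 2),
        "is_outlier": abs(z_score) >= 2,  # 2+ standard deviations from the mean
        "benchmark_mean": b["mean"]
    })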

⚠️ Common Mistake: Letting LLM Query Database

Never give the LLM direct database access. Always have a Python tool that runs the query and returns structured data. The LLM reasons about the data, not about SQL.

Output Structure

{
    "fraud_score": 0.23,
    "decision": "low_risk",
    "reasoning": "No known patterns matched. Claim amount within normal range...",
    "flags": ["low_risk"],
    "pattern_matches": [],
    "anomaly_detected": false,
    "history_clean": true
}

Agent 3: Damage Assessor

Role: Assess damage severity and estimate repair cost. Apply country-specific depreciation rules deterministically.

The Pattern: LLM Judges, Python Computes

🤖 LLM Responsibilities

Assess damage severity (low/medium/high/total_loss)
Estimate raw pre-depreciation repair cost
Call search_similar_claims() tool to calibrate
Provide reasoning and flags

🐍 Python Responsibilities

Apply depreciation formula (NOT left to LLM)
Use country-specific depreciation tables
Compute final post-depreciation cost
Handle edge cases (vehicle age, part types)

Depreciation by Country

US: Year-Based Depreciation

def apply_depreciation_us(raw_cost: float, vehicle_age_years: int) -> dict:
    rates = {1: 0.20, 2: 0.15, 3: 0.12, 4: 0.10}
    rate = rates.get(vehicle_age_years, 0.08)  # 8% for 5+ years
    
    depreciation = raw_cost * rate
    final = raw_cost - depreciation
    
    return {
        "raw_cost": raw_cost,
        "depreciation_rate": rate,
        "depreciation_amount": round(depreciation, 2),
        "final_cost": round(final, 2),
        "method": "year_based"
    }

India: IRDAI Part-Wise Depreciation

def apply_depreciation_india(parts: dict) -> dict:
    # parts = {"rubber": 2000, "metal": 8000, "glass": 1500}
    rates = {"rubber": 0.50, "metal": 0.05, "glass": 0.00}
    
    total_raw = 0
    total_after = 0
    
    for part, cost in parts.items():
        rate = rates.get(part, 0.10)
        after = cost * (1 - rate)
        total_raw += cost
        total_after += after
    
    return {
        "total_raw_cost": round(total_raw, 2),
        "total_after_dep": round(total_after, 2),
        "method": "part_wise_irdai"
    }

Memory Integration

The LLM can call search_similar_claims() to calibrate its estimate against historical data:

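@tool("search_similar_claims")
def search_similar_claims(claim_type: str, description: str) -> str:
    """Find past claims similar to this description and summarize them."""
    # ChromaDB semantic search (covered in memory section)
    results = long_term_collection.query(
        query_embeddings=[embed_text(description)],
        n_results=5,
        where={"claim_type": claim_type}
    )

    # query() returns parallel lists, one inner list per query embedding
    metadatas = results["metadatas"][0]
    if not metadatas:
        # Day-1 case: nothing stored yet, so avoid dividing by zero
        return json.dumps({"similar_claims_found": 0})

    avg_settlement = sum(m["settlement_amount"] for m in metadatas) / len(metadatas)

    return json.dumps({
        "similar_claims_found": len(metadatas),
        "average_settlement": round(avg_settlement, 2),
        "common_severity": "medium"  # hard-coded placeholder
    })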

✅ Why This Pattern Works

On day 1 with zero historical claims, the LLM relies on its training data. By day 100 with 500 completed claims, it searches memory and finds: "Similar 2020 Honda Civics averaged $4,200". The system gets smarter over time.

Confidence Gate

If confidence < 0.60, the pipeline pauses for HITL review. Common reasons:

Damage description is vague or contradictory
No similar claims found in memory
Estimate significantly differs from claimed amount
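
The gate itself can be a small lookup, sketched here under the assumption that each agent writes its score into confidence_scores under its own name. The thresholds are copied from the agent table at the top of this guide; the function name needs_hitl and the agent keys are hypothetical.

```python
# Thresholds copied from the agent table at the top of this guide
CONFIDENCE_THRESHOLDS = {
    "intake": 0.55,
    "fraud": 0.50,
    "damage": 0.60,
    "policy": 0.60,
    "settlement": 0.65,
}


def needs_hitl(agent: str, confidence_scores: dict) -> bool:
    """True when the agent's confidence is missing or below its threshold."""
    threshold = CONFIDENCE_THRESHOLDS[agent]
    confidence = confidence_scores.get(agent)
    # A missing score is treated as uncertain (assumption: fail toward review)
    return confidence is None or confidence < threshold
```

Keeping the thresholds in one table makes them easy to audit and to tune per agent.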

Output Structure

{
    "llm_assessment": {
        "severity": "medium",
        "raw_cost_estimate": 4200.0,
        "confidence": 0.88,
        "reasoning": "Front-end collision with bumper and hood damage...",
        "vehicle_age_years": 2
    },
    "after_depreciation": {
        "method": "year_based",
        "depreciation_rate": 0.15,
        "depreciation_amount": 630.0,
        "final_cost": 3570.0
    }
}

Agent 4: Policy Checker

Role: Determine if the claim is covered under the policy. Check exclusions, apply deductible, calculate maximum payout.

The Pattern: Python Fetches Facts, LLM Reasons

Policy rules are structured data (coverage limits, deductibles, exclusions). Python fetches those facts from the database. The LLM then applies judgment:

"Does this flood damage claim trigger the water damage exclusion?"
"Is pre-existing damage clearly documented?"
"Should this edge case be escalated?"

Tool 1: Get Policy Details

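@tool("get_policy_details")
def get_policy_details(policy_number: str, claim_type: str) -> str:
    """Return limits, deductible, exclusions, and prior claims for a policy."""
    policy = POLICY_DB.get(policy_number)
    if policy is None:
        return json.dumps({"error": f"policy {policy_number} not found"})

    return json.dumps({
        "coverage_limit": policy["limits"].get(claim_type, 0),
        "deductible": policy["deductibles"].get(claim_type, 0),
        "exclusions": policy["exclusions"],
        # Same "prior_claims" key that check_remaining_coverage reads
        "prior_claims": policy["prior_claims"],
        "total_prior_paid": sum(c["amount"] for c in policy["prior_claims"])
    })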

Tool 2: Check Remaining Coverage

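@tool("check_remaining_coverage")
def check_remaining_coverage(policy_number: str, 
                              claim_type: str,
                              damage_estimate: float) -> str:
    """Compute remaining coverage and the maximum payout for a damage estimate."""
    policy = POLICY_DB.get(policy_number)
    if policy is None:
        return json.dumps({"error": f"policy {policy_number} not found"})

    limit = policy["limits"].get(claim_type, 0)
    deductible = policy["deductibles"].get(claim_type, 0)
    prior_paid = sum(c["amount"] for c in policy["prior_claims"])

    # Clamp at zero so an exhausted limit cannot produce a negative payout
    remaining_limit = max(0, limit - prior_paid)
    payable_damage = max(0, damage_estimate - deductible)
    max_payout = min(payable_damage, remaining_limit)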
    
    return json.dumps({
        "remaining_limit": remaining_limit,
        "deductible": deductible,
        "max_payout": round(max_payout, 2),
        "coverage_exhausted": remaining_limit <= 0
    })

LLM Reasoning Prompt

The system prompt explicitly guides exclusion reasoning:

Exclusion Reasoning Rules

Apply exclusions strictly but fairly
"flood_damage" exclusion: only applies if primary cause is flooding
"pre_existing_damage": only if damage clearly predates the incident
When in doubt, flag for human review rather than denying

This creates a bias toward the claimant — wrongful denial is worse than human review.

Routing Logic

Condition	Next Node
Ineligible (exclusion hit)	Communication Agent (denial)
Confidence < 0.60	HITL Checkpoint
Eligible	Settlement Calculator
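
The table above could be sketched as a routing function like the one below. The name route_after_policy and the state keys are assumptions modelled on the other routers in this guide; the conditions are checked in the same order as the table rows.

```python
def route_after_policy(state: dict) -> str:
    """Return the next node name based on the Policy Checker output."""
    policy_out = state.get("agent_outputs", {}).get("policy", {})
    confidence = state.get("confidence_scores", {}).get("policy", 0.0)

    if not policy_out.get("eligible", False):
        return "communication"  # denial, with the exclusion cited
    if confidence < 0.60:
        return "hitl"           # uncertain, pause for review
    return "settlement"
```

Given the rule "when in doubt, flag for human review rather than denying", a stricter variant might check confidence first so that a low-confidence ineligibility finding also goes to a reviewer.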

Output Structure

{
    "eligible": true,
    "max_payout": 8000.0,
    "deductible": 500.0,
    "exclusions_hit": [],
    "coverage_limit": 15000.0,
    "remaining_coverage": 13800.0,
    "confidence": 0.85,
    "reasoning": "Claim falls within auto coverage. No exclusions apply...",
    "flags": []
}

⚠️ Critical: Cite Specific Policy Clauses

When denying for an exclusion, the LLM must cite the exact policy section:

"Your policy (POL-001) includes a flood damage exclusion under Section 4.2(c). The incident was caused by rising floodwater, which falls under this exclusion."

This is both a UX requirement and a legal one in insurance.

Agent 5: Settlement Calculator

Role: Compute the final settlement amount using country-specific formulas. The LLM validates inputs; Python does the math.

Settlement Formulas

US: Damage-Based (115% Cap)

settlement = min(
    damage_after_depreciation,
    damage_after_depreciation * 1.15,  # 115% buffer
    remaining_coverage_limit
) - deductible

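The 115% cap is meant as a small over-assessment buffer for unforeseen costs. Note that, as written, the 115% term can never be the smallest of the three, because it always exceeds damage_after_depreciation itself. The cap only takes effect when it is compared against a separately sourced figure, such as the claimed amount or a repair invoice.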

India: IDV-Based (100% Cap)

settlement = min(
    idv,  # Insured Declared Value
    remaining_coverage_limit
) - deductible

No buffer. IRDAI regulations cap at exactly 100% of IDV.
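
As plain Python, the two formulas might look like the sketch below. The function names are hypothetical, and flooring the result at zero is an assumption that the formulas above do not state; without it, a deductible larger than the capped amount would yield a negative settlement.

```python
def settle_us(damage_after_depreciation: float,
              remaining_coverage_limit: float,
              deductible: float) -> float:
    # Mirrors the US formula above term by term
    capped = min(
        damage_after_depreciation,
        damage_after_depreciation * 1.15,
        remaining_coverage_limit,
    )
    # Flooring at zero is an assumption, not stated in the source
    return round(max(0.0, capped - deductible), 2)


def settle_india(idv: float,
                 remaining_coverage_limit: float,
                 deductible: float) -> float:
    # Mirrors the India formula above: no buffer, capped at IDV
    capped = min(idv, remaining_coverage_limit)
    return round(max(0.0, capped - deductible), 2)
```

For the running example, settle_us(3570.0, 13800.0, 500.0) gives the $3,070.00 shown in the approval message later in this guide.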

LLM's Role: Validator, Not Calculator

By this point in the pipeline, you have structured numbers from prior agents. The LLM's job flips from reasoner to validator:

What the LLM Checks

Are the inputs from prior agents consistent?
Does damage estimate vs claim amount make sense?
Is max_payout realistic given policy limits?
Should any inputs be overridden due to obvious errors?

Tool: Verify Consistency

@tool("verify_settlement_consistency")
def verify_settlement_consistency(damage_estimate: float,
                                   policy_max_payout: float,
                                   claim_amount: float) -> str:
    ratio = damage_estimate / claim_amount if claim_amount > 0 else 0
    flags = []
    
    if ratio > 1.5:
        flags.append("damage_estimate_exceeds_claim_by_50pct")
    if ratio < 0.2:
        flags.append("damage_estimate_very_low_vs_claim")
    if damage_estimate > policy_max_payout * 2:
        flags.append("damage_far_exceeds_coverage")
    
    return json.dumps({
        "damage_to_claim_ratio": round(ratio, 2),
        "flags": flags,
        "looks_consistent": len(flags) == 0
    })

The Override Mechanism

If the LLM spots that a prior agent's output is clearly wrong, it can set override_damage_estimate before the Python formula runs:

{
    "inputs_validated": false,
    "confidence": 0.45,
    "reasoning": "Damage estimate of $50,000 for minor scratch is clearly wrong...",
    "flags": ["damage_estimate_unrealistic"],
    "override_damage_estimate": 800.0  # LLM's corrected estimate
}

This is a rare but important escape hatch for obvious errors.
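
A sketch of how the override might be applied before the formula runs. The helper name is hypothetical; the only field it relies on is override_damage_estimate from the output above.

```python
def effective_damage_estimate(damage_estimate: float, validation: dict) -> float:
    """Pick the figure the settlement formula should use."""
    override = validation.get("override_damage_estimate")
    # Compare against None so a legitimate override of 0.0 is not ignored
    if override is not None:
        return float(override)
    return damage_estimate
```

Because an override changes a financial input, it is reasonable to log both figures and let the low confidence that accompanies it route the claim to HITL.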

Confidence Threshold: 0.65 (Higher Than Upstream Agents)

Settlement is the last financial decision before payout. It demands more certainty. If confidence < 0.65, route to HITL.

✅ Best Practice: Show Your Work

The output should include a clear breakdown so the claimant (and auditors) can verify the math:

{
    "method": "damage_based_us",
    "damage_estimate": 4200.0,
    "damage_cap_115": 4830.0,
    "deductible": 500.0,
    "settlement_amount": 3700.0,
    "currency": "USD"
}

Agent 6: Evaluator (LLM-as-Judge)

Role: Quality gate. A separate LLM grades the entire pipeline's output on 5 dimensions before the claim is released.

Why This Agent Exists

Without systematic evaluation, you have no way to know if your pipeline is producing good outputs. Manual spot-checking doesn't scale.

The evaluator runs on every claim (or a sample) and gives you quantitative scores you can track over time. If the composite score drops from 0.85 to 0.72 over a week, you know something broke.

The 5 Scoring Dimensions

Dimension	What It Measures	Maps To
Accuracy	Are the numbers correct? Does settlement = damage - deductible?	Auditability
Completeness	Were all relevant factors considered? Any missing steps?	Thoroughness
Fairness	No demographic bias? Similar claims treated similarly?	Anti-discrimination laws
Safety	No harmful recommendations? Protects claimant interests?	Consumer protection
Transparency	Is reasoning clear? Can claimant understand the decision?	Explainability mandates

Each dimension gets a score 0.0–1.0. Composite score = average of 5 dimensions.

The Quality Gate

Pass Threshold: 0.70

If composite score ≥ 0.70 → Continue to Communication Agent

If composite score < 0.70 → Pause for HITL review
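
The composite and the gate are simple enough to compute in code rather than asking the LLM for them. A minimal sketch, assuming the five dimension scores arrive as a dict; the function name grade is hypothetical.

```python
DIMENSIONS = ("accuracy", "completeness", "fairness", "safety", "transparency")
PASS_THRESHOLD = 0.70


def grade(scores: dict) -> dict:
    """Average the five dimension scores and apply the pass threshold."""
    raw = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    # Gate on the unrounded value; round only for display
    return {"composite_score": round(raw, 2), "passed": raw >= PASS_THRESHOLD}


result = grade({
    "accuracy": 0.90, "completeness": 0.85, "fairness": 0.92,
    "safety": 0.95, "transparency": 0.78,
})
```

This follows Principle 1: the LLM supplies the per-dimension judgments, and Python does the averaging.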

Use a Different LLM

Critical pattern: the evaluator uses a different model from the pipeline agents:

from langchain_google_genai import ChatGoogleGenerativeAI

# Pipeline agents use this
pipeline_llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

# Evaluator uses this (different model, lower temperature)
evaluator_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash-lite",
    temperature=0.3  # lower temp for consistency
)

Why? Avoids bias where a model grades its own outputs favorably. The evaluator should be independent.

Lower Temperature for Consistency

The evaluator runs at temperature=0.3 (vs 0.7–1.0 for reasoning agents). You want consistent scoring: two identical claims should receive nearly the same evaluation score. A lower temperature reduces run-to-run variance but does not eliminate it.

The Evaluator Is Read-Only

The evaluator never modifies the settlement amount, fraud score, or any decision. It just says "this looks good" or "this needs human review."

Outputs from Agents 1-5 flow through unchanged. This separation of concerns makes the system auditable.

Batch Sampling for Ongoing Monitoring

In the case study, the evaluator also runs as a batch job on 10% of auto-processed claims. Over time you track:

"Are our Transparency scores dropping?"
"Is Fairness degrading?"
"Did the Accuracy score change after we updated the depreciation logic?"

This is how you detect model drift before customers complain.
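
One simple way to turn those questions into an alert is to compare a recent window of scores against a baseline window. The sketch below is an assumption about how this could be done, not the case study's method; max_drop is a hypothetical threshold.

```python
from statistics import mean


def score_drop(baseline: list[float], recent: list[float]) -> float:
    """How far the recent mean sits below the baseline mean (0 if not lower)."""
    return max(0.0, mean(baseline) - mean(recent))


def drifted(baseline: list[float], recent: list[float], max_drop: float = 0.10) -> bool:
    # max_drop is a hypothetical alert threshold; tune it per dimension
    return score_drop(baseline, recent) > max_drop
```

With the earlier example of a composite falling from 0.85 to 0.72, the drop of 0.13 exceeds the 0.10 threshold and the check fires.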

Output Structure

{
    "accuracy": 0.90,
    "completeness": 0.85,
    "fairness": 0.92,
    "safety": 0.95,
    "transparency": 0.78,
    "composite_score": 0.88,
    "passed": true,
    "critical_flags": [],
    "reasoning": "All numbers check out. Reasoning is clear..."
}

Agent 7: Communication Agent

Role: Final agent. Generates the message sent to the claimant. Every pipeline path ends here.

Message Requirements by Outcome

Outcome	Message Must Include
Approved	Settlement amount, breakdown (damage - deductible = payout), payment timeline, next steps
Denied	Specific reason (never generic), policy clause cited, appeal rights, timeline
HITL Pending	Review in progress, expected timeline, reviewer contact info
Auto-Rejected (Fraud)	Specific red flags (not accusatory), investigation notice, appeal process

The Golden Rule: Never Generic Denials

❌ Bad Denial Message

"Your claim has been denied."

✅ Good Denial Message

"Your policy (POL-001) includes a flood damage exclusion under Section 4.2(c). The incident on June 15 was caused by rising floodwater from the nearby river, which falls under this exclusion. Water damage from internal sources (burst pipes, appliance leaks) would be covered, but external flooding is specifically excluded."

The LLM extracts the actual reason from the pipeline trace and explains it clearly. This is both a UX and legal requirement in insurance.

Country-Specific Regulatory Footers

Every message gets a regulatory footer appended (not optional):

REGULATORY_FOOTERS = {
    "us": """
---
If you disagree with this decision, you have the right to appeal.
Contact your State Insurance Commissioner:
https://content.naic.org/state-insurance-departments

For questions, call 1-800-CLAIMS-1 (M-F 9am-5pm ET)
Reference claim ID: {claim_id}
""",
    "india": """
---
यदि आप इस निर्णय से असहमत हैं, तो आपको अपील करने का अधिकार है।
Contact IRDAI Grievance Redressal:
https://www.irdai.gov.in

For questions, call 1800-425-4732 (M-F 10am-6pm IST)
Reference claim ID: {claim_id}
"""
}
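
Appending the footer could be sketched as below. FOOTERS is a shortened stand-in for REGULATORY_FOOTERS so the example runs on its own, and with_footer is a hypothetical helper name. It raises for an unknown country rather than sending a message without a footer.

```python
# Shortened stand-ins for the REGULATORY_FOOTERS entries above
FOOTERS = {
    "us": "\n---\nYou have the right to appeal.\nReference claim ID: {claim_id}\n",
    "india": "\n---\nContact IRDAI Grievance Redressal.\nReference claim ID: {claim_id}\n",
}


def with_footer(message: str, country: str, claim_id: str) -> str:
    """Append the country's footer; raise rather than send without one."""
    if country not in FOOTERS:
        raise ValueError(f"No regulatory footer configured for {country!r}")
    return message.rstrip() + "\n" + FOOTERS[country].format(claim_id=claim_id)
```

Doing this in Python after the LLM call, rather than asking the LLM to include the footer, keeps the legally required text out of the model's hands.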

Tone Adaptation

The system prompt tells the LLM explicitly:

Approved: Positive, clear about next steps
Denied: Empathetic but firm, always explain reason
HITL: Reassuring, set clear expectations
Fraud: Formal, serious, explain appeal rights

This prevents inappropriate tone (e.g., being cheerful about a denial).

Example: Approval Message

Dear John Doe,

Good news! Your auto claim (CLM-2024-001) has been approved.

Claim breakdown:
- Damage assessment:  $4,200.00
- Depreciation (15%): -$630.00
- Adjusted damage:     $3,570.00
- Deductible:          -$500.00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Settlement amount:     $3,070.00

Payment will be processed within 3-5 business days via direct
deposit. You'll receive confirmation once complete.

Next steps:
1. Schedule repairs with your preferred shop
2. Forward final invoice to claims@insurance.com
3. Keep your claim ID handy: CLM-2024-001

---
[Regulatory footer]

Example: Fraud Rejection Message

Dear John Doe,

Your claim (CLM-2024-089) has been flagged and cannot be processed.

Our fraud detection identified these concerns:
- Claim matches a known staged accident pattern
- Repair estimate is 280% above benchmark for this damage
- This is your third claim in 6 months across two providers

Investigation notice:
Your claim has been referred to our Special Investigations Unit.
You will be contacted within 10 business days.

You have the right to provide additional documentation.
Contact: fraud-review@insurance.com

---
[Regulatory footer]

LangGraph Orchestration

LangGraph is the glue that connects all 7 agents into a working pipeline. It turns a collection of functions into a state machine with durable checkpoints and conditional routing.

Core LangGraph Concepts

Concept	What It Does
StateGraph	Defines nodes (agents) and edges (connections)
Nodes	Python functions: `(state) → updated_state`
Edges	Connections between nodes, can be conditional
State (TypedDict)	Shared data structure flowing through pipeline
Checkpoints	Durable state snapshots (survives crashes)
interrupt()	Pauses execution for HITL, resume later

The State Schema

Every agent reads from and writes to this shared state:

from typing import TypedDict

# Note: the "str | None" annotation below requires Python 3.10+
class ClaimState(TypedDict):
    # Input data
    claim_id: str
    claimant_name: str
    policy_number: str
    claim_type: str
    incident_date: str
    claim_amount: float
    description: str
    country: str
    
    # Intake agent
    masked_data: dict
    intake_passed: bool
    intake_denial_reason: str | None
    
    # Fraud agent
    fraud_score: float
    
    # Routing
    pipeline_path: str  # "normal" | "hitl" | "auto_reject" | "invalid" | "fast"
    
    # All agent outputs
    agent_outputs: dict  # {agent_name: output_dict}
    confidence_scores: dict  # {agent_name: confidence_float}
    
    # HITL
    hitl_ticket: dict
    pipeline_status: str

Building the Graph

from langgraph.graph import StateGraph

workflow = StateGraph(ClaimState)

# Add all nodes
workflow.add_node("intake", intake_agent_node)
workflow.add_node("fraud", fraud_detection_node)
workflow.add_node("damage", damage_assessor_node)
workflow.add_node("policy", policy_checker_node)
workflow.add_node("settlement", settlement_calculator_node)
workflow.add_node("evaluator", evaluator_node)
workflow.add_node("communication", communication_agent_node)
workflow.add_node("hitl", hitl_checkpoint_node)

# Set entry point
workflow.set_entry_point("intake")

Conditional Edges (The Routing Logic)

Every conditional edge is driven by a function that reads state and returns the name of the next node:

def route_after_fraud(state: ClaimState) -> str:
    score = state.get("fraud_score", 0.0)
    path = state.get("pipeline_path", "normal")
    
    if score >= 0.90:
        return "communication"  # auto-reject
    elif score >= 0.45:
        return "hitl"           # pause for review
    elif path == "fast":
        return "settlement"     # skip damage + policy
    else:
        return "damage"         # normal path

workflow.add_conditional_edges(
    "fraud",
    route_after_fraud,
    {
        "damage": "damage",
        "settlement": "settlement",
        "communication": "communication",
        "hitl": "hitl"
    }
)

The 5 Pipeline Paths (Emergent Behavior)

There's no explicit "path A" or "path B" code. The paths emerge from routing logic:

Path A: Normal

intake → fraud → damage → policy → settlement → evaluator → communication

All 7 agents run. Fraud < 0.45, all confidence gates pass.

Path B: HITL

any_agent → hitl → [PAUSE] → resume → next_agent → communication

Triggers when: fraud score at least 0.45 but below 0.90, confidence < threshold, or eval < 0.70.

Path C: Auto-Reject

intake → fraud (≥0.90) → communication

Confirmed fraud. Bypasses all other agents.

Path D: Invalid

intake [FAIL] → communication

Missing fields, lapsed policy, or date out of range.

Path E: Fast Mode

intake → fraud → settlement → evaluator → communication

Amount < $500. Skips damage and policy checks.

Durable Checkpoints with SqliteSaver

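import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver

# Recent langgraph-checkpoint-sqlite releases expose from_conn_string() as a
# context manager, so a long-lived saver is built from a connection directly.
conn = sqlite3.connect("claims_checkpoints.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)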

graph = workflow.compile(
    checkpointer=checkpointer,
    interrupt_before=["hitl"]  # pause before HITL node
)

What this gives you:

If server crashes mid-claim, state is in SQLite — resume on restart
HITL pauses can last hours/days — state never lost
thread_id = claim_id — each claim has its own checkpoint

HITL Interrupt/Resume Flow

Processing a Claim (First Time)

def process_claim(claim_data: dict) -> dict:
    config = {"configurable": {"thread_id": claim_data["claim_id"]}}
    result = graph.invoke(claim_data, config)
    return result

Resuming After HITL Review

from langgraph.types import Command

def resume_hitl_claim(claim_id: str, reviewer_decision: dict) -> dict:
    config = {"configurable": {"thread_id": claim_id}}
    
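    # Command(resume=...) hands reviewer_decision back as the return value of
    # interrupt() inside the hitl node. If the graph pauses only through
    # interrupt_before, resume with graph.invoke(None, config) instead.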
    result = graph.invoke(
        Command(resume=reviewer_decision),
        config
    )
    return result

Key Design Decisions

✅ Why LangGraph Over LangChain AgentExecutor?

LangChain's AgentExecutor is great for simple tool-calling loops but falls apart for complex workflows with:

Conditional routing (fraud score determines path)
Parallel execution (future: run damage + policy concurrently)
Durable state (checkpoints)
Multiple HITL gates

LangGraph gives you explicit control over the graph topology. You define exactly which node runs after which, under what conditions.

Memory System (3-Tier Architecture)

Memory is what lets agents learn from past claims instead of treating every claim as the first one they've ever seen.

The 3 Tiers

Tier	Storage	What It Holds	Lifetime
1. Short-Term	LangGraph State	Current claim data	One execution (+ checkpoint)
2. Long-Term	ChromaDB collection	All completed claim outcomes	Permanent (7-year audit)
3. Episodic	ChromaDB collection	Human overrides, fraud cases	Permanent (learning data)

Tier 1: Short-Term Memory

This is just the ClaimState TypedDict flowing through the pipeline. Every agent reads/writes to it. Exists only during processing (plus checkpoint persistence for HITL pauses).

Tier 2: Long-Term Memory (Historical Outcomes)

What Gets Stored

{
    "claim_id": "CLM-2023-045",
    "claim_type": "auto",
    "vehicle_type": "sedan",
    "damage_type": "rear_end_collision",
    "incident_description": "Rear-ended at stoplight...",
    "final_settlement": 3800,
    "damage_severity": "medium",
    "fraud_score": 0.12,
    "decision": "approved",
    "timestamp": "2023-06-15T14:23:00Z"
}

How Agents Use It

Damage Assessor: "Similar Honda Civic collisions averaged $4,200"
Settlement Calculator: "15% depreciation claims resulted in $3,500–$4,000 payouts"
Evaluator: "This decision is consistent with 8 similar past approvals"

Tier 3: Episodic Memory (Special Events)

What Gets Stored

Human overrides: Reviewer changed an AI decision
Confirmed fraud: Investigation proved fraud
Quality failures: Evaluator score very low
Appeals won: Original decision was wrong

{
    "episode_type": "fraud_confirmed",
    "claim_id": "CLM-2023-089",
    "description": "Staged accident, phantom passenger scheme",
    "fraud_indicators": [
        "3_claims_in_6_months",
        "repair_shop_flagged",
        "witness_inconsistent"
    ],
    "outcome": "denied_after_investigation",
    "lesson": "Multiple claims + flagged shop = high fraud risk"
}

How Agents Use It

Fraud Crew: "Has this pattern appeared before? What was the outcome?"
Policy Checker: "Was this exclusion overridden by a reviewer in the past?"
All Agents: "Did past decisions like mine get corrected by humans?"

Technical Implementation

Embedding Model (Runs Locally)

from typing import List

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# 80MB, runs on CPU in ~50ms, no API calls

def embed_text(text: str) -> List[float]:
    return embedding_model.encode(text).tolist()

ChromaDB Setup

import chromadb

client = chromadb.PersistentClient(path="./chromadb_data")

long_term_collection = client.get_or_create_collection(
    name="long_term_claims"
)

episodic_collection = client.get_or_create_collection(
    name="episodic_memory"
)

Storing a Claim Outcome

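def store_claim_outcome(claim_state: dict) -> None:
    # Assumption: per-agent outputs live under agent_outputs, keyed by agent name
    damage_out = claim_state["agent_outputs"].get("damage", {})
    settlement_out = claim_state["agent_outputs"].get("settlement", {})

    description = f"""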
    Claim type: {claim_state['claim_type']}
    Incident: {claim_state['masked_data']['description']}
    Severity: {damage_out.get('severity')}
    Settlement: ${settlement_out.get('amount')}
    """
    
    long_term_collection.add(
        ids=[claim_state["claim_id"]],
        embeddings=[embed_text(description)],
        documents=[description],
        metadatas=[{
            "claim_type": claim_state["claim_type"],
            "settlement_amount": settlement_out["amount"],
            "fraud_score": claim_state["fraud_score"]
        }]
    )

Searching Similar Claims (Tool)

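@tool("search_similar_claims")
def search_similar_claims(query: str, claim_type: str | None = None) -> str:
    """Semantic search over completed claims, optionally filtered by type."""
    where_filter = {"claim_type": claim_type} if claim_type else None
    
    results = long_term_collection.query(
        query_embeddings=[embed_text(query)],
        n_results=5,
        where=where_filter
    )
    
    # query() returns parallel lists, one inner list per query embedding
    metadatas = results["metadatas"][0]
    if not metadatas:
        # Nothing stored yet: avoid dividing by zero
        return json.dumps({"similar_claims_found": 0, "examples": []})
    
    avg_settlement = sum(
        m["settlement_amount"] for m in metadatas
    ) / len(metadatas)
    
    return json.dumps({
        "similar_claims_found": len(metadatas),
        "average_settlement": round(avg_settlement, 2),
        # The original elides this list; returning the matched documents
        # is an assumption about what it held
        "examples": results["documents"][0]
    })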

How Memory Improves Over Time

The Compounding Effect

Day 1: Zero historical claims. Damage Assessor relies on LLM training data only.

Day 100: 500 completed claims. The Damage Assessor searches memory and finds: "For 2020 Honda Civics with front-end damage, we settled 12 similar claims averaging $4,150."

Day 1000: 5000+ claims. Most new claims have close historical matches, so estimates can be anchored to past settlements rather than to LLM priors alone.

Why Semantic Search Matters

Embeddings enable semantic similarity, not just keyword matching:

Stored claim: "Front-end collision, Honda Civic, bumper damage"

Search query: "Rear impact on sedan"

Result: The two embeddings land close together (both describe a collision involving a sedan), so ChromaDB returns the stored claim even though the two texts share no keywords.

A keyword search would miss this connection; vector search captures it.

The Feedback Loop

When a reviewer overrides an AI decision, that gets stored in episodic memory. Next time a similar claim comes through:

Agent searches: "Settlement calculation for flood damage"
Episodic memory returns: "Last time I suggested denial for flood, 
reviewer approved it because the damage was from a burst pipe, 
not external flooding."

Agent adjusts confidence: 0.85 → 0.60 (triggers HITL)

Over time, the system learns what kinds of overrides happen and adjusts behavior.
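
The confidence adjustment described above could look roughly like the sketch below. It is illustrative only: the `OverrideRecord` shape, the similarity cutoff, and the per-override penalty are all assumptions, not values from the Smart Claims Processor. With these assumed numbers, one relevant override moves 0.85 to 0.60, matching the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OverrideRecord:
    """One past reviewer override retrieved from episodic memory (hypothetical shape)."""
    summary: str
    similarity: float  # 0.0-1.0, higher means closer to the current claim


def adjust_confidence(
    base_confidence: float,
    overrides: List[OverrideRecord],
    min_similarity: float = 0.75,        # assumed cutoff for "similar enough"
    penalty_per_override: float = 0.25,  # assumed penalty; tune on real data
) -> float:
    """Lower confidence once for each sufficiently similar past override."""
    relevant = [o for o in overrides if o.similarity >= min_similarity]
    adjusted = base_confidence - penalty_per_override * len(relevant)
    return round(max(adjusted, 0.0), 2)


past = [
    OverrideRecord("Suggested denial for flood; reviewer approved (burst pipe)", 0.88),
    OverrideRecord("Unrelated windshield override", 0.40),
]
new_confidence = adjust_confidence(0.85, past)
print(new_confidence)
```

A lowered score only matters because it is compared against the agent's threshold afterwards; 0.60 would fall below, for example, the Settlement threshold of 0.65 and pause the claim for review.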

Production Best Practices

1. Guardrails Are Not Optional

Guardrail	Default	Why It Matters
Max agent calls	25	Prevents infinite loops
Max tokens	50,000	Caps LLM usage per claim
Max cost	$0.50	Hard dollar limit
Max execution time	300s	Timeout for pipeline
Min confidence	0.60	Forces HITL if uncertain

All caps are configurable via environment variables. When a cap is hit, route the claim to HITL with a clear reason, for example "guardrail: token limit exceeded".
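
As a minimal sketch of how such caps might be enforced, the class below reads each limit from an environment variable and returns a reason string when one is exceeded. The variable names (`MAX_AGENT_CALLS` and so on) and the `Guardrails` class itself are assumptions for illustration; only the default values come from the table above. The minimum-confidence guardrail is left out because it is checked per agent rather than per pipeline run.

```python
import os
from dataclasses import dataclass, field
from typing import Optional


def _env_float(name: str, default: float) -> float:
    """Read a numeric cap from the environment, falling back to the default."""
    return float(os.environ.get(name, default))


@dataclass
class Guardrails:
    # Environment variable names are hypothetical, not the project's actual ones.
    max_agent_calls: float = field(default_factory=lambda: _env_float("MAX_AGENT_CALLS", 25))
    max_tokens: float = field(default_factory=lambda: _env_float("MAX_TOKENS", 50_000))
    max_cost_usd: float = field(default_factory=lambda: _env_float("MAX_COST_USD", 0.50))
    max_seconds: float = field(default_factory=lambda: _env_float("MAX_EXECUTION_SECONDS", 300))

    def check(self, agent_calls: int, tokens: int, cost_usd: float, elapsed_s: float) -> Optional[str]:
        """Return a HITL reason if any cap is exceeded, otherwise None."""
        if agent_calls > self.max_agent_calls:
            return "guardrail: agent call limit exceeded"
        if tokens > self.max_tokens:
            return "guardrail: token limit exceeded"
        if cost_usd > self.max_cost_usd:
            return "guardrail: cost limit exceeded"
        if elapsed_s > self.max_seconds:
            return "guardrail: execution time exceeded"
        return None


limits = Guardrails()
reason = limits.check(agent_calls=12, tokens=61_000, cost_usd=0.31, elapsed_s=42.0)
if reason is not None:
    print(f"Routing to HITL: {reason}")
```

Calling `check` after every agent step, rather than once at the end, keeps a runaway loop from spending past the cap before anyone notices.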

2. PII Masking Happens at the Edge

Critical Security Rule

The Intake Agent is the only agent that sees raw PII. Every downstream agent works with masked data. This creates a clear security boundary.

If an LLM logs or leaks data, it's already masked — no SSNs or Aadhaar numbers in the logs.
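
To make the boundary concrete, here is a small masking sketch. The regular expressions are simplified assumptions (real SSN and Aadhaar formats have more variants, and the actual Intake Agent may mask differently); the point is only that masking runs before any text reaches an LLM.

```python
import re

# Simplified patterns for illustration; production rules need more variants.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
AADHAAR_PATTERN = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def mask_pii(text: str) -> str:
    """Replace SSN- and Aadhaar-shaped numbers with fixed placeholders."""
    text = SSN_PATTERN.sub("[SSN_MASKED]", text)
    text = AADHAAR_PATTERN.sub("[AADHAAR_MASKED]", text)
    return text


raw = "Claimant SSN 123-45-6789, Aadhaar 1234 5678 9012, hit on 2024-03-01."
print(mask_pii(raw))
# Claimant SSN [SSN_MASKED], Aadhaar [AADHAAR_MASKED], hit on 2024-03-01.
```

Note that the incident date survives untouched, which matters because downstream agents still need it.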

3. Confidence Thresholds Increase Over Pipeline

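The thresholds generally rise as the pipeline gets closer to the payout decision. The one exception is Fraud (0.50), which sits below Intake (0.55) because later stages can still catch what it misses: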

Agent	Threshold	Reasoning
Intake	0.55	Mostly rules, low uncertainty
Fraud	0.50	Early in pipeline, can catch later
Damage	0.60	Financial estimate, needs confidence
Policy	0.60	Exclusions are serious
Settlement	0.65	Last financial decision
Evaluator	0.70	Final quality gate

4. Audit Trails for Regulatory Compliance

Insurance is heavily regulated. Every agent action gets logged with:

Timestamp
Input/output summaries
SHA256 hash for integrity verification
7-year retention period

Auditors can trace any decision back through the exact chain of agent actions.
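
One possible shape for such a log entry is sketched below. The field names and the `prev_hash` chaining are assumptions added for illustration; the source only states that each entry carries a timestamp, summaries, and a SHA256 hash.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(entry: dict) -> str:
    """SHA256 over a canonical JSON form of every field except the hash itself."""
    payload = {k: v for k, v in entry.items() if k != "hash"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_audit_entry(agent: str, input_summary: str, output_summary: str,
                     prev_hash: str = "") -> dict:
    """Build one audit record; prev_hash links it to the previous record."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "input_summary": input_summary,
        "output_summary": output_summary,
        "prev_hash": prev_hash,  # assumption: chain entries so reordering is detectable
    }
    entry["hash"] = _digest(entry)
    return entry


def verify_entry(entry: dict) -> bool:
    """Recompute the hash and compare it with the stored one."""
    return entry.get("hash") == _digest(entry)


first = make_audit_entry("intake", "raw claim form", "validated, PII masked")
second = make_audit_entry("fraud", "masked claim", "fraud_score=0.23", prev_hash=first["hash"])
print(verify_entry(first), verify_entry(second))
```

Sorting keys before hashing matters: without a canonical serialization, the same record could produce different hashes on different machines.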

5. Structured Outputs Are King

Every agent returns structured data (JSON), not prose. This makes the output:

Parseable: No regex on LLM prose
Testable: Unit tests check schema
Composable: Next agent reads prior output directly

## Good (structured)
{
    "fraud_score": 0.23,
    "decision": "low_risk",
    "reasoning": "...",
    "flags": []
}

## Bad (prose)
"The fraud analysis indicates this claim appears legitimate 
with a low risk score of approximately 0.23..."
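
A schema check for the structured form above might look like the following sketch. It uses only the standard library; the `FraudResult` class and `parse_fraud_result` helper are hypothetical names, and a real project might prefer a validation library instead.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FraudResult:
    fraud_score: float
    decision: str
    reasoning: str
    flags: List[str]


def parse_fraud_result(raw: str) -> FraudResult:
    """Parse agent JSON and reject anything that does not match the schema."""
    data = json.loads(raw)
    result = FraudResult(**data)  # TypeError on missing or unexpected keys
    score = result.fraud_score
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("fraud_score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("fraud_score must be between 0.0 and 1.0")
    if not isinstance(result.flags, list):
        raise ValueError("flags must be a list")
    return result


good = '{"fraud_score": 0.23, "decision": "low_risk", "reasoning": "...", "flags": []}'
parsed = parse_fraud_result(good)
print(parsed.decision, parsed.fraud_score)
```

Because the parser raises on malformed output, a unit test can feed it known-bad payloads and assert the failure, which is exactly the "testable" property listed above.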

6. Multiple HITL Gates Beat One

Don't just have one HITL gate at the end. Have multiple checkpoints throughout:

Fraud score 0.45–0.90
Any agent confidence < threshold
High-value claims (> $10K / Rs 5L)
Evaluator score < 0.70

This creates a safety net where uncertainty triggers oversight, not just end-of-pipeline checks.
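
The four gates can be expressed as one check that runs after each agent. This is a sketch under assumptions: the state keys (`fraud_score`, `agent_confidence`, `claim_amount`, `evaluator_score`) are hypothetical, and only the USD high-value limit is shown. The numeric thresholds are the ones given in this guide.

```python
from typing import Dict, List

# Per-agent confidence thresholds from the table in this guide.
CONFIDENCE_THRESHOLDS: Dict[str, float] = {
    "intake": 0.55,
    "fraud": 0.50,
    "damage": 0.60,
    "policy": 0.60,
    "settlement": 0.65,
}
HIGH_VALUE_USD = 10_000
EVALUATOR_MIN = 0.70


def hitl_reasons(state: dict) -> List[str]:
    """Return every reason this claim should pause for human review."""
    reasons: List[str] = []

    fraud = state.get("fraud_score")
    # 0.90 and above is auto-rejected (Path C), so the HITL band stops short of it.
    if fraud is not None and 0.45 <= fraud < 0.90:
        reasons.append(f"fraud score {fraud:.2f} in review band")

    for agent, confidence in state.get("agent_confidence", {}).items():
        threshold = CONFIDENCE_THRESHOLDS.get(agent)
        if threshold is not None and confidence < threshold:
            reasons.append(f"{agent} confidence {confidence:.2f} below {threshold:.2f}")

    if state.get("claim_amount", 0) > HIGH_VALUE_USD:
        reasons.append("high-value claim")

    score = state.get("evaluator_score")
    if score is not None and score < EVALUATOR_MIN:
        reasons.append(f"evaluator score {score:.2f} below {EVALUATOR_MIN:.2f}")

    return reasons


state = {
    "fraud_score": 0.52,
    "agent_confidence": {"damage": 0.58, "policy": 0.80},
    "claim_amount": 12_500,
    "evaluator_score": 0.74,
}
for reason in hitl_reasons(state):
    print(reason)
```

Returning all reasons, not just the first, gives the reviewer the full picture of why the claim paused.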

7. Use Different LLMs for Different Roles

Use Case	Model Choice	Why
Pipeline agents	gemini-2.5-flash	Fast, reasoning-capable
Evaluator	gemini-2.5-flash-lite	Independent, lower temp (0.3)
Embeddings	all-MiniLM-L6-v2	Local, free, good quality

The evaluator runs on a different model so that no model grades its own output, which reduces self-preference bias.

8. Country-Aware Config Files

Don't hardcode country rules in Python. Use YAML files:

configs/
  base.yaml           # Shared settings
  countries/
    us.yaml           # USD, SSN/DL masking, year depreciation
    india.yaml        # INR, Aadhaar/PAN masking, IRDAI part-wise
    [future].yaml     # Add new countries without code changes

Adding a new country = creating a new YAML file. No code changes needed.
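
The loading step usually comes down to merging the country file over the shared base. The sketch below shows that merge on plain dictionaries standing in for the parsed YAML (in practice each would come from a YAML parser such as `yaml.safe_load`). All key names and the `deep_merge` helper are assumptions for illustration.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override values win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for configs/base.yaml (hypothetical keys)
base_config = {
    "guardrails": {"max_tokens": 50_000, "max_cost": 0.50},
    "hitl": {"evaluator_min": 0.70},
}

# Stand-in for configs/countries/us.yaml (hypothetical keys)
us_config = {
    "currency": "USD",
    "pii_masking": ["ssn", "drivers_license"],
    "depreciation_method": "yearly",
    "hitl": {"high_value_threshold": 10_000},
}

config = deep_merge(base_config, us_config)
print(config["currency"], config["hitl"])
```

A recursive merge lets a country file override a single nested key, such as the high-value threshold, without restating the rest of the `hitl` block.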

9. Token & Cost Tracking Per Claim

Use a LangChain callback handler to track tokens and cost:

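# Current import path; the older `langchain.callbacks` location is deprecated.
from langchain_community.callbacks import get_openai_callback

# Caveat: this callback only counts and prices OpenAI models. For other
# providers (such as the Gemini models used here), read token counts from
# each response's usage_metadata and apply your own price table.
with get_openai_callback() as cb:
    result = agent.invoke(state)

print(f"Tokens: {cb.total_tokens}")
print(f"Cost: ${cb.total_cost:.4f}")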

Store this in the claim record. If a claim type consistently costs 10x average, investigate.

10. Fast Mode for Low-Value Claims

Real-world optimization: below a threshold ($500), full investigation costs more than the claim. The system auto-detects these and processes in seconds instead of minutes by skipping Damage + Policy agents.

Evaluation & Monitoring

Why LLM-as-Judge Works

You can't manually review every claim decision. You need automated quality evaluation.

LLM-as-Judge uses a separate LLM to grade the pipeline's output on a fixed set of dimensions. This gives you quantitative scores you can track over time.

The 5 Evaluation Dimensions

1. Accuracy (0.0–1.0)

Are the numbers correct?
Does settlement = damage - deductible?
Was depreciation applied correctly?

2. Completeness (0.0–1.0)

Were all relevant factors considered?
Was depreciation applied? Exclusions checked?
Any missing steps in the chain?

3. Fairness (0.0–1.0)

No demographic bias in reasoning?
Similar claims getting similar treatment?
Claimant not penalized unfairly?

4. Safety (0.0–1.0)

No harmful recommendations?
Decision protects claimant's interests?
No dangerous advice given?

5. Transparency (0.0–1.0)

Is reasoning clear and traceable?
Could the claimant understand the decision?
Is logic documented at each step?
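
The five dimension scores feed the 0.70 composite gate mentioned earlier. How they are combined is not stated in this guide, so the sketch below assumes a plain unweighted mean; the function name and the validation rules are likewise illustrative.

```python
from statistics import mean
from typing import Dict

DIMENSIONS = ("accuracy", "completeness", "fairness", "safety", "transparency")
COMPOSITE_MIN = 0.70  # evaluator threshold from this guide


def composite_score(scores: Dict[str, float]) -> float:
    """Unweighted mean of the five dimensions (equal weighting is an assumption)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} must be between 0.0 and 1.0")
    return round(mean(scores[d] for d in DIMENSIONS), 3)


scores = {
    "accuracy": 0.90,
    "completeness": 0.80,
    "fairness": 0.85,
    "safety": 0.95,
    "transparency": 0.60,
}
total = composite_score(scores)
print(total, "HITL" if total < COMPOSITE_MIN else "pass")
```

A mean can hide one weak dimension (transparency is only 0.60 here while the composite still passes), which is a reason to track each dimension separately as well.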

Batch Sampling Strategy

Run the evaluator on:

100% of high-value claims (> $10K)
100% of denials
10% random sample of auto-approved claims

Track scores over time. If Transparency drops from 0.85 to 0.72, investigate.
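
The sampling rules above can be sketched as a single predicate. Hashing the claim ID, instead of calling a random number generator, is a design choice assumed here so that the same claim always gets the same answer on reruns; the function name, parameters, and the `"denied"` label are hypothetical.

```python
import hashlib


def should_evaluate(claim_id: str, amount_usd: float, decision: str,
                    sample_rate: float = 0.10) -> bool:
    """Decide whether a finished claim goes to the evaluator."""
    if amount_usd > 10_000:      # 100% of high-value claims
        return True
    if decision == "denied":     # 100% of denials
        return True
    # Deterministic ~10% sample: map the claim ID to a number in [0, 1).
    digest = hashlib.sha256(claim_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


print(should_evaluate("CLM-0001", 25_000, "approved"))
print(should_evaluate("CLM-0002", 900, "denied"))

# Rough check of the sample rate over synthetic IDs
sampled = sum(should_evaluate(f"CLM-{i}", 900, "approved") for i in range(10_000))
print(f"{sampled / 10_000:.1%} of auto-approved claims sampled")
```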

Monitoring Metrics

Metric	What to Track	Alert Threshold
Evaluator composite score	Average per day/week	Drop > 0.10 from baseline
HITL rate	% of claims pausing	Spike > 2x baseline
Fraud detection rate	% with score > 0.45	Drop to near zero
Average cost per claim	LLM token cost	Increase > 50%
Pipeline completion time	Median time start to end	Increase > 2x
Agent confidence scores	Average by agent	Drop > 0.15
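
Several of these alert rules reduce to comparing the current window against a stored baseline. The sketch below covers five of the six rows; per-agent confidence would follow the same pattern. Metric key names are hypothetical, and the 0.5% cutoff for "near zero" is an assumption.

```python
from typing import Dict, List

NEAR_ZERO = 0.005  # assumed cutoff for "fraud detection rate dropped to near zero"


def check_alerts(baseline: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Compare current metrics with the baseline and list every alert that fires."""
    alerts: List[str] = []
    if baseline["evaluator_score"] - current["evaluator_score"] > 0.10:
        alerts.append("evaluator composite dropped more than 0.10")
    if current["hitl_rate"] > 2 * baseline["hitl_rate"]:
        alerts.append("HITL rate more than 2x baseline")
    if current["fraud_flag_rate"] < NEAR_ZERO:
        alerts.append("fraud detection rate near zero")
    if current["cost_per_claim"] > 1.5 * baseline["cost_per_claim"]:
        alerts.append("cost per claim up more than 50%")
    if current["median_seconds"] > 2 * baseline["median_seconds"]:
        alerts.append("pipeline time more than 2x baseline")
    return alerts


baseline = {"evaluator_score": 0.82, "hitl_rate": 0.12, "fraud_flag_rate": 0.08,
            "cost_per_claim": 0.20, "median_seconds": 40.0}
current = {"evaluator_score": 0.70, "hitl_rate": 0.30, "fraud_flag_rate": 0.07,
           "cost_per_claim": 0.22, "median_seconds": 45.0}

for alert in check_alerts(baseline, current):
    print("ALERT:", alert)
```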

A/B Testing Agent Prompts

When you update an agent's prompt, run an A/B test:

Route 10% of claims to new prompt
Compare evaluator scores: old vs new
Compare HITL rates: old vs new
If new prompt improves scores → roll out to 100%
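
A minimal sketch of these steps follows. The assignment uses a hash of the claim ID so a claim never switches variants mid-test. The minimum sample size and the requirement that the HITL rate not get worse are assumptions layered on top of the source's rule ("if the new prompt improves scores, roll out"); a real rollout decision would also want a significance test.

```python
import hashlib
from statistics import mean
from typing import List


def assign_variant(claim_id: str, new_prompt_share: float = 0.10) -> str:
    """Send a stable ~10% of claims to the new prompt."""
    digest = hashlib.sha256(f"prompt-ab:{claim_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "new" if bucket < new_prompt_share else "old"


def should_roll_out(old_scores: List[float], new_scores: List[float],
                    old_hitl_rate: float, new_hitl_rate: float,
                    min_samples: int = 30) -> bool:
    """Roll out only if scores improve and HITL rate does not rise (assumed rule)."""
    if len(old_scores) < min_samples or len(new_scores) < min_samples:
        return False  # not enough data to compare yet
    return mean(new_scores) > mean(old_scores) and new_hitl_rate <= old_hitl_rate


print(assign_variant("CLM-0001"))

old = [0.78] * 300
new = [0.83] * 40
print(should_roll_out(old, new, old_hitl_rate=0.14, new_hitl_rate=0.12))
```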

Human Feedback Loop

When a reviewer overrides an AI decision:

Store the override in episodic memory
Flag the claim for re-evaluation
If many overrides on same agent → investigate prompt
Use overrides as training data for fine-tuning (future)

Common Pitfalls to Avoid