🤖 Building Production-Grade Multi-Agent AI Systems

A comprehensive guide based on the Smart Claims Processor case study, covering LangGraph orchestration, CrewAI sub-crews, memory systems, HITL patterns, and production best practices.

System Overview

The Smart Claims Processor is a production-grade multi-agent system that processes insurance claims end-to-end. It demonstrates key patterns for building reliable AI systems:

🎯 Core Technologies

  • LangGraph: State machine orchestration
  • CrewAI: Multi-agent sub-crews
  • ChromaDB: Vector memory
  • FastAPI: REST backend
  • React: Frontend dashboard

📊 System Stats

  • 7 specialized agents
  • 5 distinct pipeline paths
  • 3 memory tiers
  • Multiple HITL checkpoints
  • Country-aware configuration

The 7 Agents

Agent | Role | LLM vs Rules | Confidence Threshold
Intake | Validate & mask PII | Mostly rules | 0.55
Fraud Crew | 3 specialists + manager | LLM-heavy | 0.50
Damage | Assess severity, depreciate | LLM + deterministic math | 0.60
Policy | Coverage & exclusions | DB lookup + LLM reasoning | 0.60
Settlement | Calculate final payout | LLM validation + Python math | 0.65
Evaluator | Grade pipeline output | LLM (separate model) | 0.70 composite
Communication | Generate claimant message | LLM | N/A (always runs)

The 5 Pipeline Paths

Path A: Normal (Happy Path)

Intake → Fraud → Damage → Policy → Settlement → Evaluator → Communication

All 7 agents run sequentially. Triggers when fraud is low (< 0.45), confidence is high, and no exclusions hit.

Path B: HITL (Human Review)

Any Agent → [PAUSE] → Reviewer Decision → Resume → Communication

Triggers when: fraud score is at least 0.45 but below 0.90, an agent's confidence is below its threshold, the evaluator score is below 0.70, or the claim is high-value.

Path C: Auto-Reject (Confirmed Fraud)

Intake → Fraud (≥ 0.90) → Communication

Bypasses all other agents. Used for clear fraud patterns.

Path D: Invalid (Intake Fail)

Intake [FAIL] → Communication

Missing fields, lapsed policy, or incident date out of range.

Path E: Fast Mode (< $500)

Intake → Fraud → Settlement → Evaluator → Communication

Skips Damage and Policy checks for low-value claims with clean history.

Key Design Principles

🎯 Principle 1: LLMs Reason, Code Computes

Never trust an LLM with arithmetic that has financial consequences. Use LLMs for judgment (severity assessment, exclusion reasoning) and Python for math (depreciation, settlement formulas).

🔗 Principle 2: Agents Chain Through State

Agents don't call each other directly. They all read/write from a shared ClaimState TypedDict. This prevents tight coupling and makes the system composable.
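A minimal sketch of this pattern, using a cut-down state and stand-in node logic. The names MiniClaimState, fraud_node, damage_node, and run_pipeline are hypothetical and exist only for illustration; the real pipeline uses LangGraph rather than a hand-written loop.

```python
from typing import Callable, List, TypedDict


class MiniClaimState(TypedDict, total=False):
    # Cut-down stand-in for ClaimState (assumption, not the full schema)
    claim_amount: float
    fraud_score: float
    agent_outputs: dict


def fraud_node(state: MiniClaimState) -> dict:
    # Stand-in scoring rule, for illustration only
    score = 0.1 if state["claim_amount"] < 10_000 else 0.5
    return {"fraud_score": score}


def damage_node(state: MiniClaimState) -> dict:
    # Reads what the fraud node wrote; it never calls fraud_node itself
    outputs = dict(state.get("agent_outputs", {}))
    outputs["damage"] = {"fraud_score_seen": state["fraud_score"]}
    return {"agent_outputs": outputs}


def run_pipeline(nodes: List[Callable[[MiniClaimState], dict]],
                 state: MiniClaimState) -> MiniClaimState:
    for node in nodes:
        state = {**state, **node(state)}  # merge each node's partial update
    return state


final_state = run_pipeline([fraud_node, damage_node], {"claim_amount": 4200.0})
```

Because each node only depends on keys in the state, nodes can be reordered, skipped, or tested in isolation by handing them a hand-built dict.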

🚦 Principle 3: Confidence Gates Everywhere

Every agent outputs a confidence score. Multiple checkpoints can trigger HITL review. This creates a safety net where uncertainty triggers human oversight.
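A confidence gate can be sketched as a single lookup against per-agent thresholds. The threshold values come from the agent table in this guide; the function name needs_human_review and the choice to treat a missing score as 0.0 are assumptions.

```python
# Per-agent thresholds, taken from the agent table in this guide
CONFIDENCE_THRESHOLDS = {
    "intake": 0.55,
    "fraud": 0.50,
    "damage": 0.60,
    "policy": 0.60,
    "settlement": 0.65,
}


def needs_human_review(agent: str, confidence_scores: dict) -> bool:
    """Return True when the agent's confidence is below its threshold.

    A missing score is treated as 0.0 (assumption), so an agent that
    failed to report confidence is always sent to review.
    """
    threshold = CONFIDENCE_THRESHOLDS[agent]
    return confidence_scores.get(agent, 0.0) < threshold


review_damage = needs_human_review("damage", {"damage": 0.58})
```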

💾 Principle 4: Durable State via Checkpoints

LangGraph's SqliteSaver persists state to disk. The pipeline survives server restarts. HITL pauses can last hours or days without losing state.

Agent 1: Intake Agent

Role: The gatekeeper. Validates required fields, checks policy status, and masks PII before any data reaches downstream agents.

What It Does

LLM vs Rules

This agent is mostly rule-based. LLMs are optional here, only used if you need to extract structured data from free-text claim descriptions.

⚠️ When to Use LLM in Intake

Only when the claimant submits unstructured text like: "My Honda was rear-ended last Tuesday, bumper damage, shop quoted $4,200"

The LLM extracts: claim_type=auto, amount=4200, damage_type=rear_end_collision

If the frontend already collects structured data, skip the LLM entirely.

PII Masking (Critical)

This happens before any downstream agent sees the data. Use country-specific regex patterns:

PII_PATTERNS = {
    "us": [
        (r"\b\d{3}-\d{2}-\d{4}\b", "***-**-****"),   # SSN
        (r"\b[A-Z]\d{7}\b", "[DL-MASKED]"),          # Driver license
        (r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "***-***-****"),  # Phone
    ],
    "india": [
        (r"\b\d{4}\s\d{4}\s\d{4}\b", "**** **** ****"),  # Aadhaar
        (r"\b[A-Z]{5}\d{4}[A-Z]\b", "[PAN-MASKED]"),     # PAN
        (r"\b[6-9]\d{9}\b", "**********"),                # Mobile
    ],
}
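Applying the patterns can be sketched as a loop over re.sub. The helper name mask_pii is hypothetical, and the pattern table is abbreviated to two US entries to keep the example short; raising on an unknown country is a design assumption so that unmasked text never passes through silently.

```python
import re

# Abbreviated copy of the US patterns above, to keep the sketch self-contained
PII_PATTERNS = {
    "us": [
        (r"\b\d{3}-\d{2}-\d{4}\b", "***-**-****"),             # SSN
        (r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "***-***-****"),  # Phone
    ],
}


def mask_pii(text: str, country: str) -> str:
    """Apply every pattern registered for the country, in order."""
    if country not in PII_PATTERNS:
        # Fail closed: refuse to return text we could not mask
        raise ValueError(f"no PII patterns configured for country '{country}'")
    for pattern, replacement in PII_PATTERNS[country]:
        text = re.sub(pattern, replacement, text)
    return text


masked = mask_pii("SSN 123-45-6789, call 555-123-4567", "us")
```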

Output Structure

{
    "intake_passed": true,
    "masked_data": {
        "claimant_name": "John Doe [PII-MASKED]",
        "description": "Car accident. My number is ***-***-****",
        "claim_amount": 4200.0
    },
    "pipeline_path": "normal",  # or "fast" if amount < $500
    "confidence_scores": {
        "intake": 0.95
    }
}

✅ Best Practice: Fail Fast, Fail Clear

If intake fails, immediately route to the Communication Agent with a specific denial reason:

  • "Missing required field: incident_date"
  • "Policy POL-001 is not active"
  • "Incident date 2024-06-15 is outside policy coverage period"

Never send a vague "your claim cannot be processed" message.
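One way to produce these reasons is a validator that returns the first specific failure it finds. This is a sketch under assumptions: the function name intake_denial_reason, the REQUIRED_FIELDS list, and the policy record keys (active, start_date, end_date) are all hypothetical.

```python
from datetime import date
from typing import Optional

# Assumed list of required fields; adjust to the real intake schema
REQUIRED_FIELDS = ("claim_id", "policy_number", "incident_date", "claim_amount")


def intake_denial_reason(claim: dict, policy: dict) -> Optional[str]:
    """Return a specific denial reason, or None if intake passes."""
    for field in REQUIRED_FIELDS:
        if claim.get(field) in (None, ""):
            return f"Missing required field: {field}"

    if not policy.get("active", False):
        return f"Policy {claim['policy_number']} is not active"

    incident = date.fromisoformat(claim["incident_date"])
    start = date.fromisoformat(policy["start_date"])
    end = date.fromisoformat(policy["end_date"])
    if not (start <= incident <= end):
        return (f"Incident date {incident.isoformat()} "
                "is outside policy coverage period")

    return None


example_policy = {"active": True, "start_date": "2024-01-01", "end_date": "2024-12-31"}
example_claim = {"claim_id": "CLM-1", "policy_number": "POL-001",
                 "incident_date": "2024-06-15", "claim_amount": 4200.0}
reason = intake_denial_reason(example_claim, example_policy)
```

The returned string can be stored in intake_denial_reason on the state and passed straight to the Communication Agent.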

Agent 2: Fraud Detection Crew

Role: Multi-agent fraud analysis using CrewAI. Three specialist agents investigate the claim, and a manager synthesizes their findings into one fraud score.

The Crew Structure

🔍 Pattern Analyst

Tool: search_fraud_patterns()

Checks claim against known fraud schemes: staged accidents, phantom passengers, inflated invoices, pre-existing damage.

📊 Anomaly Detector

Tool: check_statistical_outlier()

Computes a z-score against the benchmark. Flags the claim if the repair cost is more than 2 standard deviations from the mean for this damage type.

🕵️ Social Validator

Tool: check_claimant_history()

Cross-references claimant's prior claims. Flags multiple claims in short periods or claims across providers.

👔 Manager Agent

No tools: reads all 3 reports

Synthesizes findings into final fraud score (0.0–1.0) with reasoning and flags.

CrewAI Configuration

from crewai import Crew, Process

crew = Crew(
    agents=[pattern_analyst, anomaly_detector,
            social_validator, manager_agent],
    tasks=[pattern_task, anomaly_task, social_task, synthesis_task],
    process=Process.sequential,  # tasks run in the order listed
    verbose=True
)

# synthesis_task has context=[pattern_task, anomaly_task, social_task]
# This is how the manager reads all three reports

Routing Logic

Fraud Score | Action | Next Node
< 0.45 | Low risk, continue pipeline | Damage Assessor (or Settlement if fast mode)
0.45 to < 0.90 | Suspicious, pause for review | HITL Checkpoint
≥ 0.90 | Confirmed fraud, auto-reject | Communication Agent

Why CrewAI for Fraud?

Fraud detection benefits from multiple perspectives. A single LLM call might miss patterns that a team of specialists catches. CrewAI's crew abstraction maps naturally onto this: three analysts submit reports, and the manager synthesizes them.

You could do this with three separate LangGraph nodes, but CrewAI's agent, task, and context abstractions express the analysts-plus-manager structure with less code.

Tool Design Pattern

Notice the tools return structured JSON, not prose. The LLM reads the JSON and incorporates it into its reasoning:

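import json
from crewai.tools import tool  # import path may differ across CrewAI versions

@tool("check_statistical_outlier")
def check_statistical_outlier(claim_type: str, amount: float) -> str:
    """Compare a claim amount against the benchmark for its claim type."""
    benchmarks = {"auto": {"mean": 4000, "std": 1500}}
    b = benchmarks.get(claim_type)
    if b is None:
        # No benchmark for this claim type: report that instead of crashing
        return json.dumps({"error": f"no benchmark for claim type '{claim_type}'"})
    z_score = (amount - b["mean"]) / b["std"]

    return json.dumps({
        "z_score": round(z_score, 2),
        "is_outlier": abs(z_score) > 2,
        "benchmark_mean": b["mean"]
    })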

⚠️ Common Mistake: Letting the LLM Query the Database

Never give the LLM direct database access. Always have a Python tool that runs the query and returns structured data. The LLM reasons about the data, not about SQL.

Output Structure

{
    "fraud_score": 0.23,
    "decision": "low_risk",
    "reasoning": "No known patterns matched. Claim amount within normal range...",
    "flags": ["low_risk"],
    "pattern_matches": [],
    "anomaly_detected": false,
    "history_clean": true
}

Agent 3: Damage Assessor

Role: Assess damage severity and estimate repair cost. Apply country-specific depreciation rules deterministically.

The Pattern: LLM Judges, Python Computes

🤖 LLM Responsibilities

  • Assess damage severity (low/medium/high/total_loss)
  • Estimate raw pre-depreciation repair cost
  • Call search_similar_claims() tool to calibrate
  • Provide reasoning and flags

🐍 Python Responsibilities

  • Apply depreciation formula (NOT left to LLM)
  • Use country-specific depreciation tables
  • Compute final post-depreciation cost
  • Handle edge cases (vehicle age, part types)

Depreciation by Country

US: Year-Based Depreciation

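def apply_depreciation_us(raw_cost: float, vehicle_age_years: int) -> dict:
    """Apply the year-based depreciation rate for the vehicle's age."""
    if raw_cost < 0 or vehicle_age_years < 0:
        raise ValueError("raw_cost and vehicle_age_years must be non-negative")

    rates = {1: 0.20, 2: 0.15, 3: 0.12, 4: 0.10}
    # Any age not in the table falls back to 8%. That covers 5+ years, but
    # it also applies to age 0, which the table does not define.
    rate = rates.get(vehicle_age_years, 0.08)

    depreciation = raw_cost * rate
    final = raw_cost - depreciation

    return {
        "raw_cost": raw_cost,
        "depreciation_rate": rate,
        "depreciation_amount": round(depreciation, 2),
        "final_cost": round(final, 2),
        "method": "year_based"
    }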

India: IRDAI Part-Wise Depreciation

def apply_depreciation_india(parts: dict) -> dict:
    # parts = {"rubber": 2000, "metal": 8000, "glass": 1500}
    rates = {"rubber": 0.50, "metal": 0.05, "glass": 0.00}
    
    total_raw = 0
    total_after = 0
    
    for part, cost in parts.items():
        rate = rates.get(part, 0.10)
        after = cost * (1 - rate)
        total_raw += cost
        total_after += after
    
    return {
        "total_raw_cost": round(total_raw, 2),
        "total_after_dep": round(total_after, 2),
        "method": "part_wise_irdai"
    }

Memory Integration

The LLM can call search_similar_claims() to calibrate its estimate against historical data:

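@tool("search_similar_claims")
def search_similar_claims(claim_type: str, description: str) -> str:
    """Find similar past claims and summarize their settlements."""
    # ChromaDB semantic search (covered in memory section)
    results = long_term_collection.query(
        query_embeddings=[embed_text(description)],
        n_results=5,
        where={"claim_type": claim_type}
    )

    # query() returns one list per query embedding; we sent one, so take index 0
    metadatas = results["metadatas"][0]
    amounts = [m["settlement_amount"] for m in metadatas]
    # Day 1 has no history, so guard against dividing by zero
    avg_settlement = sum(amounts) / len(amounts) if amounts else None

    return json.dumps({
        "similar_claims_found": len(metadatas),
        "average_settlement": avg_settlement,
        "common_severity": "medium"  # placeholder value, not computed here
    })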

✅ Why This Pattern Works

On day 1 with zero historical claims, the LLM relies on its training data. By day 100 with 500 completed claims, it searches memory and finds: "Similar 2020 Honda Civics averaged $4,200". The system gets smarter over time.

Confidence Gate

If confidence < 0.60, the pipeline pauses for HITL review.

Output Structure

{
    "llm_assessment": {
        "severity": "medium",
        "raw_cost_estimate": 4200.0,
        "confidence": 0.88,
        "reasoning": "Front-end collision with bumper and hood damage...",
        "vehicle_age_years": 2
    },
    "after_depreciation": {
        "method": "year_based",
        "depreciation_rate": 0.15,
        "depreciation_amount": 630.0,
        "final_cost": 3570.0
    }
}

Agent 4: Policy Checker

Role: Determine if the claim is covered under the policy. Check exclusions, apply deductible, calculate maximum payout.

The Pattern: Python Fetches Facts, LLM Reasons

Policy rules are structured data (coverage limits, deductibles, exclusions). Python fetches those facts from the database. The LLM then applies judgment:

  • "Does this flood damage claim trigger the water damage exclusion?"
  • "Is pre-existing damage clearly documented?"
  • "Should this edge case be escalated?"

Tool 1: Get Policy Details

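@tool("get_policy_details")
def get_policy_details(policy_number: str, claim_type: str) -> str:
    """Return limits, deductible, exclusions, and prior claims for a policy."""
    policy = POLICY_DB.get(policy_number)
    if policy is None:
        return json.dumps({"error": f"policy {policy_number} not found"})

    # Use the same "prior_claims" key as check_remaining_coverage
    prior_claims = policy["prior_claims"]
    return json.dumps({
        "coverage_limit": policy["limits"].get(claim_type, 0),
        "deductible": policy["deductibles"].get(claim_type, 0),
        "exclusions": policy["exclusions"],
        "prior_claims": prior_claims,
        "total_prior_paid": sum(c["amount"] for c in prior_claims)
    })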

Tool 2: Check Remaining Coverage

@tool("check_remaining_coverage")
def check_remaining_coverage(policy_number: str,
                              claim_type: str,
                              damage_estimate: float) -> str:
    """Compute the remaining limit and maximum payout for a damage estimate."""
    policy = POLICY_DB.get(policy_number)
    if policy is None:
        return json.dumps({"error": f"policy {policy_number} not found"})

    limit = policy["limits"].get(claim_type, 0)
    deductible = policy["deductibles"].get(claim_type, 0)
    prior_paid = sum(c["amount"] for c in policy["prior_claims"])

    # Clamp at zero so an exhausted limit never yields a negative payout
    remaining_limit = max(0, limit - prior_paid)
    payable_damage = max(0, damage_estimate - deductible)
    max_payout = min(payable_damage, remaining_limit)

    return json.dumps({
        "remaining_limit": remaining_limit,
        "deductible": deductible,
        "max_payout": round(max_payout, 2),
        "coverage_exhausted": remaining_limit <= 0
    })

LLM Reasoning Prompt

The system prompt explicitly guides exclusion reasoning:

Exclusion Reasoning Rules

  • Apply exclusions strictly but fairly
  • "flood_damage" exclusion: only applies if primary cause is flooding
  • "pre_existing_damage": only if damage clearly predates the incident
  • When in doubt, flag for human review rather than denying

This biases the system toward the claimant: a wrongful denial is worse than an unnecessary human review.

Routing Logic

Condition | Next Node
Ineligible (exclusion hit) | Communication Agent (denial)
Confidence < 0.60 | HITL Checkpoint
Eligible | Settlement Calculator
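The table can be sketched as a routing function in the same style as the fraud router. The function name route_after_policy and the agent key "policy" are assumptions. The sketch checks confidence first, so a low-confidence "ineligible" verdict goes to a reviewer instead of straight to denial; that ordering follows the "when in doubt, flag for human review" rule but is an interpretation, not something the table specifies.

```python
def route_after_policy(state: dict) -> str:
    """Pick the next node after the Policy Checker."""
    policy_out = state.get("agent_outputs", {}).get("policy", {})
    confidence = state.get("confidence_scores", {}).get("policy", 0.0)

    if confidence < 0.60:
        return "hitl"            # uncertain either way: ask a human
    if not policy_out.get("eligible", False):
        return "communication"   # confident denial, with cited exclusion
    return "settlement"          # confident approval


next_node = route_after_policy({
    "agent_outputs": {"policy": {"eligible": True}},
    "confidence_scores": {"policy": 0.85},
})
```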

Output Structure

{
    "eligible": true,
    "max_payout": 8000.0,
    "deductible": 500.0,
    "exclusions_hit": [],
    "coverage_limit": 15000.0,
    "remaining_coverage": 13800.0,
    "confidence": 0.85,
    "reasoning": "Claim falls within auto coverage. No exclusions apply...",
    "flags": []
}

⚠️ Critical: Cite Specific Policy Clauses

When denying for an exclusion, the LLM must cite the exact policy section:

"Your policy (POL-001) includes a flood damage exclusion under Section 4.2(c). The incident was caused by rising floodwater, which falls under this exclusion."

This is both a UX requirement and a legal one in insurance.

Agent 5: Settlement Calculator

Role: Compute the final settlement amount using country-specific formulas. The LLM validates inputs, and Python does the math.

Settlement Formulas

US: Damage-Based (115% Cap)

settlement = min(
    damage_after_depreciation,
    damage_after_depreciation * 1.15,  # 115% buffer
    remaining_coverage_limit
) - deductible

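The case study describes the 115% cap as an insurance industry rule that leaves a small over-assessment buffer for unforeseen costs. As written, though, the second term can never be the minimum, because min(x, 1.15x) is x for any non-negative x. The cap only binds when the first term is a different quantity, such as the claimed or invoiced amount.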

India: IDV-Based (100% Cap)

settlement = min(
    idv,  # Insured Declared Value
    remaining_coverage_limit
) - deductible

No buffer. IRDAI regulations cap at exactly 100% of IDV.
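Both formulas can be sketched as small pure functions. The function names settle_us and settle_india are hypothetical. The US version takes the claimed amount as its first term, which is an assumption about where the 115% cap is meant to bind, and both versions clamp the result at zero, which the formulas above do not state.

```python
def settle_us(claimed_amount: float,
              damage_after_depreciation: float,
              remaining_coverage_limit: float,
              deductible: float) -> float:
    """US: claimed amount, capped at 115% of assessed damage and at coverage."""
    capped = min(
        claimed_amount,                    # assumption: cap applies to this
        damage_after_depreciation * 1.15,  # 115% buffer
        remaining_coverage_limit,
    )
    return round(max(0.0, capped - deductible), 2)


def settle_india(idv: float,
                 remaining_coverage_limit: float,
                 deductible: float) -> float:
    """India: capped at 100% of Insured Declared Value, no buffer."""
    capped = min(idv, remaining_coverage_limit)
    return round(max(0.0, capped - deductible), 2)


us_settlement = settle_us(4200.0, 4200.0, 15000.0, 500.0)
```

With a claimed amount of 4200, assessed damage of 4200, and a 500 deductible, the 115% cap (4830) does not bind and the result is 3700.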

LLM's Role: Validator, Not Calculator

By this point in the pipeline, you have structured numbers from prior agents. The LLM's job flips from reasoner to validator:

What the LLM Checks

  • Are the inputs from prior agents consistent?
  • Does damage estimate vs claim amount make sense?
  • Is max_payout realistic given policy limits?
  • Should any inputs be overridden due to obvious errors?

Tool: Verify Consistency

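@tool("verify_settlement_consistency")
def verify_settlement_consistency(damage_estimate: float,
                                   policy_max_payout: float,
                                   claim_amount: float) -> str:
    """Flag inconsistencies between damage estimate, claim amount, and coverage."""
    # A non-positive claim amount yields ratio 0, which trips the "very low" flag
    ratio = damage_estimate / claim_amount if claim_amount > 0 else 0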
    flags = []
    
    if ratio > 1.5:
        flags.append("damage_estimate_exceeds_claim_by_50pct")
    if ratio < 0.2:
        flags.append("damage_estimate_very_low_vs_claim")
    if damage_estimate > policy_max_payout * 2:
        flags.append("damage_far_exceeds_coverage")
    
    return json.dumps({
        "damage_to_claim_ratio": round(ratio, 2),
        "flags": flags,
        "looks_consistent": len(flags) == 0
    })

The Override Mechanism

If the LLM spots that a prior agent's output is clearly wrong, it can set override_damage_estimate before the Python formula runs:

{
    "inputs_validated": false,
    "confidence": 0.45,
    "reasoning": "Damage estimate of $50,000 for minor scratch is clearly wrong...",
    "flags": ["damage_estimate_unrealistic"],
    "override_damage_estimate": 800.0  # LLM's corrected estimate
}

This is a rare but important escape hatch for obvious errors.

Confidence Threshold: 0.65 (Higher Than Other Agents)

Settlement is the last financial decision before payout. It demands more certainty. If confidence < 0.65, route to HITL.

✅ Best Practice: Show Your Work

The output should include a clear breakdown so the claimant (and auditors) can verify the math:

{
    "method": "damage_based_us",
    "damage_estimate": 4200.0,
    "damage_cap_115": 4830.0,
    "deductible": 500.0,
    "settlement_amount": 3700.0,
    "currency": "USD"
}

Agent 6: Evaluator (LLM-as-Judge)

Role: Quality gate. A separate LLM grades the entire pipeline's output on 5 dimensions before the claim is released.

Why This Agent Exists

Without systematic evaluation, you have no way to know if your pipeline is producing good outputs. Manual spot-checking doesn't scale.

The evaluator runs on every claim (or a sample) and gives you quantitative scores you can track over time. If the composite score drops from 0.85 to 0.72 over a week, you know something broke.

The 5 Scoring Dimensions

Dimension | What It Measures | Maps To
Accuracy | Are the numbers correct? Does settlement = damage - deductible? | Auditability
Completeness | Were all relevant factors considered? Any missing steps? | Thoroughness
Fairness | No demographic bias? Similar claims treated similarly? | Anti-discrimination laws
Safety | No harmful recommendations? Protects claimant interests? | Consumer protection
Transparency | Is reasoning clear? Can claimant understand the decision? | Explainability mandates

Each dimension gets a score from 0.0 to 1.0. Composite score = average of the 5 dimensions.

The Quality Gate

Pass Threshold: 0.70

If composite score ≥ 0.70 → Continue to Communication Agent

If composite score < 0.70 → Pause for HITL review
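The composite score and gate can be sketched in a few lines. The function name grade is hypothetical; the dimension names and the 0.70 threshold come from this section. The pass decision uses the unrounded mean so that rounding cannot push a borderline score over the threshold.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "fairness", "safety", "transparency")
PASS_THRESHOLD = 0.70


def grade(scores: dict) -> dict:
    """Average the five dimension scores and apply the quality gate."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")

    composite = mean(scores[d] for d in DIMENSIONS)
    return {
        "composite_score": round(composite, 2),
        "passed": composite >= PASS_THRESHOLD,
    }


result = grade({
    "accuracy": 0.90, "completeness": 0.85, "fairness": 0.92,
    "safety": 0.95, "transparency": 0.78,
})
```

For the scores shown, the composite is 0.88 and the claim passes the gate.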

Use a Different LLM

Critical pattern: the evaluator uses a different model from the pipeline agents:

from langchain_google_genai import ChatGoogleGenerativeAI

# Pipeline agents use this
pipeline_llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

# Evaluator uses this (different model, lower temperature)
evaluator_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash-lite",
    temperature=0.3  # lower temp for consistency
)

Why? Avoids bias where a model grades its own outputs favorably. The evaluator should be independent.

Lower Temperature for Consistency

The evaluator runs at temperature=0.3 (vs 0.7–1.0 for reasoning agents). You want consistent scoring: two identical claims should get the same evaluation score.

The Evaluator Is Read-Only

The evaluator never modifies the settlement amount, fraud score, or any decision. It just says "this looks good" or "this needs human review."

Outputs from Agents 1-5 flow through unchanged. This separation of concerns makes the system auditable.

Batch Sampling for Ongoing Monitoring

The case study also runs the evaluator on 10% of auto-processed claims as a batch job and tracks the scores over time.

This is how you detect model drift before customers complain.
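One way to pick the 10% sample is to hash the claim ID, so the same claim is always in or out of the sample regardless of when the batch job runs. The source only states the 10% rate; the hash-based selection and the function name in_eval_sample are assumptions.

```python
import hashlib


def in_eval_sample(claim_id: str, rate: float = 0.10) -> bool:
    """Deterministically select roughly `rate` of claims by hashing the ID."""
    digest = hashlib.sha256(claim_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


sampled = [cid for cid in (f"CLM-2024-{n:05d}" for n in range(10_000))
           if in_eval_sample(cid)]
```

Random sampling would work too, but a deterministic rule makes the batch job reproducible for auditors.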

Output Structure

{
    "accuracy": 0.90,
    "completeness": 0.85,
    "fairness": 0.92,
    "safety": 0.95,
    "transparency": 0.78,
    "composite_score": 0.88,
    "passed": true,
    "critical_flags": [],
    "reasoning": "All numbers check out. Reasoning is clear..."
}

Agent 7: Communication Agent

Role: Final agent. Generates the message sent to the claimant. Every pipeline path ends here.

Message Requirements by Outcome

Outcome | Message Must Include
Approved | Settlement amount, breakdown (damage - deductible = payout), payment timeline, next steps
Denied | Specific reason (never generic), policy clause cited, appeal rights, timeline
HITL Pending | Review in progress, expected timeline, reviewer contact info
Auto-Rejected (Fraud) | Specific red flags (not accusatory), investigation notice, appeal process

The Golden Rule: Never Generic Denials

❌ Bad Denial Message

"Your claim has been denied."

✅ Good Denial Message

"Your policy (POL-001) includes a flood damage exclusion under Section 4.2(c). The incident on June 15 was caused by rising floodwater from the nearby river, which falls under this exclusion. Water damage from internal sources (burst pipes, appliance leaks) would be covered, but external flooding is specifically excluded."

The LLM extracts the actual reason from the pipeline trace and explains it clearly. This is both a UX and legal requirement in insurance.

Country-Specific Regulatory Footers

Every message gets a regulatory footer appended (not optional):

REGULATORY_FOOTERS = {
    "us": """
---
If you disagree with this decision, you have the right to appeal.
Contact your State Insurance Commissioner:
https://content.naic.org/state-insurance-departments

For questions, call 1-800-CLAIMS-1 (M-F 9am-5pm ET)
Reference claim ID: {claim_id}
""",
    "india": """
---
यदि आप इस निर्णय से असहमत हैं, तो आपको अपील करने का अधिकार है।
Contact IRDAI Grievance Redressal:
https://www.irdai.gov.in

For questions, call 1800-425-4732 (M-F 10am-6pm IST)
Reference claim ID: {claim_id}
"""
}
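Appending the footer in code, rather than asking the LLM to include it, guarantees it is always present. A minimal sketch follows; the helper name with_regulatory_footer is hypothetical and the footer table is abbreviated to one shortened US entry.

```python
# Abbreviated footer table, to keep the sketch self-contained
REGULATORY_FOOTERS = {
    "us": (
        "\n---\n"
        "If you disagree with this decision, you have the right to appeal.\n"
        "Reference claim ID: {claim_id}\n"
    ),
}


def with_regulatory_footer(message: str, country: str, claim_id: str) -> str:
    """Append the country's footer; refuse to send a message without one."""
    if country not in REGULATORY_FOOTERS:
        raise ValueError(f"no regulatory footer configured for country '{country}'")
    footer = REGULATORY_FOOTERS[country].format(claim_id=claim_id)
    return message.rstrip() + "\n" + footer


final_message = with_regulatory_footer(
    "Your claim has been approved.", "us", "CLM-2024-001"
)
```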

Tone Adaptation

The system prompt explicitly tells the LLM to match its tone to the outcome.

This prevents inappropriate tone (e.g., being cheerful about a denial).

Example: Approval Message

Dear John Doe,

Good news! Your auto claim (CLM-2024-001) has been approved.

Claim breakdown:
- Damage assessment:  $4,200.00
- Depreciation (15%): -$630.00
- Adjusted damage:     $3,570.00
- Deductible:          -$500.00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Settlement amount:     $3,070.00

Payment will be processed within 3-5 business days via direct
deposit. You'll receive confirmation once complete.

Next steps:
1. Schedule repairs with your preferred shop
2. Forward final invoice to claims@insurance.com
3. Keep your claim ID handy: CLM-2024-001

---
[Regulatory footer]

Example: Fraud Rejection Message

Dear John Doe,

Your claim (CLM-2024-089) has been flagged and cannot be processed.

Our fraud detection identified these concerns:
- Claim matches a known staged accident pattern
- Repair estimate is 280% above benchmark for this damage
- This is your third claim in 6 months across two providers

Investigation notice:
Your claim has been referred to our Special Investigations Unit.
You will be contacted within 10 business days.

You have the right to provide additional documentation.
Contact: fraud-review@insurance.com

---
[Regulatory footer]

LangGraph Orchestration

LangGraph is the glue that connects all 7 agents into a working pipeline. It turns a collection of functions into a state machine with durable checkpoints and conditional routing.

Core LangGraph Concepts

Concept | What It Does
StateGraph | Defines nodes (agents) and edges (connections)
Nodes | Python functions: (state) → updated_state
Edges | Connections between nodes, can be conditional
State (TypedDict) | Shared data structure flowing through pipeline
Checkpoints | Durable state snapshots (survives crashes)
interrupt() | Pauses execution for HITL, resume later

The State Schema

Every agent reads from and writes to this shared state:

from typing import TypedDict

class ClaimState(TypedDict):
    # Input data
    claim_id: str
    claimant_name: str
    policy_number: str
    claim_type: str
    incident_date: str
    claim_amount: float
    description: str
    country: str
    
    # Intake agent
    masked_data: dict
    intake_passed: bool
    intake_denial_reason: str | None
    
    # Fraud agent
    fraud_score: float
    
    # Routing
    pipeline_path: str  # "normal" | "hitl" | "auto_reject" | "invalid" | "fast"
    
    # All agent outputs
    agent_outputs: dict  # {agent_name: output_dict}
    confidence_scores: dict  # {agent_name: confidence_float}
    
    # HITL
    hitl_ticket: dict
    pipeline_status: str

Building the Graph

from langgraph.graph import StateGraph

workflow = StateGraph(ClaimState)

# Add all nodes
workflow.add_node("intake", intake_agent_node)
workflow.add_node("fraud", fraud_detection_node)
workflow.add_node("damage", damage_assessor_node)
workflow.add_node("policy", policy_checker_node)
workflow.add_node("settlement", settlement_calculator_node)
workflow.add_node("evaluator", evaluator_node)
workflow.add_node("communication", communication_agent_node)
workflow.add_node("hitl", hitl_checkpoint_node)

# Set entry point
workflow.set_entry_point("intake")

Conditional Edges (The Routing Logic)

Each conditional edge is driven by a function that reads state and returns the name of the next node:

def route_after_fraud(state: ClaimState) -> str:
    score = state.get("fraud_score", 0.0)
    path = state.get("pipeline_path", "normal")
    
    if score >= 0.90:
        return "communication"  # auto-reject
    elif score >= 0.45:
        return "hitl"           # pause for review
    elif path == "fast":
        return "settlement"     # skip damage + policy
    else:
        return "damage"         # normal path

workflow.add_conditional_edges(
    "fraud",
    route_after_fraud,
    {
        "damage": "damage",
        "settlement": "settlement",
        "communication": "communication",
        "hitl": "hitl"
    }
)

The 5 Pipeline Paths (Emergent Behavior)

There's no explicit "path A" or "path B" code. The paths emerge from routing logic:

Path A: Normal

intake → fraud → damage → policy → settlement → evaluator → communication

All 7 agents run. Fraud < 0.45, all confidence gates pass.

Path B: HITL

any_agent → hitl → [PAUSE] → resume → next_agent → communication

Triggers when: fraud score is at least 0.45 but below 0.90, confidence < threshold, or eval < 0.70.

Path C: Auto-Reject

intake → fraud (≥0.90) → communication

Confirmed fraud. Bypasses all other agents.

Path D: Invalid

intake [FAIL] → communication

Missing fields, lapsed policy, or date out of range.

Path E: Fast Mode

intake → fraud → settlement → evaluator → communication

Amount < $500. Skips damage and policy checks.

Durable Checkpoints with SqliteSaver

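import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver

# Recent releases expose SqliteSaver.from_conn_string() as a context manager,
# so a long-lived saver is built from an explicit connection instead.
conn = sqlite3.connect("claims_checkpoints.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

graph = workflow.compile(
    checkpointer=checkpointer,
    interrupt_before=["hitl"]  # pause before HITL node
)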

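What this gives you: state is persisted to disk, the pipeline survives server restarts, and a HITL pause can last hours or days without losing state.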

HITL Interrupt/Resume Flow

Processing a Claim (First Time)

def process_claim(claim_data: dict) -> dict:
    config = {"configurable": {"thread_id": claim_data["claim_id"]}}
    result = graph.invoke(claim_data, config)
    return result

Resuming After HITL Review

from langgraph.types import Command

def resume_hitl_claim(claim_id: str, reviewer_decision: dict) -> dict:
    config = {"configurable": {"thread_id": claim_id}}
    
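    # Command(resume=...) supplies the value returned by interrupt() inside
    # the hitl node. If you pause only via interrupt_before, write the decision
    # with graph.update_state(...) and resume with graph.invoke(None, config).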
    result = graph.invoke(
        Command(resume=reviewer_decision),
        config
    )
    return result

Key Design Decisions

✅ Why LangGraph Over LangChain AgentExecutor?

LangChain's AgentExecutor is great for simple tool-calling loops but falls apart for complex workflows with:

  • Conditional routing (fraud score determines path)
  • Parallel execution (future: run damage + policy concurrently)
  • Durable state (checkpoints)
  • Multiple HITL gates

LangGraph gives you explicit control over the graph topology. You define exactly which node runs after which, under what conditions.

Memory System (3-Tier Architecture)

Memory is what lets agents learn from past claims instead of treating every claim as the first one they've ever seen.

The 3 Tiers

Tier | Storage | What It Holds | Lifetime
1. Short-Term | LangGraph State | Current claim data | One execution (+ checkpoint)
2. Long-Term | ChromaDB collection | All completed claim outcomes | Permanent (7-year audit)
3. Episodic | ChromaDB collection | Human overrides, fraud cases | Permanent (learning data)

Tier 1: Short-Term Memory

This is just the ClaimState TypedDict flowing through the pipeline. Every agent reads from and writes to it. It exists only during processing (plus checkpoint persistence for HITL pauses).

Tier 2: Long-Term Memory (Historical Outcomes)

What Gets Stored

{
    "claim_id": "CLM-2023-045",
    "claim_type": "auto",
    "vehicle_type": "sedan",
    "damage_type": "front_end_collision",
    "incident_description": "Rear-ended at stoplight...",
    "final_settlement": 3800,
    "damage_severity": "medium",
    "fraud_score": 0.12,
    "decision": "approved",
    "timestamp": "2023-06-15T14:23:00Z"
}

How Agents Use It

Tier 3: Episodic Memory (Special Events)

What Gets Stored

{
    "episode_type": "fraud_confirmed",
    "claim_id": "CLM-2023-089",
    "description": "Staged accident, phantom passenger scheme",
    "fraud_indicators": [
        "3_claims_in_6_months",
        "repair_shop_flagged",
        "witness_inconsistent"
    ],
    "outcome": "denied_after_investigation",
    "lesson": "Multiple claims + flagged shop = high fraud risk"
}

How Agents Use It

Technical Implementation

Embedding Model (Runs Locally)

from typing import List

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# 80MB, runs on CPU in ~50ms, no API calls

def embed_text(text: str) -> List[float]:
    return embedding_model.encode(text).tolist()

ChromaDB Setup

import chromadb

client = chromadb.PersistentClient(path="./chromadb_data")

long_term_collection = client.get_or_create_collection(
    name="long_term_claims"
)

episodic_collection = client.get_or_create_collection(
    name="episodic_memory"
)

Storing a Claim Outcome

def store_claim_outcome(claim_state: dict) -> None:
    # Agent key names ("damage", "settlement") and output fields are assumed
    # to match the Output Structure examples shown for each agent.
    outputs = claim_state["agent_outputs"]
    severity = outputs["damage"]["llm_assessment"]["severity"]
    amount = outputs["settlement"]["settlement_amount"]

    description = f"""
    Claim type: {claim_state['claim_type']}
    Incident: {claim_state['masked_data']['description']}
    Severity: {severity}
    Settlement: ${amount}
    """

    long_term_collection.add(
        ids=[claim_state["claim_id"]],
        embeddings=[embed_text(description)],
        documents=[description],
        metadatas=[{
            "claim_type": claim_state["claim_type"],
            "settlement_amount": amount,
            "fraud_score": claim_state["fraud_score"]
        }]
    )

Searching Similar Claims (Tool)

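from typing import Optional

@tool("search_similar_claims")
def search_similar_claims(query: str, claim_type: Optional[str] = None) -> str:
    """Semantic search over completed claims, optionally filtered by claim type."""
    where_filter = {"claim_type": claim_type} if claim_type else None

    results = long_term_collection.query(
        query_embeddings=[embed_text(query)],
        n_results=5,
        where=where_filter
    )

    # query() returns one list per query embedding; we sent one, so take index 0
    metadatas = results["metadatas"][0]
    amounts = [m["settlement_amount"] for m in metadatas]
    # Guard against an empty result set (no history yet)
    avg_settlement = round(sum(amounts) / len(amounts), 2) if amounts else None

    return json.dumps({
        "similar_claims_found": len(metadatas),
        "average_settlement": avg_settlement,
        "examples": [...]  # elided in the source; fill in before serializing
    })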

How Memory Improves Over Time

The Compounding Effect

Day 1: Zero historical claims. Damage Assessor relies on LLM training data only.

Day 100: 500 completed claims. Searches memory: "For 2020 Honda Civics with front-end damage, we settled 12 similar claims averaging $4,150."

Day 1000: 5000+ claims. Memory so rich that estimates are highly accurate from pattern matching alone.

Why Semantic Search Matters

Embeddings enable semantic similarity, not just keyword matching:

Stored claim: "Front-end collision, Honda Civic, bumper damage"

Search query: "Rear impact on sedan"

Result: ChromaDB finds them similar (both involve collision + sedan) even though exact words don't match.

Keyword search would miss this connection. Vector search captures it.

The Feedback Loop

When a reviewer overrides an AI decision, that gets stored in episodic memory. Next time a similar claim comes through:

Agent searches: "Settlement calculation for flood damage"
Episodic memory returns: "Last time I suggested denial for flood, 
reviewer approved it because the damage was from a burst pipe, 
not external flooding."

Agent adjusts confidence: 0.85 → 0.60 (triggers HITL)

Over time, the system learns what kinds of overrides happen and adjusts behavior.
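
The confidence adjustment described above could be sketched roughly as follows. The source does not say how the adjustment is computed, so the `Override` record, the similarity cutoff, and the fixed 0.25 penalty (chosen only so the numbers match the 0.85 → 0.60 example) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Override:
    """Hypothetical episodic-memory record of a reviewer override."""
    claim_summary: str
    ai_decision: str
    reviewer_decision: str
    similarity: float  # similarity to the current claim, 0.0-1.0


def adjust_confidence(base: float, overrides: list[Override],
                      min_similarity: float = 0.8,
                      penalty: float = 0.25) -> float:
    """Lower confidence once if any sufficiently similar override exists."""
    relevant = [o for o in overrides if o.similarity >= min_similarity]
    if not relevant:
        return base
    return round(max(0.0, base - penalty), 2)


# Assumed example data, mirroring the flood / burst-pipe case in the text.
past = [Override("flood damage, burst pipe", "deny", "approve", 0.91)]
adjusted = adjust_confidence(0.85, past)
needs_review = adjusted < 0.65  # Settlement threshold from the agent table
```

A flat penalty is the simplest option; scaling it by similarity or by the number of matching overrides would be a natural refinement.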

Production Best Practices

1. Guardrails Are Not Optional

| Guardrail | Default | Why It Matters |
|---|---|---|
| Max agent calls | 25 | Prevents infinite loops |
| Max tokens | 50,000 | Caps LLM usage per claim |
| Max cost | $0.50 | Hard dollar limit |
| Max execution time | 300s | Timeout for pipeline |
| Min confidence | 0.60 | Forces HITL if uncertain |

All caps are configurable via environment variables. When a cap is hit, route to HITL with a clear reason, for example: "guardrail: token limit exceeded".
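
One way the caps above might be loaded and enforced is sketched below. The environment variable names and the shape of the `usage` dictionary are hypothetical; only the default values and the reason-string format come from the text.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Guardrails:
    max_agent_calls: int = 25
    max_tokens: int = 50_000
    max_cost_usd: float = 0.50
    max_seconds: float = 300.0

    @classmethod
    def from_env(cls) -> "Guardrails":
        # Environment variable names are hypothetical.
        return cls(
            max_agent_calls=int(os.environ.get("MAX_AGENT_CALLS", 25)),
            max_tokens=int(os.environ.get("MAX_TOKENS", 50_000)),
            max_cost_usd=float(os.environ.get("MAX_COST_USD", 0.50)),
            max_seconds=float(os.environ.get("MAX_EXECUTION_SECONDS", 300)),
        )


def check_guardrails(usage: dict, limits: Guardrails) -> Optional[str]:
    """Return a HITL reason for the first cap exceeded, or None if all pass."""
    checks = [
        ("agent_calls", limits.max_agent_calls, "agent call limit exceeded"),
        ("tokens", limits.max_tokens, "token limit exceeded"),
        ("cost_usd", limits.max_cost_usd, "cost limit exceeded"),
        ("seconds", limits.max_seconds, "execution time exceeded"),
    ]
    for key, cap, reason in checks:
        if usage.get(key, 0) > cap:
            return f"guardrail: {reason}"
    return None
```

Calling `check_guardrails` after every agent step, rather than once at the end, lets a runaway claim be paused as soon as it crosses a cap.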

2. PII Masking Happens at the Edge

Critical Security Rule

The Intake Agent is the only agent that sees raw PII. Every downstream agent works with masked data. This creates a clear security boundary.

If an LLM logs or leaks data, that data is already masked, so no SSNs or Aadhaar numbers end up in the logs.
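
A minimal sketch of edge masking, assuming regex-based detection. The patterns are deliberately simplified (a dashed US SSN and a 12-digit Aadhaar number in 4-4-4 groups) and the placeholder tokens are hypothetical; a production masker would need to cover more formats and identifier types.

```python
import re

# Simplified patterns for illustration only.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def mask_pii(text: str) -> str:
    """Replace recognizable identifiers before text leaves the Intake Agent."""
    text = SSN_RE.sub("[SSN]", text)
    text = AADHAAR_RE.sub("[AADHAAR]", text)
    return text


masked = mask_pii("SSN 123-45-6789, Aadhaar 1234 5678 9012")
```

Downstream agents would only ever receive the output of `mask_pii`, never the raw claim text.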

3. Confidence Thresholds Increase Over Pipeline

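Notice that the thresholds generally rise as you get closer to the payout decision. Fraud, at 0.50, is the one exception, because later stages can still catch what it misses:

| Agent | Threshold | Reasoning |
|---|---|---|
| Intake | 0.55 | Mostly rules, low uncertainty |
| Fraud | 0.50 | Early in pipeline, can catch later |
| Damage | 0.60 | Financial estimate, needs confidence |
| Policy | 0.60 | Exclusions are serious |
| Settlement | 0.65 | Last financial decision |
| Evaluator | 0.70 | Final quality gate |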
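
The per-agent thresholds above can be expressed as a simple lookup. This is a sketch: the dictionary keys and function name are illustrative, while the values are the ones listed in this section.

```python
THRESHOLDS = {
    "intake": 0.55,
    "fraud": 0.50,
    "damage": 0.60,
    "policy": 0.60,
    "settlement": 0.65,
    "evaluator": 0.70,
}


def needs_human_review(agent: str, confidence: float) -> bool:
    """True when an agent's confidence falls below its own threshold."""
    return confidence < THRESHOLDS[agent]
```

The same confidence of 0.62 passes for Damage but pauses the pipeline at Settlement, which is the point of raising the bar near the payout decision.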

4. Audit Trails for Regulatory Compliance

Insurance is heavily regulated, so every agent action gets logged. Auditors can then trace any decision back through the exact chain of agent actions.

5. Structured Outputs Are King

Every agent returns structured data (JSON), not prose. This makes the output parseable and comparable across agents:

## Good (structured)
{
    "fraud_score": 0.23,
    "decision": "low_risk",
    "reasoning": "...",
    "flags": []
}

## Bad (prose)
"The fraud analysis indicates this claim appears legitimate 
with a low risk score of approximately 0.23..."
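
A sketch of how the structured form above might be validated before it enters shared state. The `FraudResult` class and `parse_fraud_result` function are hypothetical names; the fields are the ones shown in the "Good" example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FraudResult:
    fraud_score: float
    decision: str
    reasoning: str
    flags: list[str] = field(default_factory=list)


def parse_fraud_result(raw: str) -> FraudResult:
    """Parse agent output, rejecting prose and out-of-range scores."""
    data = json.loads(raw)        # raises json.JSONDecodeError (a ValueError) on prose
    result = FraudResult(**data)  # raises TypeError on missing or unexpected keys
    if not 0.0 <= result.fraud_score <= 1.0:
        raise ValueError(f"fraud_score out of range: {result.fraud_score}")
    return result


good = parse_fraud_result(
    '{"fraud_score": 0.23, "decision": "low_risk", "reasoning": "...", "flags": []}'
)
```

A parse failure is itself a useful signal: it can be routed to a retry or to HITL instead of silently passing free text downstream.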

6. Multiple HITL Gates Beat One

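Don't rely on a single HITL gate at the end. Place checkpoints throughout the pipeline: after Fraud (score 0.45–0.90), after any agent whose confidence falls below its threshold, and after the Evaluator.

This creates a safety net where uncertainty triggers oversight at the point it appears, not just in an end-of-pipeline check.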

7. Use Different LLMs for Different Roles

| Use Case | Model Choice | Why |
|---|---|---|
| Pipeline agents | gemini-2.5-flash | Fast, reasoning-capable |
| Evaluator | gemini-2.5-flash-lite | Independent, lower temp (0.3) |
| Embeddings | all-MiniLM-L6-v2 | Local, free, good quality |

The evaluator uses a different model to avoid bias (model grading its own outputs).

8. Country-Aware Config Files

Don't hardcode country rules in Python. Use YAML files:

configs/
  base.yaml           # Shared settings
  countries/
    us.yaml           # USD, SSN/DL masking, year depreciation
    india.yaml        # INR, Aadhaar/PAN masking, IRDAI part-wise
    [future].yaml     # Add new countries without code changes

Adding a new country = creating a new YAML file. No code changes needed.
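
The layering of `base.yaml` and a country file could work as a recursive merge, sketched below. To keep the example dependency-free, plain dictionaries stand in for the parsed YAML (in practice they would come from a YAML loader), and all keys shown are hypothetical.

```python
from copy import deepcopy


def merge_config(base: dict, override: dict) -> dict:
    """Return base with override applied; nested dicts are merged, not replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for parsed base.yaml and countries/us.yaml (hypothetical keys).
base = {"guardrails": {"max_tokens": 50000, "max_cost": 0.50}, "currency": None}
us = {"currency": "USD", "guardrails": {"max_cost": 0.75}, "pii": {"mask": ["ssn"]}}

config = merge_config(base, us)
```

Because the merge is generic, a new country file only needs to list the settings that differ from the base.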

9. Token & Cost Tracking Per Claim

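Use a LangChain callback handler to track tokens and cost. Note that the handler shown here is OpenAI-specific: its cost figure is computed from OpenAI pricing, so with other providers (such as the Gemini models used in this pipeline) it may report zero cost and you would derive cost from the token counts instead.

# In recent LangChain releases this import lives in langchain_community;
# older releases exposed it as langchain.callbacks.
from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = agent.invoke(state)

# Totals remain available on `cb` after the block exits.
print(f"Tokens: {cb.total_tokens}")
print(f"Cost: ${cb.total_cost:.4f}")

Store these figures in the claim record. If a claim type consistently costs 10x the average, investigate.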

10. Fast Mode for Low-Value Claims

Real-world optimization: below a threshold ($500), a full investigation can cost more than the claim is worth. The system auto-detects these claims and processes them in seconds instead of minutes by skipping the Damage and Policy agents.
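
A sketch of how fast-mode routing might be decided. The agent names follow the pipeline described earlier and the $500 threshold comes from the text; the function and constant names are illustrative, and treating exactly $500 as a full-pipeline claim is an assumption.

```python
FAST_MODE_THRESHOLD = 500.0  # dollars, per the text

FULL_PATH = ["intake", "fraud", "damage", "policy",
             "settlement", "evaluator", "communication"]
SKIPPED_IN_FAST_MODE = {"damage", "policy"}


def plan_agents(claim_amount: float) -> list[str]:
    """Return the ordered agents to run for a claim of the given value."""
    if claim_amount < FAST_MODE_THRESHOLD:
        return [a for a in FULL_PATH if a not in SKIPPED_IN_FAST_MODE]
    return list(FULL_PATH)
```

Fraud still runs in fast mode, so a cheap claim does not become an unchecked one.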

Evaluation & Monitoring

Why LLM-as-Judge Works

You can't manually review every claim decision. You need automated quality evaluation.

LLM-as-Judge uses a separate LLM to grade the pipeline's output on objective dimensions. This gives you quantitative scores you can track over time.

The 5 Evaluation Dimensions

1. Accuracy (0.0–1.0)

2. Completeness (0.0–1.0)

3. Fairness (0.0–1.0)

4. Safety (0.0–1.0)

5. Transparency (0.0–1.0)

Batch Sampling Strategy

Run the evaluator on sampled batches of claims and track the scores over time. If Transparency drops from 0.85 to 0.72, investigate.
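
Score tracking across the five dimensions might look like the sketch below. The source does not say how the composite is weighted, so an unweighted mean is assumed here, and the 0.10 per-dimension drop tolerance is likewise an assumption.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "fairness", "safety", "transparency")


def composite_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the five dimensions (assumed weighting)."""
    return round(mean(scores[d] for d in DIMENSIONS), 3)


def dropped_dimensions(baseline: dict[str, float], current: dict[str, float],
                       max_drop: float = 0.10) -> list[str]:
    """Dimensions whose score fell by more than max_drop versus the baseline."""
    return [d for d in DIMENSIONS if baseline[d] - current[d] > max_drop]


baseline = {d: 0.85 for d in DIMENSIONS}
current = dict(baseline, transparency=0.72)
flagged = dropped_dimensions(baseline, current)
```

Checking dimensions individually matters because a single weak dimension can hide inside a composite that still clears 0.70.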

Monitoring Metrics

| Metric | What to Track | Alert Threshold |
|---|---|---|
| Evaluator composite score | Average per day/week | Drop > 0.10 from baseline |
| HITL rate | % of claims pausing | Spike > 2x baseline |
| Fraud detection rate | % with score > 0.45 | Drop to near zero |
| Average cost per claim | LLM token cost | Increase > 50% |
| Pipeline completion time | Median time start to end | Increase > 2x |
| Agent confidence scores | Average by agent | Drop > 0.15 |
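
Several of the alert thresholds above translate directly into comparisons against a baseline. This sketch covers four of the six rows; the metric key names are hypothetical, and the fraud-rate and per-agent confidence checks are left out because the table does not give them a single numeric rule.

```python
def check_alerts(baseline: dict, current: dict) -> list[str]:
    """Compare current metrics with a baseline and list triggered alerts."""
    alerts = []
    if baseline["evaluator_score"] - current["evaluator_score"] > 0.10:
        alerts.append("evaluator composite score dropped > 0.10")
    if current["hitl_rate"] > 2 * baseline["hitl_rate"]:
        alerts.append("HITL rate spiked > 2x")
    if current["cost_per_claim"] > 1.5 * baseline["cost_per_claim"]:
        alerts.append("average cost per claim increased > 50%")
    if current["median_seconds"] > 2 * baseline["median_seconds"]:
        alerts.append("pipeline completion time increased > 2x")
    return alerts
```

The baseline itself should be recomputed periodically, otherwise gradual drift never trips an alert.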

A/B Testing Agent Prompts

When you update an agent's prompt, run an A/B test:

  1. Route 10% of claims to the new prompt
  2. Compare evaluator scores: old vs new
  3. Compare HITL rates: old vs new
  4. If the new prompt improves scores → roll out to 100%
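
Step 1 above needs a stable way to pick the 10%. One common approach, sketched here as an assumption rather than the system's actual method, is to hash the claim ID into a bucket so that the same claim always gets the same prompt variant.

```python
import hashlib


def prompt_variant(claim_id: str, rollout_percent: int = 10) -> str:
    """Deterministically assign a claim to the 'new' or 'old' prompt."""
    digest = hashlib.sha256(claim_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0-99
    return "new" if bucket < rollout_percent else "old"
```

Hash-based assignment needs no stored state, and raising `rollout_percent` keeps every claim already on the new prompt on it.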

Human Feedback Loop

When a reviewer overrides an AI decision:

  1. Store the override in episodic memory
  2. Flag the claim for re-evaluation
  3. If many overrides on same agent → investigate prompt
  4. Use overrides as training data for fine-tuning (future)

Common Pitfalls to Avoid

1. Letting LLMs Do Arithmetic

❌ Don't Do This

Asking the LLM to compute: settlement = (damage - depreciation) - deductible

Why: LLMs are unreliable at arithmetic. Even a $0.01 error on a $10K claim is unacceptable.

✅ Do This Instead

The LLM estimates severity and raw cost. Python applies the depreciation formula and computes the settlement.
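
A sketch of the deterministic side of that split, using `Decimal` so cents are exact. The formula is the one quoted above; treating depreciation as a flat rate on the damage amount and flooring the payout at zero are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def compute_settlement(damage: str, depreciation_rate: str, deductible: str) -> Decimal:
    """settlement = (damage - depreciation) - deductible, rounded to cents."""
    damage_amt = Decimal(damage)
    depreciation = damage_amt * Decimal(depreciation_rate)  # assumed flat rate
    payout = (damage_amt - depreciation) - Decimal(deductible)
    payout = max(payout, Decimal("0"))  # assumed: never pay a negative amount
    return payout.quantize(CENT, rounding=ROUND_HALF_UP)


payout = compute_settlement("10000.00", "0.15", "500.00")
```

Passing amounts as strings avoids binary floating-point error creeping in before `Decimal` ever sees the value.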

2. Generic Error Messages

❌ Don't Do This

"Your claim was denied."

Why: Legally problematic in insurance. The claimant has a right to a specific reason.

✅ Do This Instead

"Your policy excludes flood damage under Section 4.2(c). The incident was caused by external flooding, which falls under this exclusion."

3. Single HITL Gate at the End

❌ Don't Do This

Only checking whether the settlement should be reviewed after all agents have run.

Why: Wasted compute. If the fraud score is 0.75, you don't need to run Damage, Policy, and Settlement before a human looks at the claim.

✅ Do This Instead

Multiple HITL gates: after Fraud (0.45–0.90), after any agent with low confidence, after Evaluator.

4. No Durable Checkpoints

❌ Don't Do This

Storing HITL pause state only in process memory.

Why: Server restart = lost state. The reviewer comes back the next day and the claim is gone.

✅ Do This Instead

Use LangGraph's SqliteSaver. State persists to disk and survives restarts.

5. Hardcoded Country Rules

❌ Don't Do This

if country == "us":
    depreciation_rate = 0.15
elif country == "india":
    depreciation_rate = 0.05

Why: Adding a new country = code changes. Not scalable.

✅ Do This Instead

YAML config files per country. Adding a new country = a new YAML file, zero code changes.

6. Ignoring Token Costs

❌ Don't Do This

Not tracking LLM usage per claim.

Why: One misconfigured agent can rack up $100s in API costs before you notice.

✅ Do This Instead

Track tokens + cost per claim. Set hard caps ($0.50 per claim). Alert if average cost spikes.

7. Same LLM for Pipeline and Evaluation

❌ Don't Do This

Using the same model (e.g., gpt-4) for both pipeline agents and the evaluator.

Why: A model grading its own outputs creates bias: it tends to score itself higher.

✅ Do This Instead

Use a different model for the evaluator (e.g., gemini-2.5-flash-lite). Independence matters.

8. No Memory = No Learning

❌ Don't Do This

Treating every claim as isolated, with no historical context.

Why: The system never gets better. Day 1000 is no smarter than Day 1.

✅ Do This Instead

Store outcomes in ChromaDB. Agents search memory to calibrate decisions. The system improves over time.

9. Tight Coupling Between Agents

❌ Don't Do This

def damage_agent(state):
    fraud_score = fraud_agent(state)  # direct call
    ...

Why: Agents become tangled. You can't swap out the Fraud agent without breaking the Damage agent.

✅ Do This Instead

Agents read from shared state: fraud_score = state["fraud_score"]. No direct calls.
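
A sketch of the decoupled version, in which each agent reads the shared state and returns only the keys it owns. The `ClaimState` fields, the placeholder values, and the simple loop that stands in for the orchestrator are all illustrative assumptions.

```python
from typing import TypedDict


class ClaimState(TypedDict, total=False):
    claim_id: str
    fraud_score: float
    damage_estimate: float


def fraud_agent(state: ClaimState) -> dict:
    # Placeholder score; a real agent would call the fraud crew here.
    return {"fraud_score": 0.23}


def damage_agent(state: ClaimState) -> dict:
    fraud_score = state["fraud_score"]  # read from shared state, no direct call
    # Placeholder estimate; only illustrates that upstream output is available.
    estimate = 0.0 if fraud_score >= 0.90 else 1000.0
    return {"damage_estimate": estimate}


# Stand-in for the orchestrator: run agents in order, merging their updates.
state: ClaimState = {"claim_id": "CLM-001"}
for agent in (fraud_agent, damage_agent):
    state.update(agent(state))
```

Swapping in a different fraud agent now only requires that it write `fraud_score`; the Damage agent does not change.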

10. Vague Confidence Scores

❌ Don't Do This

Asking the LLM "How confident are you?" and letting it return prose like "fairly confident".

Why: You can't compare scores across agents or set thresholds.

✅ Do This Instead

Force structured output: {"confidence": 0.85} as a float in the range 0.0–1.0. Make it parseable and comparable.