A comprehensive guide based on the Smart Claims Processor case study, covering LangGraph orchestration, CrewAI sub-crews, memory systems, HITL patterns, and production best practices.
The Smart Claims Processor is a production-grade multi-agent system that processes insurance claims end-to-end. It demonstrates key patterns for building reliable AI systems:
| Agent | Role | LLM vs Rules | Confidence Threshold |
|---|---|---|---|
| Intake | Validate & mask PII | Mostly rules | 0.55 |
| Fraud Crew | 3 specialists + manager | LLM-heavy | 0.50 |
| Damage | Assess severity, depreciate | LLM + deterministic math | 0.60 |
| Policy | Coverage & exclusions | DB lookup + LLM reasoning | 0.60 |
| Settlement | Calculate final payout | LLM validation + Python math | 0.65 |
| Evaluator | Grade pipeline output | LLM (separate model) | 0.70 composite |
| Communication | Generate claimant message | LLM | N/A (always runs) |
Intake → Fraud → Damage → Policy → Settlement → Evaluator → Communication
All 7 agents run sequentially. Triggers when fraud is low (< 0.45), confidence is high, and no exclusions hit.
Any Agent → [PAUSE] → Reviewer Decision → Resume → Communication
Triggers when: fraud 0.45–0.90, confidence < threshold, evaluator score < 0.70, or high-value claim.
Intake → Fraud (≥ 0.90) → Communication
Bypasses all other agents. Used for clear fraud patterns.
Intake [FAIL] → Communication
Missing fields, lapsed policy, or incident date out of range.
Intake → Fraud → Settlement → Evaluator → Communication
Skips Damage and Policy checks for low-value claims with clean history.
Never trust an LLM with arithmetic that has financial consequences. Use LLMs for judgment (severity assessment, exclusion reasoning) and Python for math (depreciation, settlement formulas).
Agents don't call each other directly. They all read/write from a shared ClaimState TypedDict. This prevents tight coupling and makes the system composable.
Every agent outputs a confidence score. Multiple checkpoints can trigger HITL review. This creates a safety net where uncertainty triggers human oversight.
LangGraph's SqliteSaver persists state to disk. The pipeline survives server restarts. HITL pauses can last hours or days without losing state.
Role: The gatekeeper. Validates required fields, checks policy status, and masks PII before any data reaches downstream agents.
Sets pipeline_path to "fast" if amount < $500.
This agent is mostly rule-based. LLMs are optional here and are only needed to extract structured data from free-text claim descriptions.
Only when claimant submits unstructured text like: "My Honda was rear-ended last Tuesday, bumper damage, shop quoted $4,200"
The LLM extracts: claim_type=auto, amount=4200, damage_type=rear_end_collision
If the frontend already collects structured data, skip the LLM entirely.
This happens before any downstream agent sees the data. Use country-specific regex patterns:
PII_PATTERNS = {
"us": [
(r"\b\d{3}-\d{2}-\d{4}\b", "***-**-****"), # SSN
(r"\b[A-Z]\d{7}\b", "[DL-MASKED]"), # Driver license
(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "***-***-****"), # Phone
],
"india": [
(r"\b\d{4}\s\d{4}\s\d{4}\b", "**** **** ****"), # Aadhaar
(r"\b[A-Z]{5}\d{4}[A-Z]\b", "[PAN-MASKED]"), # PAN
(r"\b[6-9]\d{9}\b", "**********"), # Mobile
],
}
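The patterns only help once they are applied in order to every free-text field. A minimal sketch of that step, assuming a hypothetical helper named `mask_pii` (not from the case study) and repeating a subset of the patterns so the block runs on its own:

```python
import re

# Subset of the guide's patterns, repeated here so the sketch is self-contained.
PII_PATTERNS = {
    "us": [
        (r"\b\d{3}-\d{2}-\d{4}\b", "***-**-****"),             # SSN
        (r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "***-***-****"),  # Phone
    ],
    "india": [
        (r"\b\d{4}\s\d{4}\s\d{4}\b", "**** **** ****"),        # Aadhaar
        (r"\b[6-9]\d{9}\b", "**********"),                      # Mobile
    ],
}


def mask_pii(text: str, country: str) -> str:
    """Apply every pattern registered for `country`, in list order.

    Unknown countries return the text unchanged; a stricter system might
    prefer to fail intake instead (that choice is an assumption here).
    """
    for pattern, replacement in PII_PATTERNS.get(country, []):
        text = re.sub(pattern, replacement, text)
    return text


print(mask_pii("SSN 123-45-6789, call 555-123-4567", "us"))
print(mask_pii("Aadhaar 1234 5678 9012, mobile 9876543210", "india"))
```

Running the masking once, inside Intake, is what lets every downstream agent treat its input as already safe.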
{
"intake_passed": true,
"masked_data": {
"claimant_name": "John Doe [PII-MASKED]",
"description": "Car accident. My number is ***-***-****",
"claim_amount": 4200.0
},
"pipeline_path": "normal", # or "fast" if amount < $500
"confidence_scores": {
"intake": 0.95
}
}
If intake fails, immediately route to Communication Agent with a specific denial reason:
Never send a vague "your claim cannot be processed" message.
Role: Multi-agent fraud analysis using CrewAI. Three specialist agents investigate the claim, a manager synthesizes their findings into one fraud score.
Tool: search_fraud_patterns()
Checks claim against known fraud schemes: staged accidents, phantom passengers, inflated invoices, pre-existing damage.
Tool: check_statistical_outlier()
Computes z-score vs benchmark. Flags if repair cost is 2+ standard deviations from the mean for this damage type.
Tool: check_claimant_history()
Cross-references claimant's prior claims. Flags multiple claims in short periods or claims across providers.
No tools: reads all 3 reports
Synthesizes findings into final fraud score (0.0–1.0) with reasoning and flags.
crew = Crew(
agents=[pattern_analyst, anomaly_detector,
social_validator, manager_agent],
tasks=[pattern_task, anomaly_task, social_task, synthesis_task],
process=Process.sequential, # each feeds into next
verbose=True
)
# synthesis_task has context=[pattern_task, anomaly_task, social_task]
# This is how the manager reads all three reports
| Fraud Score | Action | Next Node |
|---|---|---|
| < 0.45 | Low risk, continue pipeline | Damage Assessor (or Settlement if fast mode) |
| 0.45 – 0.89 | Suspicious, pause for review | HITL Checkpoint |
| ≥ 0.90 | Confirmed fraud, auto-reject | Communication Agent |
Fraud detection benefits from multiple perspectives. A single LLM call might miss patterns that a team of specialists catches. CrewAI's crew abstraction maps well to this: three analysts submit reports, and the manager synthesizes them.
You could do this with three separate LangGraph nodes, but the crew keeps the specialist prompts, tools, and report hand-off (via task context) in one place with less code.
Notice the tools return structured JSON, not prose. The LLM reads the JSON and incorporates it into its reasoning:
import json
@tool("check_statistical_outlier")
def check_statistical_outlier(claim_type: str, amount: float) -> str:
    benchmarks = {"auto": {"mean": 4000, "std": 1500}}
    b = benchmarks.get(claim_type)
    if b is None:
        # No benchmark for this claim type: report that instead of crashing
        return json.dumps({"error": f"no benchmark for claim_type '{claim_type}'"})
    z_score = (amount - b["mean"]) / b["std"]
    return json.dumps({
        "z_score": round(z_score, 2),
        "is_outlier": abs(z_score) > 2,
        "benchmark_mean": b["mean"]
    })
Never give the LLM direct database access. Always have a Python tool that runs the query and returns structured data. The LLM reasons about the data, not about SQL.
{
"fraud_score": 0.23,
"decision": "low_risk",
"reasoning": "No known patterns matched. Claim amount within normal range...",
"flags": ["low_risk"],
"pattern_matches": [],
"anomaly_detected": false,
"history_clean": true
}
Role: Assess damage severity and estimate repair cost. Apply country-specific depreciation rules deterministically.
def apply_depreciation_us(raw_cost: float, vehicle_age_years: int) -> dict:
    rates = {1: 0.20, 2: 0.15, 3: 0.12, 4: 0.10}
    # Ages not listed (5+ years) fall back to 8%
    rate = rates.get(vehicle_age_years, 0.08)
    depreciation = raw_cost * rate
    final = raw_cost - depreciation
    return {
        "raw_cost": raw_cost,
        "depreciation_rate": rate,
        "depreciation_amount": round(depreciation, 2),
        "final_cost": round(final, 2),
        "method": "year_based"
    }
def apply_depreciation_india(parts: dict) -> dict:
    # parts = {"rubber": 2000, "metal": 8000, "glass": 1500}
    rates = {"rubber": 0.50, "metal": 0.05, "glass": 0.00}
    total_raw = 0
    total_after = 0
    for part, cost in parts.items():
        rate = rates.get(part, 0.10)  # unlisted part types default to 10%
        after = cost * (1 - rate)
        total_raw += cost
        total_after += after
    return {
        "total_raw_cost": round(total_raw, 2),
        "total_after_dep": round(total_after, 2),
        "method": "part_wise_irdai"
    }
The LLM can call search_similar_claims() to calibrate its estimate against historical data:
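@tool("search_similar_claims")
def search_similar_claims(claim_type: str, description: str) -> str:
    # ChromaDB semantic search (covered in memory section)
    results = long_term_collection.query(
        query_embeddings=[embed_text(description)],
        n_results=5
    )
    # query() returns parallel lists; settlements live in the stored metadata
    metadatas = results["metadatas"][0]
    settlements = [m["settlement_amount"] for m in metadatas]
    # On day 1 the collection is empty, so avoid dividing by zero
    avg_settlement = sum(settlements) / len(settlements) if settlements else None
    return json.dumps({
        "similar_claims_found": len(settlements),
        "average_settlement": avg_settlement,
        "common_severity": "medium"
    })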
On day 1 with zero historical claims, the LLM relies on its training data. By day 100 with 500 completed claims, it searches memory and finds: "Similar 2020 Honda Civics averaged $4,200". The system gets smarter over time.
If confidence < 0.60, the pipeline pauses for HITL review. Common reasons:
{
"llm_assessment": {
"severity": "medium",
"raw_cost_estimate": 4200.0,
"confidence": 0.88,
"reasoning": "Front-end collision with bumper and hood damage...",
"vehicle_age_years": 2
},
"after_depreciation": {
"method": "year_based",
"depreciation_rate": 0.15,
"depreciation_amount": 630.0,
"final_cost": 3570.0
}
}
Role: Determine if the claim is covered under the policy. Check exclusions, apply deductible, calculate maximum payout.
Policy rules are structured data (coverage limits, deductibles, exclusions). Python fetches those facts from the database. The LLM then applies judgment:
@tool("get_policy_details")
def get_policy_details(policy_number: str, claim_type: str) -> str:
    policy = POLICY_DB.get(policy_number)
    if policy is None:
        return json.dumps({"error": "policy_not_found"})
    return json.dumps({
        "coverage_limit": policy["limits"].get(claim_type, 0),
        "deductible": policy["deductibles"].get(claim_type, 0),
        "exclusions": policy["exclusions"],
        # Same key as the sum below and as check_remaining_coverage()
        "prior_claims": policy["prior_claims"],
        "total_prior_paid": sum(c["amount"] for c in policy["prior_claims"])
    })
@tool("check_remaining_coverage")
def check_remaining_coverage(policy_number: str,
                             claim_type: str,
                             damage_estimate: float) -> str:
    policy = POLICY_DB.get(policy_number)
    if policy is None:
        return json.dumps({"error": "policy_not_found"})
    limit = policy["limits"].get(claim_type, 0)
    deductible = policy["deductibles"].get(claim_type, 0)
    prior_paid = sum(c["amount"] for c in policy["prior_claims"])
    remaining_limit = limit - prior_paid
    payable_damage = max(0, damage_estimate - deductible)
    # Clamp at zero so an exhausted limit never yields a negative payout
    max_payout = max(0, min(payable_damage, remaining_limit))
    return json.dumps({
        "remaining_limit": remaining_limit,
        "deductible": deductible,
        "max_payout": round(max_payout, 2),
        "coverage_exhausted": remaining_limit <= 0
    })
The system prompt explicitly guides exclusion reasoning:
This creates a bias toward the claimant: wrongful denial is worse than human review.
| Condition | Next Node |
|---|---|
| Ineligible (exclusion hit) | Communication Agent (denial) |
| Confidence < 0.60 | HITL Checkpoint |
| Eligible | Settlement Calculator |
{
"eligible": true,
"max_payout": 8000.0,
"deductible": 500.0,
"exclusions_hit": [],
"coverage_limit": 15000.0,
"remaining_coverage": 13800.0,
"confidence": 0.85,
"reasoning": "Claim falls within auto coverage. No exclusions apply...",
"flags": []
}
When denying for an exclusion, the LLM must cite the exact policy section:
"Your policy (POL-001) includes a flood damage exclusion under Section 4.2(c). The incident was caused by rising floodwater, which falls under this exclusion."
This is both a UX requirement and a legal one in insurance.
Role: Compute the final settlement amount using country-specific formulas. The LLM validates inputs, Python does the math.
settlement = min(
damage_after_depreciation,
damage_after_depreciation * 1.15, # 115% buffer
remaining_coverage_limit
) - deductible
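The guide describes the 115% cap as an insurance industry convention that leaves a small buffer for unforeseen costs. Note that, as written, damage_after_depreciation * 1.15 can never be the smallest term for a positive estimate, so the buffer only has an effect if the first argument is a different figure from the one being multiplied.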
settlement = min(
idv, # Insured Declared Value
remaining_coverage_limit
) - deductible
No buffer. IRDAI regulations cap at exactly 100% of IDV.
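The two formulas can be sketched as plain Python functions. This follows the expressions as written in the guide; the function names and the floor at zero (so a deductible larger than the payable amount never produces a negative settlement) are assumptions, not part of the case study.

```python
def settle_us(damage_after_depreciation: float,
              remaining_coverage_limit: float,
              deductible: float) -> float:
    """US formula as given: min of damage, 115% of damage, and remaining limit."""
    capped = min(
        damage_after_depreciation,
        damage_after_depreciation * 1.15,  # 115% buffer term from the guide
        remaining_coverage_limit,
    )
    # Assumption: never pay a negative amount
    return round(max(0.0, capped - deductible), 2)


def settle_india(idv: float,
                 remaining_coverage_limit: float,
                 deductible: float) -> float:
    """India formula as given: capped at 100% of IDV, no buffer."""
    capped = min(idv, remaining_coverage_limit)
    return round(max(0.0, capped - deductible), 2)


# Adjusted damage of 3,570 with a 500 deductible and ample remaining coverage
print(settle_us(3570.0, 13800.0, 500.0))
```

Keeping these as pure functions means the same inputs always give the same payout, which is what makes the breakdown auditable.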
By this point in the pipeline, you have structured numbers from prior agents. The LLM's job flips from reasoner to validator:
@tool("verify_settlement_consistency")
def verify_settlement_consistency(damage_estimate: float,
                                  policy_max_payout: float,
                                  claim_amount: float) -> str:
    ratio = damage_estimate / claim_amount if claim_amount > 0 else 0
    flags = []
    if ratio > 1.5:
        flags.append("damage_estimate_exceeds_claim_by_50pct")
    if ratio < 0.2:
        flags.append("damage_estimate_very_low_vs_claim")
    if damage_estimate > policy_max_payout * 2:
        flags.append("damage_far_exceeds_coverage")
    return json.dumps({
        "damage_to_claim_ratio": round(ratio, 2),
        "flags": flags,
        "looks_consistent": len(flags) == 0
    })
If the LLM spots that a prior agent's output is clearly wrong, it can set override_damage_estimate before the Python formula runs:
{
"inputs_validated": false,
"confidence": 0.45,
"reasoning": "Damage estimate of $50,000 for minor scratch is clearly wrong...",
"flags": ["damage_estimate_unrealistic"],
"override_damage_estimate": 800.0 # LLM's corrected estimate
}
This is a rare but important escape hatch for obvious errors.
Settlement is the last financial decision before payout. It demands more certainty. If confidence < 0.65, route to HITL.
The output should include a clear breakdown so the claimant (and auditors) can verify the math:
{
"method": "damage_based_us",
"damage_estimate": 4200.0,
"damage_cap_115": 4830.0,
"deductible": 500.0,
"settlement_amount": 3700.0,
"currency": "USD"
}
Role: Quality gate. A separate LLM grades the entire pipeline's output on 5 dimensions before the claim is released.
Without systematic evaluation, you have no way to know if your pipeline is producing good outputs. Manual spot-checking doesn't scale.
The evaluator runs on every claim (or a sample) and gives you quantitative scores you can track over time. If the composite score drops from 0.85 to 0.72 over a week, you know something broke.
| Dimension | What It Measures | Maps To |
|---|---|---|
| Accuracy | Are the numbers correct? Does settlement = damage - deductible? | Auditability |
| Completeness | Were all relevant factors considered? Any missing steps? | Thoroughness |
| Fairness | No demographic bias? Similar claims treated similarly? | Anti-discrimination laws |
| Safety | No harmful recommendations? Protects claimant interests? | Consumer protection |
| Transparency | Is reasoning clear? Can claimant understand the decision? | Explainability mandates |
Each dimension gets a score 0.0–1.0. Composite score = average of 5 dimensions.
If composite score ≥ 0.70 → Continue to Communication Agent
If composite score < 0.70 → Pause for HITL review
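The gate itself is simple arithmetic, so it belongs in Python rather than in the evaluator prompt. A minimal sketch, assuming the evaluator LLM returns the five dimension scores as a dict; the function name `grade` is hypothetical:

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "fairness", "safety", "transparency")
PASS_THRESHOLD = 0.70


def grade(scores: dict) -> dict:
    """Average the five dimension scores and apply the 0.70 gate."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    composite = mean(scores[d] for d in DIMENSIONS)
    return {
        # Compare the unrounded value; round only for display
        "composite_score": round(composite, 2),
        "passed": composite >= PASS_THRESHOLD,
    }


result = grade({
    "accuracy": 0.90, "completeness": 0.85, "fairness": 0.92,
    "safety": 0.95, "transparency": 0.78,
})
print(result)  # {'composite_score': 0.88, 'passed': True}
```

Computing the composite outside the LLM also guards against the evaluator reporting a composite that does not match its own dimension scores.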
Critical pattern: the evaluator uses a different model from the pipeline agents:
# Pipeline agents use this
pipeline_llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
# Evaluator uses this (different model, lower temperature)
evaluator_llm = ChatGoogleGenerativeAI(
model="gemini-2.5-flash-lite",
temperature=0.3 # lower temp for consistency
)
Why? Avoids bias where a model grades its own outputs favorably. The evaluator should be independent.
Evaluator runs at temperature=0.3 (vs 0.7–1.0 for reasoning agents). You want consistent scoring: two identical claims should get the same evaluation score.
The evaluator never modifies the settlement amount, fraud score, or any decision. It just says "this looks good" or "this needs human review."
Outputs from Agents 1-5 flow through unchanged. This separation of concerns makes the system auditable.
In the case study, the evaluator also runs as a batch job on 10% of auto-processed claims, so scores can be tracked over time.
This is how you detect model drift before customers complain.
{
"accuracy": 0.90,
"completeness": 0.85,
"fairness": 0.92,
"safety": 0.95,
"transparency": 0.78,
"composite_score": 0.88,
"passed": true,
"critical_flags": [],
"reasoning": "All numbers check out. Reasoning is clear..."
}
Role: Final agent. Generates the message sent to the claimant. Every pipeline path ends here.
| Outcome | Message Must Include |
|---|---|
| Approved | Settlement amount, breakdown (damage - deductible = payout), payment timeline, next steps |
| Denied | Specific reason (never generic), policy clause cited, appeal rights, timeline |
| HITL Pending | Review in progress, expected timeline, reviewer contact info |
| Auto-Rejected (Fraud) | Specific red flags (not accusatory), investigation notice, appeal process |
"Your claim has been denied."
"Your policy (POL-001) includes a flood damage exclusion under Section 4.2(c). The incident on June 15 was caused by rising floodwater from the nearby river, which falls under this exclusion. Water damage from internal sources (burst pipes, appliance leaks) would be covered, but external flooding is specifically excluded."
The LLM extracts the actual reason from the pipeline trace and explains it clearly. This is both a UX and legal requirement in insurance.
Every message gets a regulatory footer appended (not optional):
REGULATORY_FOOTERS = {
"us": """
---
If you disagree with this decision, you have the right to appeal.
Contact your State Insurance Commissioner:
https://content.naic.org/state-insurance-departments
For questions, call 1-800-CLAIMS-1 (M-F 9am-5pm ET)
Reference claim ID: {claim_id}
""",
"india": """
---
यदि आप इस निर्णय से असहमत हैं, तो आपको अपील करने का अधिकार है।
Contact IRDAI Grievance Redressal:
https://www.irdai.gov.in
For questions, call 1800-425-4732 (M-F 10am-6pm IST)
Reference claim ID: {claim_id}
"""
}
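Because the footer templates contain a `{claim_id}` placeholder, something has to fill it in and attach the result to every message. A minimal sketch, assuming a hypothetical helper `append_footer` and shortened stand-in templates (the real ones are the full texts above):

```python
# Shortened stand-ins for the full footer templates shown above.
FOOTERS = {
    "us": (
        "\n---\n"
        "If you disagree with this decision, you have the right to appeal.\n"
        "Reference claim ID: {claim_id}\n"
    ),
    "india": (
        "\n---\n"
        "Contact IRDAI Grievance Redressal: https://www.irdai.gov.in\n"
        "Reference claim ID: {claim_id}\n"
    ),
}


def append_footer(message: str, country: str, claim_id: str) -> str:
    """Attach the country's regulatory footer; fail loudly if none exists."""
    template = FOOTERS.get(country)
    if template is None:
        # The footer is not optional, so a missing template is an error
        raise ValueError(f"no regulatory footer configured for {country!r}")
    return message.rstrip() + "\n" + template.format(claim_id=claim_id)


print(append_footer("Dear John Doe,\nYour claim has been approved.", "us", "CLM-2024-001"))
```

Appending the footer in code, after the LLM has written the message body, means the model cannot drop or paraphrase regulated text.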
The system prompt tells the LLM explicitly:
This prevents inappropriate tone (e.g., being cheerful about a denial).
Dear John Doe,
Good news! Your auto claim (CLM-2024-001) has been approved.
Claim breakdown:
- Damage assessment: $4,200.00
- Depreciation (15%): -$630.00
- Adjusted damage: $3,570.00
- Deductible: -$500.00
─────────────────────────────
Settlement amount: $3,070.00
Payment will be processed within 3-5 business days via direct
deposit. You'll receive confirmation once complete.
Next steps:
1. Schedule repairs with your preferred shop
2. Forward final invoice to claims@insurance.com
3. Keep your claim ID handy: CLM-2024-001
---
[Regulatory footer]
Dear John Doe,
Your claim (CLM-2024-089) has been flagged and cannot be processed.
Our fraud detection identified these concerns:
- Claim matches a known staged accident pattern
- Repair estimate is 280% above benchmark for this damage
- This is your third claim in 6 months across two providers
Investigation notice:
Your claim has been referred to our Special Investigations Unit.
You will be contacted within 10 business days.
You have the right to provide additional documentation.
Contact: fraud-review@insurance.com
---
[Regulatory footer]
LangGraph is the glue that connects all 7 agents into a working pipeline. It turns a collection of functions into a state machine with durable checkpoints and conditional routing.
| Concept | What It Does |
|---|---|
| StateGraph | Defines nodes (agents) and edges (connections) |
| Nodes | Python functions: (state) → updated_state |
| Edges | Connections between nodes, can be conditional |
| State (TypedDict) | Shared data structure flowing through pipeline |
| Checkpoints | Durable state snapshots (survives crashes) |
| interrupt() | Pauses execution for HITL, resume later |
Every agent reads from and writes to this shared state:
from typing import TypedDict
class ClaimState(TypedDict):
    # Input data
    claim_id: str
    claimant_name: str
    policy_number: str
    claim_type: str
    incident_date: str
    claim_amount: float
    description: str
    country: str
    # Intake agent
    masked_data: dict
    intake_passed: bool
    intake_denial_reason: str | None
    # Fraud agent
    fraud_score: float
    # Routing
    pipeline_path: str  # "normal" | "hitl" | "auto_reject" | "invalid" | "fast"
    # All agent outputs
    agent_outputs: dict  # {agent_name: output_dict}
    confidence_scores: dict  # {agent_name: confidence_float}
    # HITL
    hitl_ticket: dict
    pipeline_status: str
from langgraph.graph import StateGraph
workflow = StateGraph(ClaimState)
# Add all nodes
workflow.add_node("intake", intake_agent_node)
workflow.add_node("fraud", fraud_detection_node)
workflow.add_node("damage", damage_assessor_node)
workflow.add_node("policy", policy_checker_node)
workflow.add_node("settlement", settlement_calculator_node)
workflow.add_node("evaluator", evaluator_node)
workflow.add_node("communication", communication_agent_node)
workflow.add_node("hitl", hitl_checkpoint_node)
# Set entry point
workflow.set_entry_point("intake")
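Each registered node is an ordinary function that takes the state and returns the keys it wants to change; LangGraph merges that partial update into the shared state. A minimal sketch of what such a node might look like, using a cut-down state type and a placeholder confidence value (both are illustrative assumptions, not the case study's intake logic):

```python
from typing import TypedDict


class MiniClaimState(TypedDict, total=False):
    # Cut-down stand-in for ClaimState, just enough for this sketch
    claim_amount: float
    intake_passed: bool
    pipeline_path: str
    confidence_scores: dict


def intake_node(state: MiniClaimState) -> dict:
    """Return only the keys this node updates, not the whole state."""
    amount = state.get("claim_amount", 0.0)
    # Copy before editing so the incoming state is not mutated in place
    scores = dict(state.get("confidence_scores", {}))
    scores["intake"] = 0.95  # placeholder; a real node would compute this
    return {
        "intake_passed": amount > 0,
        "pipeline_path": "fast" if amount < 500 else "normal",
        "confidence_scores": scores,
    }


print(intake_node({"claim_amount": 300.0}))
```

Because the node never calls another agent, it can be unit-tested with a plain dict and no graph at all.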
Every edge is a function that reads state and returns the next node name:
def route_after_fraud(state: ClaimState) -> str:
    score = state.get("fraud_score", 0.0)
    path = state.get("pipeline_path", "normal")
    if score >= 0.90:
        return "communication"  # auto-reject
    elif score >= 0.45:
        return "hitl"  # pause for review
    elif path == "fast":
        return "settlement"  # skip damage + policy
    else:
        return "damage"  # normal path
workflow.add_conditional_edges(
"fraud",
route_after_fraud,
{
"damage": "damage",
"settlement": "settlement",
"communication": "communication",
"hitl": "hitl"
}
)
There's no explicit "path A" or "path B" code. The paths emerge from routing logic:
intake → fraud → damage → policy → settlement → evaluator → communication
All 7 agents run. Fraud < 0.45, all confidence gates pass.
any_agent → hitl → [PAUSE] → resume → next_agent → communication
Triggers when: fraud 0.45–0.90, confidence < threshold, or eval < 0.70.
intake → fraud (≥0.90) → communication
Confirmed fraud. Bypasses all other agents.
intake [FAIL] → communication
Missing fields, lapsed policy, or date out of range.
intake → fraud → settlement → evaluator → communication
Amount < $500. Skips damage and policy checks.
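import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver
# In recent langgraph-checkpoint-sqlite releases, from_conn_string() is a
# context manager, so a long-lived saver is built from a connection instead.
conn = sqlite3.connect("claims_checkpoints.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)
graph = workflow.compile(
    checkpointer=checkpointer,
    interrupt_before=["hitl"]  # pause before HITL node
)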
What this gives you:
thread_id = claim_id: each claim has its own checkpoint
def process_claim(claim_data: dict) -> dict:
    config = {"configurable": {"thread_id": claim_data["claim_id"]}}
    result = graph.invoke(claim_data, config)
    return result
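from langgraph.types import Command
def resume_hitl_claim(claim_id: str, reviewer_decision: dict) -> dict:
    config = {"configurable": {"thread_id": claim_id}}
    # Command(resume=...) hands the decision to an interrupt() call inside the
    # hitl node. If the pause comes only from interrupt_before, record the
    # decision with graph.update_state(config, ...) and call graph.invoke(None, config).
    result = graph.invoke(
        Command(resume=reviewer_decision),
        config
    )
    return result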
LangChain's AgentExecutor is great for simple tool-calling loops but falls apart for complex workflows with:
LangGraph gives you explicit control over the graph topology. You define exactly which node runs after which, under what conditions.
Memory is what lets agents learn from past claims instead of treating every claim as the first one they've ever seen.
| Tier | Storage | What It Holds | Lifetime |
|---|---|---|---|
| 1. Short-Term | LangGraph State | Current claim data | One execution (+ checkpoint) |
| 2. Long-Term | ChromaDB collection | All completed claim outcomes | Permanent (7-year audit) |
| 3. Episodic | ChromaDB collection | Human overrides, fraud cases | Permanent (learning data) |
This is just the ClaimState TypedDict flowing through the pipeline. Every agent reads/writes to it. Exists only during processing (plus checkpoint persistence for HITL pauses).
{
"claim_id": "CLM-2023-045",
"claim_type": "auto",
"vehicle_type": "sedan",
"damage_type": "front_end_collision",
"incident_description": "Rear-ended at stoplight...",
"final_settlement": 3800,
"damage_severity": "medium",
"fraud_score": 0.12,
"decision": "approved",
"timestamp": "2023-06-15T14:23:00Z"
}
{
"episode_type": "fraud_confirmed",
"claim_id": "CLM-2023-089",
"description": "Staged accident, phantom passenger scheme",
"fraud_indicators": [
"3_claims_in_6_months",
"repair_shop_flagged",
"witness_inconsistent"
],
"outcome": "denied_after_investigation",
"lesson": "Multiple claims + flagged shop = high fraud risk"
}
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# 80MB, runs on CPU in ~50ms, no API calls
def embed_text(text: str) -> list[float]:
    return embedding_model.encode(text).tolist()
import chromadb
client = chromadb.PersistentClient(path="./chromadb_data")
long_term_collection = client.get_or_create_collection(
name="long_term_claims"
)
episodic_collection = client.get_or_create_collection(
name="episodic_memory"
)
def store_claim_outcome(claim_state: dict) -> None:
    # Pull the agent outputs used below (keys assumed to match the node names)
    damage_out = claim_state["agent_outputs"].get("damage", {})
    settlement_out = claim_state["agent_outputs"].get("settlement", {})
    description = f"""
    Claim type: {claim_state['claim_type']}
    Incident: {claim_state['masked_data']['description']}
    Severity: {damage_out.get('severity')}
    Settlement: ${settlement_out.get('amount')}
    """
    long_term_collection.add(
        ids=[claim_state["claim_id"]],
        embeddings=[embed_text(description)],
        documents=[description],
        metadatas=[{
            "claim_type": claim_state["claim_type"],
            "settlement_amount": settlement_out["amount"],
            "fraud_score": claim_state["fraud_score"]
        }]
    )
@tool("search_similar_claims")
def search_similar_claims(query: str, claim_type: str | None = None) -> str:
    where_filter = {"claim_type": claim_type} if claim_type else None
    results = long_term_collection.query(
        query_embeddings=[embed_text(query)],
        n_results=5,
        where=where_filter
    )
    # query() returns parallel lists, one inner list per query embedding
    metadatas = results["metadatas"][0]
    avg_settlement = (
        sum(m["settlement_amount"] for m in metadatas) / len(metadatas)
        if metadatas else 0.0
    )
    return json.dumps({
        "similar_claims_found": len(metadatas),
        "average_settlement": round(avg_settlement, 2),
        "examples": [...]  # elided in the source; fill with real examples before running
    })
Day 1: Zero historical claims. Damage Assessor relies on LLM training data only.
Day 100: 500 completed claims. Searches memory: "For 2020 Honda Civics with front-end damage, we settled 12 similar claims averaging $4,150."
Day 1000: 5000+ claims. Memory so rich that estimates are highly accurate from pattern matching alone.
Embeddings enable semantic similarity, not just keyword matching:
Stored claim: "Front-end collision, Honda Civic, bumper damage"
Search query: "Rear impact on sedan"
Result: ChromaDB finds them similar (both involve collision + sedan) even though exact words don't match.
Keyword search would miss this connection. Vector search captures it.
When a reviewer overrides an AI decision, that gets stored in episodic memory. Next time a similar claim comes through:
Agent searches: "Settlement calculation for flood damage"
Episodic memory returns: "Last time I suggested denial for flood,
reviewer approved it because the damage was from a burst pipe,
not external flooding."
Agent adjusts confidence: 0.85 → 0.60 (triggers HITL)
Over time, the system learns what kinds of overrides happen and adjusts behavior.
| Guardrail | Default | Why It Matters |
|---|---|---|
| Max agent calls | 25 | Prevents infinite loops |
| Max tokens | 50,000 | Caps LLM usage per claim |
| Max cost | $0.50 | Hard dollar limit |
| Max execution time | 300s | Timeout for pipeline |
| Min confidence | 0.60 | Forces HITL if uncertain |
All caps are configurable via environment variables. When a cap is hit, route to HITL with clear reason: "guardrail: token limit exceeded"
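One way to enforce these caps is a small per-claim budget object that every agent call reports to. A minimal sketch under stated assumptions: the class name, the environment variable names, and the choice to check only calls, tokens, and cost (the time cap would need a clock) are all hypothetical.

```python
import os
from dataclasses import dataclass, field
from typing import Optional


def _env(name: str, default: float) -> float:
    """Read a numeric cap from the environment, falling back to the default."""
    return float(os.environ.get(name, default))


@dataclass
class ClaimBudget:
    # Caps: defaults mirror the table above; env var names are hypothetical
    max_calls: int = field(default_factory=lambda: int(_env("CLAIMS_MAX_AGENT_CALLS", 25)))
    max_tokens: int = field(default_factory=lambda: int(_env("CLAIMS_MAX_TOKENS", 50_000)))
    max_cost: float = field(default_factory=lambda: _env("CLAIMS_MAX_COST_USD", 0.50))
    # Running totals for the current claim
    calls: int = 0
    tokens: int = 0
    cost: float = 0.0

    def record(self, tokens: int, cost: float) -> None:
        """Register one agent call and its usage."""
        self.calls += 1
        self.tokens += tokens
        self.cost += cost

    def violation(self) -> Optional[str]:
        """Return the HITL reason for the first cap exceeded, or None."""
        if self.calls > self.max_calls:
            return "guardrail: agent call limit exceeded"
        if self.tokens > self.max_tokens:
            return "guardrail: token limit exceeded"
        if self.cost > self.max_cost:
            return "guardrail: cost limit exceeded"
        return None
```

A router can call `violation()` after each node and send the claim to HITL with the returned string as the reason.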
The Intake Agent is the only agent that sees raw PII. Every downstream agent works with masked data. This creates a clear security boundary.
If an LLM logs or leaks data, it's already masked: no SSNs or Aadhaar numbers in the logs.
Thresholds generally rise as you get closer to the payout decision. Fraud, at 0.50, is the one exception because later stages can still catch problems:
| Agent | Threshold | Reasoning |
|---|---|---|
| Intake | 0.55 | Mostly rules, low uncertainty |
| Fraud | 0.50 | Early in pipeline, can catch later |
| Damage | 0.60 | Financial estimate, needs confidence |
| Policy | 0.60 | Exclusions are serious |
| Settlement | 0.65 | Last financial decision |
| Evaluator | 0.70 | Final quality gate |
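The table translates directly into a lookup plus one comparison. A minimal sketch, assuming hypothetical helper names `needs_hitl` and `route_after` and that each agent's confidence is already a float:

```python
# Per-agent thresholds from the table above
THRESHOLDS = {
    "intake": 0.55,
    "fraud": 0.50,
    "damage": 0.60,
    "policy": 0.60,
    "settlement": 0.65,
    "evaluator": 0.70,
}


def needs_hitl(agent: str, confidence: float) -> bool:
    """True when the agent's confidence is below its own threshold."""
    return confidence < THRESHOLDS[agent]


def route_after(agent: str, confidence: float, next_node: str) -> str:
    """Send low-confidence results to the HITL node, otherwise continue."""
    return "hitl" if needs_hitl(agent, confidence) else next_node


print(route_after("settlement", 0.64, "evaluator"))  # hitl
print(route_after("damage", 0.88, "policy"))         # policy
```

Keeping the thresholds in one dict makes them easy to move into configuration later.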
Insurance is heavily regulated. Every agent action gets logged with:
Auditors can trace any decision back through the exact chain of agent actions.
Every agent returns structured data (JSON), not prose. This makes the output:
## Good (structured)
{
"fraud_score": 0.23,
"decision": "low_risk",
"reasoning": "...",
"flags": []
}
## Bad (prose)
"The fraud analysis indicates this claim appears legitimate
with a low risk score of approximately 0.23..."
Don't just have one HITL gate at the end. Have multiple checkpoints throughout:
This creates a safety net where uncertainty triggers oversight, not just end-of-pipeline checks.
| Use Case | Model Choice | Why |
|---|---|---|
| Pipeline agents | gemini-2.5-flash | Fast, reasoning-capable |
| Evaluator | gemini-2.5-flash-lite | Independent, lower temp (0.3) |
| Embeddings | all-MiniLM-L6-v2 | Local, free, good quality |
The evaluator uses a different model to avoid bias (model grading its own outputs).
Don't hardcode country rules in Python. Use YAML files:
configs/
base.yaml # Shared settings
countries/
us.yaml # USD, SSN/DL masking, year depreciation
india.yaml # INR, Aadhaar/PAN masking, IRDAI part-wise
[future].yaml # Add new countries without code changes
Adding a new country = creating a new YAML file. No code changes needed.
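For this to work, the loader has to overlay a country file on the shared base settings. A minimal sketch of that merge, using plain dicts as stand-ins for the parsed YAML (the keys shown are hypothetical, and parsing itself would use a YAML library):

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return base overlaid with override, merging nested dicts recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-ins for the parsed contents of base.yaml and countries/us.yaml
BASE = {"thresholds": {"damage": 0.60, "settlement": 0.65}, "currency": None}
US = {"currency": "USD", "depreciation": {"method": "year_based"}}

config = deep_merge(BASE, US)
print(config)
```

With this shape, a new country only supplies the keys that differ from the base.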
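Use a LangChain callback handler to track tokens and cost. The example uses the OpenAI callback, which only knows OpenAI pricing; with the Gemini models used elsewhere in this guide, read token counts from each response's usage_metadata and apply your own price table:
from langchain_community.callbacks import get_openai_callback
with get_openai_callback() as cb:
    result = agent.invoke(state)
    print(f"Tokens: {cb.total_tokens}")
    print(f"Cost: ${cb.total_cost}")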
Store this in the claim record. If a claim type consistently costs 10x average, investigate.
Real-world optimization: below a threshold ($500), full investigation costs more than the claim. The system auto-detects these and processes in seconds instead of minutes by skipping Damage + Policy agents.
You can't manually review every claim decision. You need automated quality evaluation.
LLM-as-Judge uses a separate LLM to grade the pipeline's output on objective dimensions. This gives you quantitative scores you can track over time.
Run the evaluator on:
Track scores over time. If Transparency drops from 0.85 to 0.72, investigate.
| Metric | What to Track | Alert Threshold |
|---|---|---|
| Evaluator composite score | Average per day/week | Drop > 0.10 from baseline |
| HITL rate | % of claims pausing | Spike > 2x baseline |
| Fraud detection rate | % with score > 0.45 | Drop to near zero |
| Average cost per claim | LLM token cost | Increase > 50% |
| Pipeline completion time | Median time start to end | Increase > 2x |
| Agent confidence scores | Average by agent | Drop > 0.15 |
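A few of these alert rules can be sketched as a comparison between a baseline snapshot and the current period. The function name and the metric keys are assumptions for illustration; only three of the six rows are covered.

```python
def drift_alerts(baseline: dict, current: dict) -> list:
    """Compare current metrics to a baseline using the table's thresholds."""
    alerts = []
    # Evaluator composite score: drop of more than 0.10 from baseline
    if baseline["composite_score"] - current["composite_score"] > 0.10:
        alerts.append("composite_score_drop")
    # HITL rate: spike of more than 2x baseline
    if current["hitl_rate"] > 2 * baseline["hitl_rate"]:
        alerts.append("hitl_rate_spike")
    # Average cost per claim: increase of more than 50%
    if current["cost_per_claim"] > 1.5 * baseline["cost_per_claim"]:
        alerts.append("cost_increase")
    return alerts


baseline = {"composite_score": 0.85, "hitl_rate": 0.10, "cost_per_claim": 0.20}
current = {"composite_score": 0.72, "hitl_rate": 0.15, "cost_per_claim": 0.35}
print(drift_alerts(baseline, current))  # ['composite_score_drop', 'cost_increase']
```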
When you update an agent's prompt, run A/B test:
When a reviewer overrides an AI decision:
Asking the LLM to compute: settlement = (damage - depreciation) - deductible
Why: LLMs are unreliable at arithmetic. A $0.01 error in a $10K claim is unacceptable.
LLM estimates severity and raw cost. Python applies depreciation formula and computes settlement.
"Your claim was denied."
Why: Legally problematic in insurance. Claimant has right to specific reason.
"Your policy excludes flood damage under Section 4.2(c). The incident was caused by external flooding, which falls under this exclusion."
Only checking if settlement should be reviewed after all agents have run.
Why: Waste of compute. If fraud is 0.75, you don't need to run Damage, Policy, Settlement.
Multiple HITL gates: after Fraud (0.45–0.90), after any agent with low confidence, after Evaluator.
Storing HITL pause state in memory.
Why: Server restart = lost state. Reviewer comes back next day, claim is gone.
Use LangGraph's SqliteSaver. State persists to disk, survives restarts.
if country == "us":
    depreciation_rate = 0.15
elif country == "india":
    depreciation_rate = 0.05
Why: Adding a new country = code changes. Not scalable.
YAML config files per country. Adding new country = new YAML, zero code changes.
Not tracking LLM usage per claim.
Why: One misconfigured agent can rack up $100s in API costs before you notice.
Track tokens + cost per claim. Set hard caps ($0.50 per claim). Alert if average cost spikes.
Using gpt-4 for both pipeline agents and the evaluator.
Why: A model grading its own outputs creates bias; it tends to score itself higher.
Use different model for evaluator (e.g., gemini-2.5-flash-lite). Independence matters.
Treating every claim as isolated, no historical context.
Why: System never gets better. Day 1000 is no smarter than Day 1.
Store outcomes in ChromaDB. Agents search memory to calibrate decisions. System improves over time.
def damage_agent(state):
    fraud_score = fraud_agent(state)  # direct call
    ...
Why: Agents become tangled. Can't swap out Fraud agent without breaking Damage agent.
Agents read from shared state: fraud_score = state["fraud_score"]. No direct calls.
Asking LLM: "How confident are you?" and letting it return prose like "fairly confident".
Why: Can't compare scores across agents or set thresholds.
Force structured output: {"confidence": 0.85} as a float 0.0–1.0. Make it parseable and comparable.
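Forcing structure only helps if the pipeline also rejects anything that does not match. A minimal validation sketch, assuming the agent's raw reply is a JSON string; the function name `parse_confidence` is hypothetical:

```python
import json


def parse_confidence(raw: str) -> float:
    """Extract a confidence float in [0.0, 1.0] or raise ValueError."""
    payload = json.loads(raw)  # non-JSON prose raises a ValueError subclass
    if not isinstance(payload, dict) or "confidence" not in payload:
        raise ValueError("reply must be a JSON object with a 'confidence' key")
    value = payload["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= value <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    return float(value)


print(parse_confidence('{"confidence": 0.85}'))  # 0.85
```

A reply that fails validation can be treated like a low-confidence result and routed to HITL rather than silently defaulted.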