// Engineering Notes · 2025

Agentic AI vs MCP

A complete field guide to the paradigms, patterns, pitfalls, and production realities of autonomous AI systems and the Model Context Protocol.

Agentic AI MCP LangGraph · Tool Use · RAG
🧠
01

Definitions & Mental Models

Agentic AI

An autonomous reasoning loop where an LLM plans, takes actions, observes outcomes, and iterates toward a goal — without a fixed, pre-specified execution path.

  • 🔁Think → Act → Observe loop
  • 🎯Goal-directed, not instruction-directed
  • 🛠️Decides what tools to use and when
  • 📐Can decompose multi-step tasks
  • 🧩Framework: LangGraph, AutoGen, CrewAI

MCP (Model Context Protocol)

A standardized protocol (by Anthropic) that defines how LLMs connect to external tools, data sources, and services — a universal plug-in interface.

  • 🔌Protocol, not a framework
  • 📦Exposes: Tools, Resources, Prompts
  • 🌐Transport: stdio, HTTP/SSE
  • 📋JSON-RPC 2.0 message format
  • 🏗️Hosts: Claude Desktop, custom apps
💡

The Key Mental Model

Agentic AI is the behavior — MCP is the plumbing. Agentic AI describes how an LLM reasons autonomously across multiple steps. MCP describes how the LLM accesses external capabilities. You can have agentic AI without MCP (raw function calls), and MCP without agentic AI (single-turn tool use).

Core Primitives

🤖

Agent

An LLM + instructions + tools that can autonomously execute multi-step workflows. Has its own memory, reasoning chain, and decision loop.

🔧

MCP Tool

A callable function exposed by an MCP server. The LLM can discover, invoke, and get results from tools. Tools have JSON Schema inputs.

📚

MCP Resource

Read-only data the model can access (files, DB rows, API responses). Unlike tools, resources don't perform actions — they provide context.

📝

MCP Prompt

Pre-built, parameterized prompt templates exposed by an MCP server. Useful for standardizing how an LLM interacts with a specific domain.

🗂️

Tool Call (native)

Raw function-calling in OpenAI/Anthropic APIs. Not standardized — each integration is bespoke. Predecessor pattern to MCP.

⚙️

Orchestrator

The system that manages agent state, routes between agents, and aggregates results. Can be LangGraph, custom state machine, or a parent agent.

⚖️
02

Core Differences

Dimension Agentic AI MCP
Nature Behavioral paradigm / system design Communication protocol / standard
Scope End-to-end task completion, reasoning loops How LLMs connect to external capabilities
Decision-making LLM decides what to do and when Protocol delivers how to access a capability
State Maintains task state, memory, context across steps Stateless per call (server can be stateful)
Composability Agents can spawn sub-agents (multi-agent) Servers can be composed by listing multiple
Failure handling Agent retries, replans, backtracks Protocol errors → caller handles
Auth Embedded in agent design (API keys, tokens) OAuth 2.1 / bearer tokens in protocol headers
Portability Framework-specific (LangGraph ≠ CrewAI) Any MCP host can use any MCP server
Latency Multi-step = higher end-to-end latency Single server roundtrip per tool call
Cost Multiple LLM calls accumulate fast N/A (protocol layer, not LLM calls)
🏗️
03

Architecture Patterns

MCP Architecture

HOST
MCP Host — Claude Desktop, your app, IDE plugin. Manages LLM lifecycle.
CLIENT
MCP Client — Embedded in the host. Handles protocol negotiation, capability listing, tool invocation.
TRANSPORT
stdio (local process) or HTTP + SSE (remote). JSON-RPC 2.0 messages.
SERVER
MCP Server — Exposes Tools, Resources, Prompts. Can be Python (FastMCP), TypeScript, or any language.
BACKEND
External Services — GitHub, Postgres, Slack, internal APIs, file system, etc.

Agentic Loop Architecture (ReAct Pattern)

User Goal
LLM Thinks
(plan step)
Select Tool
(function call)
Execute Tool
Observation
(result)
LLM Evaluates
(done?)
Final Answer
🔁

Loop continues

If "done?" → No, the arrow from "LLM Evaluates" loops back to "LLM Thinks" with the new observation appended to context. This is the fundamental agentic loop.

Multi-Agent Architecture

LangGraph · Python
# Orchestrator → Specialist Agent pattern
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import create_react_agent

# Specialist agents (each has focused tools)
researcher = create_react_agent(llm, tools=[web_search, arxiv_search])
writer     = create_react_agent(llm, tools=[draft_doc, format_output])
validator  = create_react_agent(llm, tools=[fact_check, cite_sources])

def router(state):
    # Orchestrator decides next node
    if state["phase"] == "research": return "researcher"
    if state["phase"] == "write":    return "writer"
    if state["phase"] == "validate": return "validator"
    return END

graph = StateGraph(AgentState)
graph.add_node("researcher", researcher)
graph.add_node("writer",     writer)
graph.add_node("validator",  validator)
graph.add_conditional_edges("orchestrator", router)
🎯
04

When to Use Which

Scenario Use Why
Simple Q&A with a database lookup MCP Only Single tool call, no planning needed. MCP server wraps the DB query.
Research a topic and write a report Agentic Requires iterative search → synthesize → draft → refine loops.
Standardize tool access across teams MCP Only Protocol portability. One MCP server, many host applications.
Execute a multi-step manufacturing workflow Both Agent orchestrates logic; MCP servers expose ERP/MES APIs.
IDE code completion + context MCP Only File system + repo resources via MCP. No autonomous loop needed.
Autonomous bug fix with PR submission Both Agent plans fix; MCP tools for GitHub, file I/O, test runners.
LLM-as-Judge evaluation pipeline Agentic Judge LLM needs to reason across criteria, aggregate scores, decide.
Real-time customer support bot Both Lightweight agent loop + MCP tools for CRM, ticketing, KB search.
BOM data enrichment pipeline Agentic Multi-stage transform (classify → enrich → validate) with branching.
Expose internal data to Claude Desktop MCP Only MCP server is the right primitive for exposing context to a host.

Decision Flowchart

// Ask these questions in order:

Q1
Is this a single-turn action with a known, fixed tool?
→ Yes: Use native tool call or MCP tool — no agent needed.
Q2
Does the task require planning, iteration, or sub-task decomposition?
→ Yes: You need an Agentic loop — pick a framework.
Q3
Will the tools be reused across multiple host apps or teams?
→ Yes: Wrap tools in MCP servers for portability and standardized auth.
Q4
Do you need both complex reasoning AND standardized tool access?
→ Yes: Use Agentic AI + MCP together — agent calls MCP tools.
🔀
05

The Hybrid Pattern: Agentic AI + MCP

The most production-ready systems combine both. The agent provides reasoning and orchestration; MCP provides standardized, portable, discoverable tool access.

Python · LangGraph + MCP
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
from langchain_anthropic import ChatAnthropic

# Connect to MCP servers (tools discovered automatically)
client = MultiServerMCPClient({
    "filesystem": {"command": "npx", "args": ["@modelcontextprotocol/server-filesystem", "/data"]},
    "postgres":   {"url": "http://localhost:8001/sse"},
    "slack":      {"url": "http://localhost:8002/sse"},
})

async def run_agent(task: str):
    # Tools are auto-discovered from MCP servers
    tools = await client.get_tools()

    agent = create_react_agent(
        ChatAnthropic(model="claude-opus-4-5"),
        tools=tools,
        state_modifier="You are a factory planning assistant..."
    )

    async for chunk in agent.astream({"messages": [("user", task)]}):
        print(chunk)  # streamed agent steps

Why this pattern wins

Agent logic stays in your code. Tool implementations stay in MCP servers. Swapping an MCP server (e.g., upgrading your Postgres MCP) doesn't change the agent. Swapping the agent (e.g., from LangGraph to custom) doesn't change the tools.

🤖
06

Agentic AI Implementation Patterns

The most common pattern. Interleaves Reasoning and Acting. The LLM generates a thought, picks an action, gets an observation, repeats.

ReAct · Pseudo
Thought: I need to find the TAKT time for Station 3.
Action: query_database(table="stations", filter="id=3")
Observation: {"id":3, "cycle_time":27, "unit":"seconds"}
Thought: Cycle time is 27s. I need to compare with demand.
Action: calculate_takt(demand=1000, shift_hours=8)
Observation: {"takt_time": 28.8, "unit": "seconds"}
Thought: Station 3 (27s) is under TAKT (28.8s). OK.
Final Answer: Station 3 is within capacity.

Two-phase: an LLM Planner creates a structured task list, then an Executor works through each step. Better for complex, predictable workflows.

Python · Plan & Execute
# Phase 1: Planner generates structured steps
plan = planner_llm.invoke({
    "task": "Validate machine selection for bearing line"
})
# plan.steps = ["1. Fetch BOM", "2. Get TAKT", "3. Match machines", "4. Validate"]

# Phase 2: Executor runs each step with tools
results = []
for step in plan.steps:
    result = executor_agent.run(step, context=results)
    results.append(result)
    if result.should_replan:
        plan = planner_llm.replan(plan, results)  # replan if stuck

An Orchestrator agent routes subtasks to Specialist agents. Each specialist has a focused system prompt and tool set. Great for domain isolation.

⚠️

Trust Boundary Warning

In multi-agent systems, the orchestrator should not blindly trust sub-agent outputs. Validate outputs structurally (Pydantic) before feeding to the next agent. An adversarial tool result could inject malicious instructions.

Supervisor Pattern

One orchestrator LLM decides which agent to invoke next. Clean but bottlenecked on the orchestrator.

Swarm / Handoff

Agents pass control peer-to-peer. More flexible, harder to reason about execution paths.

Hierarchical

Multiple layers of orchestrators. Use when tasks decompose into deeply nested subtasks.

Agent reflects on its own failures, generates critique, and retries. Particularly powerful for code generation and factual tasks.

Reflexion · Loop
for attempt in range(max_retries):
    output = agent.generate(task)
    evaluation = evaluator_llm.score(output, rubric)

    if evaluation.score >= threshold:
        break

    reflection = reflector_llm.critique(
        task=task, output=output, score=evaluation
    )
    task = task + "\n\nPrevious attempt critique:\n" + reflection
🔌
07

MCP Implementation Guide

Minimal MCP Server (Python / FastMCP)

Python · FastMCP
from fastmcp import FastMCP
from pydantic import BaseModel

mcp = FastMCP("factory-tools", description="NeoFAB manufacturing tools")

# ── Tool: callable action ──
@mcp.tool()
def calculate_takt_time(demand_per_day: int, shift_hours: float = 8.0) -> dict:
    """Calculate TAKT time in seconds given daily demand and shift hours."""
    available_seconds = shift_hours * 3600
    takt = available_seconds / demand_per_day
    return {"takt_seconds": round(takt, 2), "demand": demand_per_day}

# ── Resource: read-only data ──
@mcp.resource("machines://catalog")
def get_machine_catalog() -> str:
    """Returns the full machine database as JSON."""
    return load_machine_db().to_json()

# ── Prompt template ──
@mcp.prompt()
def validate_line_prompt(line_name: str, takt: float) -> str:
    return f"Validate machine selections for {line_name} with TAKT={takt}s..."

if __name__ == "__main__":
    mcp.run(transport="stdio")  # or transport="sse" for remote

MCP Server via HTTP/SSE (remote / multi-tenant)

Python · FastMCP SSE
# Production: run as a service, connect over HTTP
mcp.run(
    transport="sse",
    host="0.0.0.0",
    port=8001,
    # Add auth middleware in production!
)

Client Registration (claude_desktop_config.json)

JSON · Config
{
  "mcpServers": {
    "factory-tools": {
      "command": "python",
      "args": ["/path/to/server.py"]
    },
    "remote-db": {
      "url": "http://localhost:8001/sse"
    }
  }
}

Tool Description is a First-Class Citizen

The docstring of your tool IS the prompt the LLM uses to decide when to call it. Be precise: what it does, what inputs it expects, and what it returns. Vague docstrings = wrong tool invocations.

📊
08

Evaluation & Observability

LLM-as-Judge for Agentic Pipelines

Python · LLM-as-Judge
from pydantic import BaseModel

class AgentEvalResult(BaseModel):
    correctness:  float   # 0-1: did it get the right answer?
    tool_usage:   float   # 0-1: did it use the right tools?
    efficiency:   float   # 0-1: fewest steps needed?
    faithfulness: float   # 0-1: no hallucinations?
    reasoning:    str     # explanation of scores

# Run judge against agent trace
judge_result = judge_llm.with_structured_output(AgentEvalResult).invoke({
    "task":       original_task,
    "agent_trace": tool_call_history,
    "final_answer": agent_output,
    "ground_truth": expected_answer  # optional
})

Key Metrics to Track

🔢

Step Count

Average tool calls per task. Proxy for efficiency. Spikes indicate prompt drift or tool confusion.

💸

Token Cost / Task

Agentic loops accumulate context fast. Track tokens per task, not per call. Set hard budgets.

🔄

Retry Rate

How often does the agent fail and retry? High retry rate = poor tool design or ambiguous prompts.

⏱️

End-to-End Latency

Multi-step latency can be 10-50× single-call latency. Use streaming and parallelism where possible.

🎯

Task Success Rate

Fraction of tasks completed correctly. Define "correct" explicitly with a rubric, not vibes.

🛑

Hallucination Rate

MCP tool results are ground truth. Any agent claim contradicting tool results = hallucination.

Observability Stack

🔍

Recommended Tools

LangSmith — trace every LangGraph step, inspect tool inputs/outputs. Weights & Biases — log eval metrics, compare runs. OpenTelemetry — for custom spans in production. Arize / Phoenix — drift detection over time. Always log: model, version, tool_calls, latency_ms, token_count, success/fail.

⚠️
09

Pitfalls, Loopholes & Anti-Patterns

Agentic AI Pitfalls

🔥 Infinite Loop / Runaway Agent

Agent gets stuck in a loop, making tool calls without converging. Always set max_iterations and a budget cap. Use LangGraph's interrupt/checkpoint mechanism.

🔥 Prompt Injection via Tool Results

A malicious tool result contains instructions like "Ignore previous instructions...". Sanitize all tool outputs before re-injecting into context. Never trust external data as instructions.

⚡ Context Window Explosion

Agentic loops accumulate tool results in context. A 20-step task with verbose tool outputs can exceed 128K tokens. Summarize observations, prune history, or use external memory (Redis).

⚡ Over-Reliance on the LLM's Plan

The model's plan is often wrong, especially for novel domains. Validate intermediate outputs structurally (Pydantic schemas). Use human-in-the-loop checkpoints for critical actions.

⚡ Tool Overload

Giving an agent 40+ tools degrades performance. LLMs struggle with large tool sets. Keep per-agent tool count under 10-15. Use routing to assign agents specialized subsets.

🔵 Non-Determinism in Evaluations

Two runs of the same task may take different paths. Don't evaluate on single runs. Use statistical aggregates over N runs (typically 10-50) for reliable metrics.

🔵 "Works in Dev, Fails in Prod" Drift

Dev runs on clean, small datasets; prod encounters messy, unexpected data. Test agents against adversarial inputs. Add fallback/default behaviors for unrecognized inputs.

MCP Pitfalls

🔥 No Auth on SSE Servers

An HTTP/SSE MCP server without auth is a public API. Always add bearer token middleware or mTLS in production. Never expose internal tools to the internet without auth.

🔥 Overly Permissive Tool Permissions

A execute_sql tool with write access is a footgun. Design tools with least privilege. Separate read tools from write tools. Gate destructive operations with confirmation.

⚡ Schema Drift

MCP server tool schema changes without notifying the LLM. The model's in-context tool list gets stale. Use versioning in tool names (v2_get_machine) and test after schema changes.

⚡ Poor Tool Descriptions

Vague docstrings cause wrong tool selection. The model is doing semantic matching on descriptions. Write descriptions as if explaining to a smart intern who doesn't know your system.

🔵 Long-running Tool Calls with No Timeout

An MCP tool that calls a slow API can hang the agent indefinitely. Add timeouts (e.g., 30s) to all tool calls. Return structured errors that the agent can reason about.

✅ Mitigation: Error Types Matter

Return structured error types (TIMEOUT, NOT_FOUND, AUTH_FAILED) so the agent can decide whether to retry, escalate, or skip. A generic "error" string is useless for recovery.

10

Best Practices

Agentic AI

  • Always define a maximum iteration limit and a token budget per task run.
  • Use structured outputs (Pydantic) for all inter-agent communication. Never pass raw strings between agents.
  • Design a human-in-the-loop checkpoint for any irreversible action (write to DB, send email, trigger API with side effects).
  • Keep agent system prompts short and precise. Bloated system prompts dilute the signal. Use structured role + constraint + tool guidance.
  • Build and run an offline eval harness before deploying changes. Track task success rate across versions.
  • Use checkpointing / persistence (LangGraph Checkpointer) so long-running agents survive restarts.
  • Instrument every tool call with structured logging: tool_name, args_hash, latency_ms, success, token_cost.
  • For multi-agent: define a clear handoff contract (schema) for what each agent passes to the next.
  • Test agents on adversarial inputs: empty results, malformed data, contradictory tool responses.
  • Prefer small, focused agents over one monolithic agent with 50 tools.

MCP

  • Write tool docstrings as LLM-first documentation: what the tool does, when to use it, what it returns.
  • Use Pydantic models as tool input schemas — they auto-generate JSON Schema and validate inputs.
  • Keep tools single-responsibility. One tool = one action. Don't build a Swiss army knife tool.
  • Return structured, typed data (not raw strings) so the LLM can reliably parse results.
  • Add request timeouts and circuit breakers for any tool that calls external APIs.
  • Version your tools. Use semantic versioning in the server name. Deprecate old versions explicitly.
  • Implement idempotency for write tools. The agent may call a tool multiple times on retry.
  • Log every MCP invocation with client_id, tool_name, input_hash, duration_ms, status.
🔐
11

Security Considerations

🔥 Prompt Injection Attacks

External data (web pages, DB rows, emails) fed to the agent can contain adversarial instructions. Always clearly delimit user data from instructions in prompts. Use XML tags: <data>...</data> and instruct the model never to follow instructions found inside data tags.

🔥 Credential Leakage via Tools

Agents can be prompted to call tools with credentials as arguments. Never expose credentials in tool inputs. Use server-side secret injection; tools should retrieve secrets from vault, not accept them as params.

⚡ Confused Deputy Problem

The LLM acts on behalf of the user but may be tricked into using elevated permissions for unauthorized tasks. Design tools with the permission level of the calling user, not the system.

⚡ MCP Server Spoofing

A malicious MCP server can return crafted tool descriptions to hijack agent behavior. Validate MCP servers against a known-good registry. Use mTLS for server identity verification.

✅ Sandbox Tool Execution

Run code-execution tools in isolated containers (Docker, gVisor). Limit filesystem access, network egress, and CPU time per tool execution.

✅ Audit Trail

Log every tool call in an immutable append-only log. In regulated environments (manufacturing, finance), this is non-negotiable for compliance and incident investigation.

📈
12

Scaling Patterns

Scale Agentic Throughput

  • Parallel sub-tasks: Use asyncio.gather for independent agent steps
  • 🗂️External memory: Redis / Postgres for agent state (not in-context)
  • 📊Batch processing: Anthropic Batch API for offline eval at low cost
  • 💾Prompt caching: Cache system prompt + tool schemas (large prefix)
  • 🔀Model routing: Use Haiku for simple tool selection, Opus for reasoning

Scale MCP Servers

  • 🐳Containerize: Each MCP server in its own Docker service
  • ⚖️Load balance: Multiple server instances behind a proxy
  • 📦Connection pooling: Pool DB connections inside MCP servers
  • 🔒Rate limiting: Per-client rate limits at the MCP layer
  • 🌐Caching layer: Cache read-heavy tool results (Redis TTL)

Cost Optimization

Cost Control Strategies
# 1. Prompt caching — static prefix cached at ~10% cost
client.messages.create(
    system=[{"type": "text", "text": LARGE_SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],  # cache this!
    ...
)

# 2. Model tiering — route by complexity
model = "claude-haiku-4-5" if task.complexity == "low" else "claude-opus-4-5"

# 3. Observation truncation — don't re-inject full tool result
truncated_obs = tool_result[:2000] + "... [truncated]" if len(tool_result) > 2000 else tool_result

# 4. Batch API for offline evals (50% cost savings)
batch = client.beta.messages.batches.create(requests=[...])
🔧
13

Tool Design Principles

The SMART Tool Framework

S — Specific

One tool = one action. get_machine_by_id not manage_machines. Specific names → correct LLM selection.

M — Minimal Input

Require only what's necessary. Optional params with sensible defaults. The LLM will hallucinate unknown required params.

A — Atomic

Tools should not call other tools internally. Composition is the orchestrator's job. Atomic tools are independently testable.

R — Rich Error Returns

Return structured errors with codes, not exceptions or empty strings. The agent needs actionable error info to recover.

T — Typed Outputs

Return typed data (dict with known keys) not raw strings. LLMs parse structured data much more reliably.

Tool Description Template

Good Tool Docstring
def get_station_cycle_time(station_id: str, product_line: str) -> dict:
    """
    Retrieve the measured cycle time for a manufacturing station.

    Use this when you need the actual (measured) cycle time for a station
    to compare against TAKT time. Do NOT use for theoretical times.

    Args:
        station_id: Station identifier (e.g., "ST-003", "WASH-01")
        product_line: Product line code (e.g., "2kWh", "12kWh", "HeroPack")

    Returns:
        {
          "station_id": str,
          "cycle_time_seconds": float,
          "last_measured": ISO8601 date,
          "measurement_count": int
        }

    Errors:
        STATION_NOT_FOUND — station_id does not exist
        LINE_NOT_FOUND    — product_line does not exist
        NO_DATA           — station exists but no measurements yet
    """
14

Quick Reference Cheatsheet

Topic Key Fact
MCP Transport (local) stdio — subprocess communication, zero network overhead
MCP Transport (remote) HTTP + SSE — stateful connection per session, supports auth headers
MCP Message Format JSON-RPC 2.0 — {"jsonrpc":"2.0","method":"tools/call","params":{},"id":1}
MCP Lifecycle initialize → list capabilities → call tools → terminate
ReAct max steps Default: 10-15. For complex tasks: 25-50. Always set explicitly.
LangGraph state TypedDict. All agent data lives in state. Nodes read + write state.
Prompt caching savings ~90% cost reduction on cached tokens. Min 1024 tokens to cache.
Best tool count per agent 5-15 tools. Over 20 → significant performance degradation.
Structured output reliability Pydantic + with_structured_output() > regex parsing > raw string parsing
Agent memory types In-context (ephemeral), External DB (persistent), Semantic (vector store)
FastMCP install uv add fastmcp (Python 3.10+)
LangGraph install uv pip install langgraph langchain-anthropic langchain-mcp-adapters
📌

The 3-Layer Stack for Production

Layer 1 — Tool Layer: MCP servers expose capabilities (domain tools, data access). Layer 2 — Agent Layer: LangGraph agents orchestrate multi-step reasoning using MCP tools. Layer 3 — Eval Layer: LLM-as-Judge + W&B track quality over time and catch regressions.