PydanticAI vs LangChain - Choosing an Agent Framework for Production, Not Demos

In a recent audit, a team showed me an AI assistant they'd built on top of their company knowledge base. The demo had landed well: ask how to use a feature, and it walked through the exact pain point their support queue kept seeing. Leadership signed off.

In production, the same agent told a user to open a menu option that didn't exist. Not a vague answer - a specific UI path, stated with confidence. Nobody caught it in testing. It surfaced when I audited the system, not when a user complained.

The prototype passed testing because nobody was checking whether the answer matched the product. In production, that gap becomes a liability: the model invents UI paths, and your backend has no schema to reject them.

When you're choosing an agent framework, popularity is the wrong scorecard. Pick the one that fails loudly in development and gracefully in production - or you'll find out in audit.

What "Production-Ready" Actually Requires

Tutorial agents are built to impress in a fifteen-minute demo. Production agents run unattended, handle bad inputs, and ship answers your backend has to trust. The gap between those two goals is where most teams stumble - and it's rarely visible until something reaches a user.

When I audit agent codebases, I evaluate five things the tutorials skip:

Structured, validated outputs: Can your system reject an invented menu path before it becomes user-facing advice?
Dependency injection for testing: Can you swap the knowledge base for a mock in CI without rewiring the agent?
Retry and error handling: When the model returns malformed output, does the framework retry - or do you ship a parser exception?
Observability hooks: Can you trace which document grounded a bad answer when support escalates?
Type-checker support: Will static analysis catch a breaking API change before deploy, or after the agent silently misbehaves?

If you want to score your own system, the Production Readiness Audit covers the same ground: deployment, observability, failure modes, and a prioritized remediation plan.

Side-by-Side: The Same Agent, Two Frameworks

The first item on the rubric is structured, validated outputs. The clearest way to see the framework difference is to build the same agent twice.

The task: answer natural-language questions about a CSV of sales data. The agent calls a tool to query the file, then returns a structured answer your API can pass downstream without a second parsing step.

LangChain

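import pandas as pd
from langchain.agents import create_agent
from langchain.tools import tool

# Assumption: the sales data lives in a local CSV; the path is illustrative.
df = pd.read_csv("sales.csv")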

@tool
def query_sales_csv(region: str) -> str:
    """Return total revenue for a region in the sales CSV."""
    total = df.loc[df["region"] == region, "revenue"].sum()
    return f"{region}: ${total:,.0f}"

agent = create_agent("anthropic:claude-sonnet-4-6", tools=[query_sales_csv])
result = agent.invoke({
    "messages": [{"role": "user", "content": "What was Q1 revenue in Europe?"}],
})

answer = result["messages"][-1].content  # str or content blocks; you validate the shape yourself

This is the pattern most tutorials teach. The tool works, the agent runs, the demo looks fine. But answer is a string (or a list of content blocks, depending on the model). Nothing in this flow checks that the response contains a real region name, a numeric revenue, or the right currency. If the model formats the answer as prose instead of data, your code finds out in production - or in audit.
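To make "validate the shape yourself" concrete, here is a minimal standard-library sketch of the check you end up writing by hand. It assumes the final message reuses the tool's "Region: $amount" format, which a real model is free to ignore. The names ParsedSalesAnswer and parse_sales_answer are hypothetical and not part of LangChain.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedSalesAnswer:
    """Hypothetical typed result built from the agent's free-text answer."""
    region: str
    total_revenue: float


# Matches text such as "Europe: $1,250,000" (the format the tool above returns).
_ANSWER_PATTERN = re.compile(
    r"^\s*(?P<region>[A-Za-z ]+?):\s*\$(?P<amount>[\d,]+(?:\.\d+)?)\s*$"
)


def parse_sales_answer(text: str) -> ParsedSalesAnswer:
    """Parse a 'Region: $amount' string, or raise ValueError if it is prose."""
    match = _ANSWER_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Answer is not in 'Region: $amount' form: {text!r}")
    amount = float(match.group("amount").replace(",", ""))
    return ParsedSalesAnswer(region=match.group("region").strip(), total_revenue=amount)
```

Every format the model might choose needs its own branch here, which is the maintenance cost the string-based flow carries.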

LangChain does support a response_format parameter with Pydantic models. It's opt-in, and most teams I audit haven't wired it up yet.

PydanticAI

from pydantic import BaseModel
from pydantic_ai import Agent

class SalesAnswer(BaseModel):
    region: str
    total_revenue: float
    currency: str = "USD"

agent = Agent("anthropic:claude-sonnet-4-6", output_type=SalesAnswer)

@agent.tool_plain
def query_sales_csv(region: str) -> float:
    """Return total revenue for a region in the sales CSV."""
    return float(df.loc[df["region"] == region, "revenue"].sum())

result = agent.run_sync("What was Q1 revenue in Europe?")
answer = result.output  # SalesAnswer — validated before your code runs

Here, validation isn't a step you add later - it's the contract. output_type=SalesAnswer tells the agent what shape to return. If the model produces something that doesn't match - a wrong field name, a missing revenue, a non-numeric total - PydanticAI sends the error back to the model for another attempt and raises if retries run out, so the bad output never reaches your application code. (A plain str field can't catch an invented region; that takes an enum-typed field or an output validator.) You get a SalesAnswer object your type checker understands, not a string you hope to parse.

Same task, same tool, same model. The difference is what happens after the LLM responds: LangChain hands you text and trusts you'll validate it; PydanticAI hands you a typed object or fails with an explicit exception.
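A schema only rejects what its types rule out. A field typed as str accepts any string, so catching an invented region means restricting the field to a closed set of values. The sketch below shows the idea with the standard library only; in a Pydantic model the equivalent is an Enum or Literal field. The region list and the names Region, CheckedSalesAnswer, and build_answer are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Region(str, Enum):
    # Hypothetical region list; in practice derive it from the CSV's region column.
    EUROPE = "Europe"
    NORTH_AMERICA = "North America"
    ASIA = "Asia"


@dataclass(frozen=True)
class CheckedSalesAnswer:
    region: Region
    total_revenue: float


def build_answer(region: str, total_revenue: float) -> CheckedSalesAnswer:
    """Convert raw model output into a checked answer, rejecting unknown regions."""
    try:
        checked_region = Region(region)
    except ValueError:
        allowed = ", ".join(r.value for r in Region)
        raise ValueError(f"Unknown region {region!r}; expected one of: {allowed}") from None
    return CheckedSalesAnswer(region=checked_region, total_revenue=float(total_revenue))
```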

Dependency Injection & Testability

Validated outputs tell you the shape is right. Dependency injection tells you the data is right - and lets you prove it without calling a live API on every CI run.

Agent tools don't operate in a vacuum. They read from databases, knowledge bases, and internal APIs. In production, those dependencies are real. In tests, they need to be fake - predictable, fast, and free. The question is whether your framework makes that swap explicit or forces you to hack around it.

PydanticAI: dependencies as a first-class parameter

PydanticAI declares what an agent needs via deps_type. Tools receive a RunContext and pull dependencies from ctx.deps. At run time, you pass the real implementation; in tests, you pass a fake.

from dataclasses import dataclass
from pydantic_ai import Agent, RunContext

@dataclass
class SalesDataSource:
    def revenue_for(self, region: str) -> float:
        return float(df.loc[df["region"] == region, "revenue"].sum())

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    deps_type=SalesDataSource,
    output_type=SalesAnswer,
)

@agent.tool
def query_sales_csv(ctx: RunContext[SalesDataSource], region: str) -> float:
    return ctx.deps.revenue_for(region)

# Production: agent.run_sync(prompt, deps=SalesDataSource())
# Test:       agent.run_sync(prompt, deps=FakeSalesData(revenue=1_250_000))

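The type checker enforces the contract. If a tool expects SalesDataSource and you pass something else, mypy catches it before merge. For the fake to pass that same check, FakeSalesData has to subclass SalesDataSource (or both have to satisfy a shared Protocol used as the dependency type). Your test injects FakeSalesData(revenue=1_250_000) and asserts the agent's structured output matches - no CSV file, no network, no API key in CI.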
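FakeSalesData is never defined in the snippets above. One plausible shape, sketched here without any framework, is a structural interface that both the real source and the fake satisfy. RevenueSource and query_sales are hypothetical names, and wiring the protocol in as the agent's dependency type is an assumption rather than something the snippets above show.

```python
from dataclasses import dataclass
from typing import Protocol


class RevenueSource(Protocol):
    """Anything that can report revenue for a region."""

    def revenue_for(self, region: str) -> float: ...


@dataclass
class FakeSalesData:
    """Test double: returns the same revenue for every region."""
    revenue: float

    def revenue_for(self, region: str) -> float:
        return self.revenue


def query_sales(source: RevenueSource, region: str) -> float:
    """Tool body that depends only on the protocol, not on a concrete class."""
    return source.revenue_for(region)


total = query_sales(FakeSalesData(revenue=1_250_000), "Europe")
```

Because the tool body depends only on the interface, a type checker accepts both the production source and the fake without either inheriting from the other.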

LangChain: it works, but the seams are yours to find

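LangChain agents can be tested, but the pattern most codebases follow doesn't go through a framework injection point. Recent releases let you declare a context_schema and pass per-run context to tools, but like response_format it's opt-in. The usual pattern is a module-level dependency the tool closes over, then unittest.mock.patch in tests: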

from unittest.mock import patch
from langchain.agents import create_agent
from langchain.tools import tool

data_source = SalesDataSource()  # module-level — no framework injection point

@tool
def query_sales_csv(region: str) -> str:
    """Return total revenue for a region in the sales CSV."""
    return f"{region}: ${data_source.revenue_for(region):,.0f}"

agent = create_agent("anthropic:claude-sonnet-4-6", tools=[query_sales_csv])

# Test: patch the module where data_source lives
with patch("myapp.sales_agent.data_source", FakeSalesData(revenue=1_250_000)):
    result = agent.invoke({"messages": [{"role": "user", "content": "Q1 Europe?"}]})

Here you're patching a string path - "myapp.sales_agent.data_source" - that must match exactly where the tool looks the name up at call time. Rename the file, change the import structure, or run tests in parallel that share a patched global, and you get flakes or false greens.
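The import-structure failure is easy to reproduce. The sketch below builds two throwaway in-memory modules (sketch_sources and sketch_agent, both hypothetical names) to show that patching where an object is defined has no effect once another module has imported the name into its own namespace.

```python
import sys
import types
from unittest.mock import patch

# Module that defines the dependency.
sources = types.ModuleType("sketch_sources")
sources.data_source = "real"
sys.modules["sketch_sources"] = sources

# Module that imports the name and uses it inside a tool function.
agent_module = types.ModuleType("sketch_agent")
sys.modules["sketch_agent"] = agent_module
exec(
    "from sketch_sources import data_source\n"
    "def tool():\n"
    "    return data_source\n",
    agent_module.__dict__,
)

# Patching where the object is defined does not touch the copy the tool reads.
with patch("sketch_sources.data_source", "fake"):
    patched_at_definition = agent_module.tool()

# Patching where the name is looked up does.
with patch("sketch_agent.data_source", "fake"):
    patched_at_lookup = agent_module.tool()
```

In the first case the test still exercises the real dependency while appearing to use the fake, which is how a refactor turns into a false green.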

If you've fought flaky agent tests, you've lived this. The test doesn't fail because the agent logic is wrong; it fails because the test setup is fighting the framework's defaults.

PydanticAI doesn't eliminate the need to write tests. It gives you a seam that was designed for swapping. That's the difference between "we test our agents in CI" and "we test our agents in CI reliably."

Error Handling and Retries in Practice

When the model returns garbage, what happens next? In production, "garbage" isn't always obvious - it's a well-formed JSON object with total_revenue: "approximately high" or a region name the CSV doesn't contain. The framework should catch that and recover, not pass it to your API.

PydanticAI: validation failures feed back to the model

When output_type is a Pydantic model, schema violations don't reach your application code. PydanticAI sends the validation error back to the model and retries:

from pydantic_ai import Agent, ModelRetry, RunContext

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=SalesAnswer,
    retries=3,  # attempts allowed after a failed validation
)

@agent.output_validator
def must_have_revenue(ctx: RunContext[None], output: SalesAnswer) -> SalesAnswer:
    # Business rule the schema can't express: revenue must be positive.
    if output.total_revenue <= 0:
        raise ModelRetry("Revenue must be positive. Call query_sales_csv and retry.")
    return output

retries=3 covers structural failures - wrong types, missing fields, malformed JSON. The @agent.output_validator covers business rules the schema can't express. Both paths end the same way: the error message goes back to the model as context for another attempt. Schema violations trigger that automatically; in a validator you trigger it by raising ModelRetry. If retries are exhausted, you get an explicit exception (UnexpectedModelBehavior) - not a silent bad record in your database.
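Stripped of the framework, the loop described above looks roughly like the sketch below. It illustrates the idea only and is not PydanticAI's implementation. RetryPrompt, run_with_retries, and the scripted fake_model are hypothetical names, and the attempt counting (one initial call plus the given number of retries) is this sketch's own choice.

```python
from typing import Callable


class RetryPrompt(Exception):
    """Raised by a validator to ask the model for another attempt."""


def run_with_retries(
    model: Callable[[list[str]], dict],
    validate: Callable[[dict], dict],
    prompt: str,
    retries: int = 3,
) -> dict:
    """Call the model, feeding each validation error back as extra context."""
    messages = [prompt]
    for _ in range(retries + 1):  # one initial attempt plus `retries` more
        output = model(messages)
        try:
            return validate(output)
        except RetryPrompt as error:
            messages.append(f"Validation failed: {error}")
    raise RuntimeError(f"Output still invalid after {retries} retries")


def must_have_revenue(output: dict) -> dict:
    """Structural check first, then the business rule."""
    if not isinstance(output.get("total_revenue"), (int, float)):
        raise RetryPrompt("total_revenue must be a number")
    if output["total_revenue"] <= 0:
        raise RetryPrompt("Revenue must be positive")
    return output


# Scripted stand-in for an LLM: a wrong type first, then a valid answer.
scripted = iter([
    {"region": "Europe", "total_revenue": "approximately high"},
    {"region": "Europe", "total_revenue": 1_250_000.0},
])
seen_messages: list[list[str]] = []


def fake_model(messages: list[str]) -> dict:
    seen_messages.append(list(messages))  # record what the model was shown
    return next(scripted)


answer = run_with_retries(fake_model, must_have_revenue, "What was Q1 revenue in Europe?")
```

The second call sees the validation error in its message history, which is what gives the model a chance to correct itself rather than repeat the same mistake.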

LangChain: you assemble the loop

LangChain can validate structured output via response_format, but retry orchestration is still your code. Schema errors, business rules, retry limits, and message history - you wire it together:

from pydantic import ValidationError
from langchain.agents import create_agent

MAX_RETRIES = 3

agent = create_agent(
    "anthropic:claude-sonnet-4-6",
    tools=[query_sales_csv],
    response_format=SalesAnswer,
)

messages = [{"role": "user", "content": "What was Q1 revenue in Europe?"}]
for attempt in range(MAX_RETRIES):
    result = agent.invoke({"messages": messages})
    try:
        answer = SalesAnswer.model_validate(result["structured_response"])
        if answer.total_revenue <= 0:
            raise ValueError("Revenue must be positive")
        break
    except (ValidationError, ValueError) as e:
        messages = result["messages"] + [
            {"role": "user", "content": f"Validation failed: {e}. Try again."}
        ]
else:
    raise RuntimeError("Max retries exceeded")

LangChain also offers ToolStrategy with a handle_errors parameter for schema-level retries - closer to PydanticAI's defaults. But business-rule validation like total_revenue > 0 still lands in your loop. And teams that skip response_format write even more of this by hand: parse JSON from message text, catch OutputParserException, append errors, track attempts.

The operational difference: PydanticAI treats "output didn't validate" as a normal agent loop event. LangChain treats it as an exception you handle - if you remembered to write the handler.

Testing Without Hitting the API

In the previous sections, you saw how to test agent logic. This section is about testing without paying for it. Every CI run that calls a real LLM costs money, adds latency, and flakes when the model paraphrases. Most teams know this; fewer build around it.

PydanticAI ships test doubles for the model itself. TestModel replaces the LLM with deterministic Python: it calls your tools, generates schema-valid output, and never hits the network. FunctionModel goes further - you write the mock responses in plain Python when you need specific behavior.

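from pydantic_ai.models.test import TestModel

def test_sales_agent_returns_validated_output():
    # TestModel fills fields with placeholder values unless told otherwise,
    # so pin the structured output to make the revenue assertion meaningful.
    fake_llm = TestModel(
        custom_output_args={"region": "Europe", "total_revenue": 1_250_000.0},
    )
    with agent.override(model=fake_llm):
        result = agent.run_sync(
            "What was Q1 revenue in Europe?",
            deps=FakeSalesData(revenue=1_250_000),
        )
    assert isinstance(result.output, SalesAnswer)
    assert result.output.total_revenue > 0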

agent.override(model=TestModel()) swaps the model at the agent boundary - the same boundary where you pass deps= in previous sections. Your application code doesn't change; your test doesn't need an API key.

LangChain: script every model turn

LangChain's equivalent is GenericFakeChatModel - you pass an iterator of scripted AIMessage responses, one per model invocation in the agent loop:

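from unittest.mock import patch

from langchain.agents import create_agent
from langchain_core.language_models.fake_chat_models import GenericFakeChatModel
from langchain_core.messages import AIMessage, ToolCall

# Caveat (version-dependent): create_agent binds tools to the model, and a fake
# without bind_tools support may need a thin subclass that implements it.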

fake_model = GenericFakeChatModel(messages=iter([
    AIMessage(content="", tool_calls=[
        ToolCall(name="query_sales_csv", args={"region": "Europe"}, id="call_1"),
    ]),
    AIMessage(content='{"region": "Europe", "total_revenue": 1250000.0}'),
]))

with patch("myapp.sales_agent.data_source", FakeSalesData(revenue=1_250_000)):
    test_agent = create_agent(
        fake_model, tools=[query_sales_csv], response_format=SalesAnswer,
    )
    result = test_agent.invoke({"messages": [{"role": "user", "content": "Q1 Europe?"}]})
    assert SalesAnswer.model_validate(result["structured_response"]).total_revenue > 0

Here you're scripting each turn: first a tool call, then a JSON payload. Add a retry path or a second tool and you extend the iterator. Miss a turn and the test fails with an opaque StopIteration.
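One way to soften the opaque-failure problem is to wrap the script in a fake that names the cause when it runs dry. The sketch below is framework-free; ScriptedModel and ScriptExhausted are hypothetical names, not LangChain classes, and the string turns stand in for real message objects.

```python
from typing import Iterable


class ScriptExhausted(AssertionError):
    """Raised when the agent loop asks for more turns than the test scripted."""


class ScriptedModel:
    """Hypothetical fake: replays canned responses, one per call."""

    def __init__(self, turns: Iterable[str]) -> None:
        self._turns = iter(turns)
        self.calls = 0

    def invoke(self, prompt: str) -> str:
        self.calls += 1
        try:
            return next(self._turns)
        except StopIteration:
            # Replace the bare StopIteration with a message that names the problem.
            raise ScriptExhausted(
                f"Model was called {self.calls} times but fewer turns were scripted"
            ) from None


model = ScriptedModel(["<tool call>", '{"region": "Europe", "total_revenue": 1250000.0}'])
first = model.invoke("Q1 Europe?")
second = model.invoke("tool result: 1250000.0")
```

A third invoke call here raises ScriptExhausted with the call count in the message, so a missing turn reads as a test-setup problem rather than an agent bug.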

For teams running agent tests on every PR, that's the single biggest cost-saving win I see when moving to PydanticAI: tests on every commit, no API bill, assertions on typed output.

Where LangChain Still Wins

This article argues for PydanticAI on production grounds. That case is weaker if you're not heading to production yet - and LangChain earns its popularity for good reasons.

Integration breadth: LangChain connects to more data sources, vector stores, and model providers out of the box. If you need a connector that PydanticAI doesn't ship yet - a niche CRM, an internal protobuf service, a legacy search index - LangChain's community has probably already built it.

Prototyping speed: Pre-built chains, LangGraph templates, and copy-paste tutorials get a demo in front of stakeholders fast. For a two-week proof of concept where the goal is "show the CEO something that talks", that velocity matters more than typed outputs.

Ecosystem maturity: More Stack Overflow answers, more LangSmith integrations, more hiring-market familiarity. If your team already knows LangChain and the project might not ship, switching frameworks adds cost with no payoff.

None of this makes LangChain the wrong choice for production - teams ship reliable LangChain agents every day. But they do it by adding the validation, testing, and error-handling layers this article shows PydanticAI includes by default. If you're building those layers yourself anyway, the framework choice matters less.

Choose LangChain when speed and integrations beat type safety. Choose PydanticAI when you're done demoing and need the agent to run without you in the room.

Decision Checklist

Bring this to your next architecture review. Seven yes/no questions - answer for the project you're actually building, not the demo you already shipped:

Do you need typed outputs your backend can trust without re-validation?
Will this agent run unattended in production?
Do you need to unit-test agent logic in CI without API calls?
When the model returns malformed output, should the framework retry automatically?
Do you need to trace which document or tool call produced a given answer?
Do you need static analysis to catch breaking changes to your output schema before deploy?
Do you need a niche integration that only exists in LangChain's ecosystem today?

Scoring: Yes on 1–3 but no on 4–6 → you're heading to production but haven't built the hard parts yet. Either framework works, but budget time for the gaps this article mapped in the sections on validated outputs, dependency injection, retries, and testing. Otherwise, three or more yes on questions 1–6 → lean PydanticAI. Mostly no, or yes only on question 7 → LangChain is fine for now.
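If you want the scoring to be unambiguous in a review, it fits in a few lines. The sketch below is one reading of the rules: it checks the "yes on 1-3, no on 4-6" case first, since that case would otherwise also count as three yes answers. The function name recommend and the returned strings are illustrative.

```python
def recommend(answers: list[bool]) -> str:
    """Map seven yes/no answers (questions 1-7, in order) to a recommendation."""
    if len(answers) != 7:
        raise ValueError("Expected exactly seven answers")
    production = answers[:6]  # questions 1-6; question 7 only matters as a tiebreaker
    # Special case first: committed to production (1-3) but hard parts unbuilt (4-6).
    if all(production[:3]) and not any(production[3:]):
        return "either framework; budget time for the gaps"
    if sum(production) >= 3:
        return "lean PydanticAI"
    return "LangChain is fine for now"
```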

For a full assessment across your codebase - not just the framework choice - see the Production Readiness Audit. Same categories, applied to what you've actually shipped.

Conclusion

If the checklist pointed you toward PydanticAI but you're already on LangChain, you don't need a rewrite.

Your tools - the functions that query databases, search knowledge bases, and call internal APIs - are plain Python. Wrap them as PydanticAI @agent.tool handlers and migrate one agent at a time. Run both frameworks side by side during the transition; retire LangChain paths as each agent passes the production tests you couldn't write before.

The framework decision is one input. The harder question is whether your agent stack is ready for what happens after the demo - the invented menu options, the silent validation gaps, the CI runs that still hit the API.

Book a 20-minute intro call and tell me what you're building. Or jump straight to an Architecture Call or an AI Code Health Check.
