Trace-to-Fix: how are you actually improving RAG/agents after observability flags issues?

LangChain Forum - Latest posts


2026-04-09 · via LangChain Forum - Latest posts

@kamran-rapidfireAI This is just my mental model, since opinions on evaluation vary a lot. I lean heavily on LangSmith for evaluations.

My Mental Model: Trace → Dataset → Evaluator → Experiment → Regression

Step 1: Triage the Trace, Classify Why, Not Just That

When I see a bad trace, I pin down the exact failure category before anything else:

Retrieval miss → check the retriever span: right docs returned? What were the scores?
Bad tool call → check the tool span: what args did the model pass? Was the schema ambiguous?
Citation failure → check the final LLM span: did it have the right context but ignore it?
Chunking/reranking off → what chunks arrived vs. what actually made it into the response?

I tag the trace immediately (retrieval-miss, bad-tool-args) so I can filter and group later.
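
The triage rules above can be sketched as a small classifier. This is a minimal illustration, assuming a simplified trace dict: the field names (retrieved_ids, expected_ids, tool_args_ok, answer, sources) and the citation-failure tag are hypothetical, not a LangSmith schema.

```python
from typing import Optional


def classify_failure(trace: dict) -> Optional[str]:
    """Return a failure tag for a simplified trace dict, or None if no rule matches.

    The dict layout is hypothetical; adapt it to whatever you export from your traces.
    """
    retrieved = set(trace.get("retrieved_ids", []))
    expected = set(trace.get("expected_ids", []))
    # Retrieval miss: none of the documents we expected came back.
    if expected and not (retrieved & expected):
        return "retrieval-miss"
    # Bad tool call: the arguments were flagged as invalid upstream.
    if trace.get("tool_args_ok") is False:
        return "bad-tool-args"
    # Citation failure: sources were available but the answer mentions none of them.
    sources = trace.get("sources", [])
    if sources and not any(s in trace.get("answer", "") for s in sources):
        return "citation-failure"
    return None
```

The rules run in pipeline order (retrieval, then tools, then generation), so a trace gets the tag of its earliest failure.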

Observability concepts & spans · Add metadata & tags to traces

Step 2: Build a Failure Dataset — Turn Anecdotes Into Benchmarks

Once I have 5–10 traces with the same failure, I select them in LangSmith → “Add to Dataset” → name it rag-citation-failures-march-2026.

Each row carries: original input, bad output, and expected output (if I know it). This is my ground truth — inputs that should break my current system, so I can measure when I’ve fixed it.
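
As a rough sketch of that row shape, assuming traces have already been exported as plain dicts with hypothetical keys (tags, inputs, outputs, expected_output), the grouping could look like this. It only builds the rows; uploading them to a dataset is left to the UI or the SDK.

```python
def to_dataset_rows(traces: list, failure_tag: str) -> list:
    """Turn tagged failing traces into dataset rows.

    Each row keeps the original input, the bad output, and the expected
    output when it is known (otherwise None).
    """
    rows = []
    for trace in traces:
        if failure_tag not in trace.get("tags", []):
            continue  # only keep traces that share this failure category
        rows.append(
            {
                "inputs": trace["inputs"],
                "outputs": {"expected": trace.get("expected_output")},
                "metadata": {"bad_output": trace["outputs"], "failure_tag": failure_tag},
            }
        )
    return rows
```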

Create datasets from traces (UI) · Manage datasets programmatically · Evaluation concepts

Step 3: Write Targeted Evaluators — Make the Failure Measurable

I don’t use generic evaluators. I write ones that directly test the failure I saw (these are just examples):

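# Retrieval relevance: how well does the retrieved context match the question?
# score_relevance is a placeholder for your own scoring helper.
def retrieval_relevance(run, example):
    score = score_relevance(run.outputs["context"], example.inputs["question"])
    return {"key": "retrieval_relevance", "score": score}

# Citation enforcement: does the answer mention at least one returned source?
def citation_check(run, example):
    cited = any(s in run.outputs["answer"] for s in run.outputs["sources"])
    return {"key": "has_citation", "score": int(cited)}

# Tool call validity: do the tool arguments match the schema stored on the example?
# validate_against_schema is a placeholder for your own validator.
def tool_args_valid(run, example):
    valid = validate_against_schema(run.outputs["tool_input"], example.metadata["expected_schema"])
    return {"key": "tool_args_valid", "score": int(valid)}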

Now I have a number, not a feeling.
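
The post does not define score_relevance or validate_against_schema. One possible sketch of each, assuming the context is a single string and the expected schema is a simple mapping from argument name to a Python type name (both are assumptions, and a real setup would more likely use embeddings or an LLM judge for relevance and JSON Schema for validation):

```python
import re


def score_relevance(context: str, question: str) -> float:
    """Fraction of distinct question terms that also appear in the context (0.0 to 1.0)."""
    q_terms = set(re.findall(r"\w+", question.lower()))
    if not q_terms:
        return 0.0
    c_terms = set(re.findall(r"\w+", context.lower()))
    return len(q_terms & c_terms) / len(q_terms)


def validate_against_schema(tool_input: dict, expected_schema: dict) -> bool:
    """Check that tool_input has exactly the expected keys with the expected types.

    expected_schema maps argument name -> type name, e.g. {"city": "str"}.
    """
    types = {"str": str, "int": int, "float": float, "bool": bool, "list": list, "dict": dict}
    if set(tool_input) != set(expected_schema):
        return False  # missing or unexpected arguments
    return all(
        isinstance(tool_input[name], types.get(type_name, object))
        for name, type_name in expected_schema.items()
    )
```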

How to define a code evaluator (SDK) · How to define an LLM-as-judge evaluator · Return multiple scores in one evaluator

Step 4: Run Experiments: One Variable at a Time

I use evaluate() and change one thing per run. No shotgun sweeps.

from langsmith import evaluate

evaluate(
    chain_with_chunk_512,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-512-baseline",
)

evaluate(
    chain_with_chunk_256,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-256-experiment",
)

LangSmith shows both side-by-side with per-example score deltas — I can see exactly which failure cases got fixed, and which didn’t.
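
The same per-example comparison can be reproduced offline. A minimal sketch, assuming scores have been exported as dicts mapping example ID to score (the export step and the 0.5 pass threshold are assumptions):

```python
def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Map example ID -> candidate score minus baseline score, for shared examples."""
    return {ex: candidate[ex] - baseline[ex] for ex in baseline if ex in candidate}


def fixed_and_regressed(baseline: dict, candidate: dict, threshold: float = 0.5):
    """Split shared examples into those that crossed the pass threshold in each direction."""
    shared = [ex for ex in baseline if ex in candidate]
    # Fixed: failed under the baseline, passes under the candidate.
    fixed = sorted(ex for ex in shared if baseline[ex] < threshold <= candidate[ex])
    # Regressed: passed under the baseline, fails under the candidate.
    regressed = sorted(ex for ex in shared if candidate[ex] < threshold <= baseline[ex])
    return fixed, regressed
```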

I iterate the same way for: top_k, reranker on/off, prompt wording, tool schema rewrites.
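
One way to enforce the one-variable rule is to generate every variant from a shared baseline config. A small sketch, with hypothetical config keys and naming scheme:

```python
def single_variable_variants(baseline: dict, sweeps: dict) -> dict:
    """Build named configs that each differ from the baseline in exactly one key.

    sweeps maps a config key to the candidate values to try for it.
    """
    variants = {}
    for key, values in sweeps.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to the baseline, nothing to test
            variants[f"{key}-{value}"] = {**baseline, key: value}
    return variants
```

For example, a baseline of {"chunk_size": 512, "top_k": 4} with sweeps {"chunk_size": [256, 512], "top_k": [8]} yields two variants, chunk_size-256 and top_k-8. Those names can double as experiment prefixes.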

How to evaluate an LLM application · Compare experiment results · Analyze an experiment

Step 5: A/B in Production: Validate on Real Traffic

Once a config wins offline, I shadow-test ~15% of real traffic before full rollout. Both versions log to LangSmith with metadata={"variant": "v1"} / {"variant": "v2"}. After a day, I filter by variant and compare evaluator scores. This catches distribution shift my failure dataset didn’t cover.
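
Whichever way the split is run, each request needs a stable variant assignment so the variant metadata stays consistent across retries. A minimal sketch using hash bucketing. The request ID as the bucketing key is an assumption, and the 15% default mirrors the figure above:

```python
import hashlib


def assign_variant(request_id: str, candidate_share: float = 0.15) -> str:
    """Deterministically assign a request to "v1" (current) or "v2" (candidate)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "v2" if bucket < candidate_share else "v1"
```

The returned label can be passed as the variant value in the trace metadata, so filtering by variant later needs no extra bookkeeping.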

Filter traces in the application · Set up LLM-as-a-judge online evaluators · Set up automation rules

Step 6: Lock It as a Regression Test: Never Regress Silently (again, just an example)

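# tests/test_regressions.py
from langsmith import evaluate

# Hypothetical module: import the chain and evaluator from wherever you define them.
from my_app import citation_check, production_chain

def test_no_citation_regressions():
    results = evaluate(
        production_chain,
        data="rag-citation-failures-march-2026",
        evaluators=[citation_check],
    )
    # In recent SDK versions to_pandas() prefixes evaluator keys with "feedback.";
    # check the column names in your version.
    avg_score = results.to_pandas()["feedback.has_citation"].mean()
    assert avg_score >= 0.90, f"Citation quality dropped: {avg_score:.2f}"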

Every future PR runs against my known hard cases automatically.

How to run evaluations with pytest · CI/CD pipeline example

The Summary Table

Stage	What I’m doing	LangSmith feature
Trace triage	Classify why it failed	Spans, metadata tags
Dataset	Turn failure cases into a benchmark	Datasets
Evaluators	Make the failure mode measurable	Code / LLM-as-judge evaluators
Experiments	Test one hypothesis at a time	`evaluate()`, experiment comparison
A/B	Validate on real traffic	Metadata filtering, online eval
Regression suite	Prevent silent regressions	pytest + CI/CD integration

The core principle I adhere to: a trace you close without adding to a dataset is a learning opportunity permanently lost. The whole loop only works when failures become data points.
