Trace-to-Fix: how are you actually improving RAG/agents after observability flags issues?
2026-04-09 · via LangChain Forum - Latest posts

@kamran-rapidfireAI This is just my mental model. Opinions on evaluation vary a lot, and I rely heavily on LangSmith for it.

My Mental Model: Trace → Dataset → Evaluator → Experiment → Regression


Step 1: Triage the Trace, Classify Why, Not Just That

When I see a bad trace, I pin down the exact failure category before anything else:

  • Retrieval miss → check the retriever span: right docs returned? What were the scores?
  • Bad tool call → check the tool span: what args did the model pass? Was the schema ambiguous?
  • Citation failure → check the final LLM span: did it have the right context but ignore it?
  • Chunking/reranking off → what chunks arrived vs. what actually made it into the response?

I tag the trace immediately (retrieval-miss, bad-tool-args) so I can filter and group later.
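The triage rules above could be turned into a small rule-based tagger. This is only a sketch: the trace is modelled as a plain dict, and every key in it (`retriever`, `tool`, `llm`, `expected_doc_ids`, and so on) is a hypothetical simplification rather than the real LangSmith run structure. The `citation-failure` tag is likewise an assumed name in the same style as the two tags mentioned above.

```python
from typing import Optional


def classify_failure(trace: dict) -> Optional[str]:
    """Return a failure tag for a bad trace, or None if no rule matches.

    The dict layout is an assumption made for illustration only.
    """
    retriever = trace.get("retriever", {})
    tool = trace.get("tool", {})
    llm = trace.get("llm", {})

    # Retrieval miss: none of the documents we needed came back.
    expected = set(trace.get("expected_doc_ids", []))
    returned = set(retriever.get("doc_ids", []))
    if expected and not (expected & returned):
        return "retrieval-miss"

    # Bad tool call: the model left out arguments the schema requires.
    required = set(tool.get("required_args", []))
    passed = set(tool.get("args", {}))
    if required - passed:
        return "bad-tool-args"

    # Citation failure: sources were available but the answer cites none.
    answer = llm.get("answer", "")
    sources = llm.get("sources", [])
    if sources and not any(source in answer for source in sources):
        return "citation-failure"

    return None
```

Rules are checked in pipeline order (retrieval, then tool call, then final answer), so a trace gets the tag of the earliest stage that went wrong.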

Observability concepts & spans · Add metadata & tags to traces


Step 2: Build a Failure Dataset — Turn Anecdotes Into Benchmarks

Once I have 5–10 traces with the same failure, I select them in LangSmith → “Add to Dataset” → name it rag-citation-failures-march-2026.

Each row carries: original input, bad output, and expected output (if I know it). This is my ground truth — inputs that should break my current system, so I can measure when I’ve fixed it.
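As a rough illustration of what one row might hold, the sketch below shapes a failure case into `inputs` / `outputs` / `metadata` dicts. The `FailureCase` class, its field names, and the metadata keys are all hypothetical; uploading the rows to a dataset is left out because it depends on the SDK version in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureCase:
    """One bad trace, reduced to what the benchmark needs (hypothetical shape)."""
    question: str
    bad_answer: str
    expected_answer: Optional[str] = None
    tag: str = "untagged"


def to_example(case: FailureCase) -> dict:
    """Build a dataset row; 'outputs' is only set when the expected answer is known."""
    example = {
        "inputs": {"question": case.question},
        "metadata": {"failure_tag": case.tag, "bad_output": case.bad_answer},
    }
    if case.expected_answer is not None:
        example["outputs"] = {"answer": case.expected_answer}
    return example


cases = [
    FailureCase("What is the refund window?", "90 days", "30 days [policy.md]", "citation-failure"),
    FailureCase("Who approves expenses?", "Nobody", tag="retrieval-miss"),
]
examples = [to_example(case) for case in cases]
```

Keeping the bad output in metadata rather than in `outputs` means the reference answer slot is reserved for what the system should have said.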

Create datasets from traces (UI) · Manage datasets programmatically · Evaluation concepts


Step 3: Write Targeted Evaluators — Make the Failure Measurable

I don’t use generic evaluators. I write ones that directly test the failure I saw (these are just examples):

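# score_relevance and validate_against_schema are not defined in this post;
# they are placeholders for helpers you supply yourself.

# Retrieval relevance: how well the retrieved context matches the question
def retrieval_relevance(run, example):
    score = score_relevance(run.outputs["context"], example.inputs["question"])
    return {"key": "retrieval_relevance", "score": score}

# Citation enforcement: 1 if the answer mentions at least one returned source
def citation_check(run, example):
    cited = any(s in run.outputs["answer"] for s in run.outputs["sources"])
    return {"key": "has_citation", "score": int(cited)}

# Tool call validity: 1 if the tool arguments match the schema stored on the example
def tool_args_valid(run, example):
    valid = validate_against_schema(run.outputs["tool_input"], example.metadata["expected_schema"])
    return {"key": "tool_args_valid", "score": int(valid)}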

Now I have a number, not a feeling.
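The evaluators above call `score_relevance` and `validate_against_schema` without defining them. One possible stand-in for each is sketched below, purely as an assumption: relevance is approximated by token overlap (a real setup might use embeddings or an LLM judge), and the schema is assumed to be a flat mapping of argument name to a type name such as `"str"` or `"int"`.

```python
import re

# Assumed mapping from type names (as they might be stored in example
# metadata) to Python types. Note that bool is a subclass of int in Python.
_TYPES = {"str": str, "int": int, "float": float, "bool": bool}


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_relevance(context, question: str) -> float:
    """Fraction of question tokens that also appear in the retrieved context."""
    if isinstance(context, (list, tuple)):
        context = " ".join(context)
    question_tokens = _tokens(question)
    if not question_tokens:
        return 0.0
    return len(question_tokens & _tokens(context)) / len(question_tokens)


def validate_against_schema(tool_input: dict, expected_schema: dict) -> bool:
    """True if tool_input has exactly the expected keys with the expected types."""
    if set(tool_input) != set(expected_schema):
        return False
    return all(
        isinstance(tool_input[name], _TYPES[type_name])
        for name, type_name in expected_schema.items()
    )
```

Token overlap is crude, but it is deterministic and cheap, which makes score changes between experiments easy to attribute.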

How to define a code evaluator (SDK) · How to define an LLM-as-judge evaluator · Return multiple scores in one evaluator


Step 4: Run Experiments: One Variable at a Time

I use evaluate() and change one thing per run. No shotgun sweeps.

from langsmith import evaluate

evaluate(
    chain_with_chunk_512,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-512-baseline",
)

evaluate(
    chain_with_chunk_256,
    data="rag-citation-failures-march-2026",
    evaluators=[retrieval_relevance, citation_check],
    experiment_prefix="chunk-256-experiment",
)

LangSmith shows both side-by-side with per-example score deltas — I can see exactly which failure cases got fixed, and which didn’t.

I iterate the same way for: top_k, reranker on/off, prompt wording, tool schema rewrites.
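To keep each run to a single change, the experiment configs can be generated from a baseline rather than written by hand. The sketch below is one way to do that; the `BASELINE` values for `top_k` and `reranker`, and the candidate values tried, are made up for illustration (only the 512 and 256 chunk sizes come from the runs above).

```python
BASELINE = {"chunk_size": 512, "top_k": 5, "reranker": True}


def single_variable_variants(baseline: dict, candidates: dict) -> list:
    """Return (experiment_prefix, config) pairs, each differing from the baseline in one key."""
    variants = []
    for key, values in candidates.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to the baseline, nothing to test
            config = {**baseline, key: value}
            variants.append((f"{key}-{value}", config))
    return variants


variants = single_variable_variants(
    BASELINE,
    {"chunk_size": [256, 512], "top_k": [3, 10], "reranker": [False]},
)
for prefix, config in variants:
    print(prefix, config)
```

Each pair could then drive one evaluate() call, with the prefix passed as `experiment_prefix` and the config used to build the chain under test.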

How to evaluate an LLM application · Compare experiment results · Analyze an experiment


Step 5: A/B in Production: Validate on Real Traffic

Once a config wins offline, I shadow-test ~15% of real traffic before full rollout. Both versions log to LangSmith with metadata={"variant": "v1"} / {"variant": "v2"}. After a day, I filter by variant and compare evaluator scores. This catches distribution shift my failure dataset didn’t cover.
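One way to pick the ~15% slice and compare the two variants afterwards is sketched below. It assumes each request has a stable string id and that evaluator scores have been exported as rows with `variant` and `score` keys; both the function names and that row shape are hypothetical.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def assign_variant(request_id: str, v2_share: float = 0.15) -> str:
    """Deterministically bucket a request, so the same id always gets the same variant."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "v2" if bucket < v2_share else "v1"


def mean_score_by_variant(rows: list) -> dict:
    """Average evaluator score per variant from rows like {'variant': 'v1', 'score': 1}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["variant"]].append(row["score"])
    return {variant: mean(scores) for variant, scores in grouped.items()}
```

Hashing the id instead of calling a random number generator keeps a given user or session on one variant, which avoids mixing both configs inside a single conversation.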

Filter traces in the application · Set up LLM-as-a-judge online evaluators · Set up automation rules


Step 6: Lock It as a Regression Test: Never Regress Silently (again, just an example)

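# tests/test_regressions.py
from langsmith import evaluate

# production_chain and citation_check are assumed to be imported from your app code.
def test_no_citation_regressions():
    results = evaluate(
        production_chain,
        data="rag-citation-failures-march-2026",
        evaluators=[citation_check],
    )
    # In recent langsmith SDK versions, to_pandas() prefixes evaluator
    # scores with "feedback."; check the column names in your version.
    avg_score = results.to_pandas()["feedback.has_citation"].mean()
    assert avg_score >= 0.90, f"Citation quality dropped: {avg_score:.2f}"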

Every future PR runs against my known hard cases automatically.
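When the threshold check fails, the average alone does not say which hard cases broke. A small helper along the lines below could sit next to the test; it is a sketch that assumes per-example scores are available as a plain dict of example id to a 0/1 score, which is a hypothetical export format.

```python
def regression_report(scores: dict, threshold: float = 0.90) -> dict:
    """Summarise per-example 0/1 scores: average, pass/fail, and failing ids."""
    if not scores:
        raise ValueError("no scores to check")
    avg = sum(scores.values()) / len(scores)
    failing = sorted(example_id for example_id, score in scores.items() if score < 1)
    return {"avg": avg, "passed": avg >= threshold, "failing": failing}


report = regression_report({"ex-1": 1, "ex-2": 1, "ex-3": 0})
print(report["passed"], report["failing"])
```

Printing the failing ids in the CI log gives a direct pointer back to the dataset rows that need another look.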

How to run evaluations with pytest · CI/CD pipeline example


The Summary Table

| Stage | What I'm doing | LangSmith feature |
|---|---|---|
| Trace triage | Classify why it failed | Spans, metadata tags |
| Dataset | Turn failure cases into a benchmark | Datasets |
| Evaluators | Make the failure mode measurable | Code / LLM-as-judge evaluators |
| Experiments | Test one hypothesis at a time | evaluate(), experiment comparison |
| A/B | Validate on real traffic | Metadata filtering, online eval |
| Regression suite | Prevent silent regressions | pytest + CI/CD integration |

The core principle I adhere to: a trace you close without adding to a dataset is a learning opportunity permanently lost. The whole loop only works when failures become data points.