What to Monitor in an AI Agent Before (and After) Launch
SapotaCorp · 2026-05-24 · via DEV Community

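Six months after launching their AI agent, a founder I work with sent me a text: "Something is wrong. Our bill tripled this month, response times keep getting worse, and we honestly don't know which part of the system is the problem. Help."

The system was a multi-agent customer support architecture with three agents and four tools. The team had built most of it well: solid prompts, coherent tool integration, and a pre-launch evaluation that passed. One thing was missing: observability. Six weeks of production traffic had passed through the system with no traces, no per-step latency tracking, and no cost attribution.

Diagnosing the problem took ten days of work, because the instrumentation that should have been in place from day one had to be retrofitted. The fix itself took an afternoon. By the time we found the actual cause (one agent looping frequently after a vendor model update), the team had burned thousands of dollars on wasted LLM calls and lost weeks of customer trust.

This is the most avoidable class of agent failure. A minimal observability stack costs almost nothing to install before launch, and it saves weeks of debugging when something changes. This is what Sapota ships with every production agent.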

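The three observability failures

Agents fail in production in three structurally different ways. None of them can be detected without instrumentation.

Silent quality degradation. The agent's responses get worse over time. The corpus drifts, the model provider updates its weights, prompt templates get edited, and the distribution of user queries shifts. Quality erodes by a small fraction each week, so insiders only notice after two months, by which time customer-reported errors have spiked and trust is gone.

Cost spikes. Production costs run 3x to 10x above what testing predicted. The cause may be that production queries are more complex than test queries, that specific users are abusing the system, or that one agent in a multi-agent setup loops frequently. Without cost tracking per request and per agent, the cause is hard to pin down.

Cascading failures. Tool A starts timing out. Agent B retries it. Agent C waits on Agent B's result. The whole system slows to a crawl, yet no single component raises an alarm. p95 latency triples while the average still looks fine. Without traces across the agent system, the failure is invisible.

Each of these failure modes has cost real systems real money in audits we have run. None was inevitable.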

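Layer 1: Tracing

The minimal tracing setup captures, for every request:

  • Request ID, propagated through every step
  • User ID, for cost attribution and abuse detection
  • Inputs and outputs of each LLM call (input tokens, output tokens, latency, cost)
  • Inputs, outputs, latency, and errors of each tool call
  • Exit reason: success, max iterations reached, timeout, or abort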
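As one way to picture the per-request record listed above, here is a minimal Python sketch that uses only the standard library. The class names, field names, and the price table are assumptions made for illustration; they are not the schema of Langfuse, Opik, LangSmith, or Helicone, each of which defines its own.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical prices in USD per million tokens; replace with your provider's real rates.
PRICE_PER_MTOK = {"example-model": {"input": 1.00, "output": 4.00}}


@dataclass
class LLMCall:
    model: str
    prompt: str
    completion: str
    input_tokens: int
    output_tokens: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        price = PRICE_PER_MTOK[self.model]
        total = self.input_tokens * price["input"] + self.output_tokens * price["output"]
        return total / 1_000_000


@dataclass
class ToolCall:
    tool: str
    tool_input: str
    tool_output: str
    latency_ms: float
    error: Optional[str] = None  # None means the call succeeded


@dataclass
class RequestTrace:
    user_id: str
    # One ID per request, passed along to every step that handles it.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    llm_calls: List[LLMCall] = field(default_factory=list)
    tool_calls: List[ToolCall] = field(default_factory=list)
    exit_reason: Optional[str] = None  # "success", "max_iterations", "timeout", or "abort"

    @property
    def total_cost_usd(self) -> float:
        return sum(call.cost_usd for call in self.llm_calls)


trace = RequestTrace(user_id="user-123")
trace.llm_calls.append(
    LLMCall("example-model", "Where is my order?", "Let me check.", 1200, 300, 850.0)
)
trace.tool_calls.append(ToolCall("lookup_order", "order 42", "shipped", 120.0))
trace.exit_reason = "success"
print(trace.request_id, trace.exit_reason, round(trace.total_cost_usd, 6))
```

Keeping cost as a derived property of token counts means the price table can change without rewriting stored traces.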

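The implementation is not complicated. Langfuse, Opik, LangSmith, and Helicone all wrap your LLM client in about ten lines of code. The marginal overhead is roughly 5% added latency, plus storage costs of about 10% of your LLM spend. Both are small enough to ignore.

We mostly default to self-hosted Langfuse: it is open source, has no per-trace pricing, and runs on a single small VM. If you are already in the LangChain ecosystem, LangSmith works too. If you want the simplest setup, Opik is a good choice. Pick one and install it before launch.

The practical payoff: when something breaks, the trace shows where. The founder's looping agent showed up in the traces as "Agent B exit reason: max iterations", four times more often than baseline. Without traces, that pattern is hard to spot.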
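The kind of comparison that surfaces such a pattern might look like the sketch below. It assumes exit reasons have already been exported from the trace store as (agent, exit_reason) pairs; the function names, agent names, and counts are hypothetical.

```python
from collections import Counter


def exit_reason_rates(exits):
    """Map (agent, exit_reason) to that reason's share of the agent's exits."""
    exits = list(exits)
    per_agent = Counter(agent for agent, _ in exits)
    per_pair = Counter(exits)
    return {pair: count / per_agent[pair[0]] for pair, count in per_pair.items()}


def regressions(baseline, current, min_ratio=2.0):
    """List non-success exit reasons whose share grew by at least min_ratio."""
    found = []
    for (agent, reason), rate in current.items():
        if reason == "success":
            continue
        base = baseline.get((agent, reason), 0.0)
        if base == 0.0:
            # A failure mode that never appeared in the baseline window.
            found.append((agent, reason, float("inf")))
        elif rate / base >= min_ratio:
            found.append((agent, reason, round(rate / base, 2)))
    return sorted(found)


# Illustrative counts only: agent_b hits the iteration cap 5% of the time
# in the baseline window and 20% of the time in the current one.
baseline_exits = (
    [("agent_a", "success")] * 100
    + [("agent_b", "success")] * 95
    + [("agent_b", "max_iterations")] * 5
)
current_exits = (
    [("agent_a", "success")] * 100
    + [("agent_b", "success")] * 80
    + [("agent_b", "max_iterations")] * 20
)

report = regressions(exit_reason_rates(baseline_exits), exit_reason_rates(current_exits))
print(report)
```

Comparing shares rather than raw counts keeps the check meaningful when traffic volume changes between the two windows.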

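Layer 2: Metrics

Aggregated dashboards built on top of the traces. The metrics that matter for production agents:

Quality metrics:

  • Task success rate (the share of requests where the agent reaches the "success" exit rather than max iterations or abort)
  • Faithfulness score (if you have a faithfulness gate, track its output)
  • User satisfaction (if you collect feedback, track the thumbs-up rate)

Performance metrics:

  • p50, p95, and p99 latency (averages hide the tail)
  • Iterations per task (high counts may signal loops that end at the iteration cap)
  • Tool call frequency

Cost metrics:

  • Cost per task (mean, p95, max)
  • Cost per user per day (guards against abuse and runaway cost)
  • Cost by model (in multi-model setups)

Reliability metrics:

  • Error rate by error type (transient, permanent, validation)
  • Tool failure rate by tool
  • Fallback usage rate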
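Several of these metrics can be derived directly from stored traces. The sketch below is one possible aggregation, assuming each trace has been flattened to a dict with hypothetical keys user_id, exit_reason, latency_s, and cost_usd. It uses the nearest-rank percentile definition; dashboard tools may interpolate differently.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(traces):
    latencies = [t["latency_s"] for t in traces]
    costs = [t["cost_usd"] for t in traces]
    cost_by_user = defaultdict(float)
    for t in traces:
        cost_by_user[t["user_id"]] += t["cost_usd"]
    return {
        "task_success_rate": sum(t["exit_reason"] == "success" for t in traces) / len(traces),
        "latency_mean_s": sum(latencies) / len(latencies),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "latency_p99_s": percentile(latencies, 99),
        "cost_per_task_mean_usd": sum(costs) / len(costs),
        "cost_per_task_p95_usd": percentile(costs, 95),
        "cost_per_task_max_usd": max(costs),
        "cost_by_user_usd": dict(cost_by_user),
    }


# Illustrative data: 90 fast successful requests and 10 slow ones that hit the iteration cap.
traces = (
    [{"user_id": "u1", "exit_reason": "success", "latency_s": 1.0, "cost_usd": 0.04}] * 90
    + [{"user_id": "u2", "exit_reason": "max_iterations", "latency_s": 12.0, "cost_usd": 0.13}] * 10
)
summary = summarize(traces)
print(summary["latency_mean_s"], summary["latency_p95_s"], summary["task_success_rate"])
```

In this made-up sample the mean latency is about 2.1 seconds while p95 is 12 seconds, which is the "averages hide the tail" effect in miniature.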

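A production dashboard with 8 to 12 metrics covers most of what you need to see. We usually build it in Grafana on top of the trace store, or in the native dashboards of Langfuse / Opik / LangSmith.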

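Layer 3: Alerts

Dashboard metrics help you investigate. Alerts catch problems before customers notice them.

Minimal alert set:

Error rate spike. Alert if the error rate exceeds 5% over any 5-minute window. Catches sudden failures caused by tool outages, model provider issues, or bad deploys.

Latency degradation. Alert if p95 latency stays above your SLA for 10 minutes. Catches gradual degradation from increased load, retries, or downstream problems.

Cost overrun. Alert if daily spend exceeds the budget threshold. Catches runaway loops, abuse, and model provider pricing changes.

Quality drop. Alert if the weekly eval score falls by more than 5 points. Catches silent degradation before customers do.

Critical dependency down. Alert when the circuit breaker opens on an external dependency. Catches cascading failures before they cascade.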
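The five rules can be expressed as plain threshold checks. The sketch below assumes a separate job has already computed the current readings; the Snapshot and Thresholds names, their fields, and the 200 USD budget are hypothetical, and in practice these rules usually live in the alerting system's own rule format rather than in application code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snapshot:
    """Current readings pulled from the metrics store (all field names are hypothetical)."""
    error_rate_5m: float          # failed / total requests over the last 5 minutes
    p95_over_sla_minutes: float   # how long p95 latency has stayed above the SLA
    daily_spend_usd: float        # LLM spend so far today
    weekly_eval_drop: float       # last week's eval score minus this week's, in points
    open_circuits: List[str] = field(default_factory=list)  # dependencies with an open breaker


@dataclass
class Thresholds:
    max_error_rate: float = 0.05           # 5% over a 5-minute window
    max_p95_over_sla_minutes: float = 10.0
    daily_budget_usd: float = 200.0        # hypothetical budget; set your own
    max_weekly_eval_drop: float = 5.0


def evaluate_alerts(s: Snapshot, t: Optional[Thresholds] = None) -> List[str]:
    """Return the names of the alert rules that fire for this snapshot."""
    t = t or Thresholds()
    alerts = []
    if s.error_rate_5m > t.max_error_rate:
        alerts.append("error_rate_spike")
    if s.p95_over_sla_minutes >= t.max_p95_over_sla_minutes:
        alerts.append("latency_degradation")
    if s.daily_spend_usd > t.daily_budget_usd:
        alerts.append("cost_overrun")
    if s.weekly_eval_drop > t.max_weekly_eval_drop:
        alerts.append("quality_drop")
    for dependency in s.open_circuits:
        alerts.append(f"dependency_down:{dependency}")
    return alerts


snapshot = Snapshot(
    error_rate_5m=0.08,
    p95_over_sla_minutes=3.0,
    daily_spend_usd=260.0,
    weekly_eval_drop=1.5,
    open_circuits=["payments_api"],
)
fired = evaluate_alerts(snapshot)
print(fired)
```

Returning rule names instead of sending notifications keeps the checks easy to test; routing to a pager or chat channel can be layered on top.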

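Five alert rules are the floor. Most teams accumulate more as they learn what to watch for. The failure mode here is having no alert rules at all and saying "we just check the dashboard now and then."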

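Layer 4: Evaluation pipeline

The fourth layer is offline evaluation. Traces show what production is doing; evaluations tell you whether what it is doing is correct.

Minimal evaluation pipeline:

  • 100 to 500 ground-truth evaluation questions with expected answers
  • A weekly scheduled run of those questions against the production pipeline
  • Pass-rate tracking over time
  • An alert if the pass rate drops by more than 5 points week over week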
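A bare-bones version of this loop is sketched below. It grades by normalized exact match purely as a stand-in, which is an assumption of the sketch; a real pipeline would substitute a proper grader such as Ragas metrics or an LLM judge. The answer_fn callable, the sample questions, and last week's 100% pass rate are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not fail a case."""
    return " ".join(text.lower().split())


def pass_rate(eval_set, answer_fn):
    """Percentage of eval items whose answer matches the expected answer."""
    passed = sum(
        normalize(answer_fn(item["question"])) == normalize(item["expected"])
        for item in eval_set
    )
    return 100.0 * passed / len(eval_set)


def should_alert(previous_rate, current_rate, max_drop_points=5.0):
    """True when the pass rate fell by more than max_drop_points since the last run."""
    return previous_rate - current_rate > max_drop_points


eval_set = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
    {"question": "How do I reset my password?", "expected": "Use the reset link"},
    {"question": "Is there a free tier?", "expected": "Yes"},
]

# Stand-in for the production agent: three right answers and one wrong one.
canned_answers = {
    "What is the refund window?": "30 Days",
    "Which plan includes SSO?": "enterprise",
    "How do I reset my password?": "Contact support",
    "Is there a free tier?": "yes",
}

this_week = pass_rate(eval_set, canned_answers.get)
print(this_week, should_alert(previous_rate=100.0, current_rate=this_week))
```

Storing each week's pass rate alongside the model and prompt versions in use makes it easier to tie a drop to a specific change.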

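We use Ragas for most clients. In a single run it computes faithfulness, answer relevancy, context recall, and answer correctness. The compute cost of a weekly 100-question run is under $5.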

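The evaluation set is the piece most teams underinvest in. It should reflect the real distribution of production queries, not just what the engineers imagine users will ask. To build a good one, wait until you have two weeks of production traffic, sample 100 real queries, write the expected answers, and use them as ground truth going forward. Refresh the set every quarter.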
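The sampling step might be sketched as follows, assuming production queries have been exported from the trace store as a list of strings. The function name and the seed are hypothetical. The sketch samples from the raw log without de-duplicating first, so frequent queries stay proportionally more likely to be picked, which is one simple way to mirror the production distribution.

```python
import random


def sample_eval_candidates(queries, k=100, seed=0):
    """Pick up to k production queries at random, leaving expected answers to be written by hand."""
    cleaned = [q.strip() for q in queries if q.strip()]
    rng = random.Random(seed)  # fixed seed so the same log yields the same sample
    picked = rng.sample(cleaned, min(k, len(cleaned)))
    return [{"question": q, "expected": None} for q in picked]


# Illustrative log: 500 synthetic queries standing in for two weeks of traffic.
production_queries = [f"example query {i}" for i in range(500)]
candidates = sample_eval_candidates(production_queries)
print(len(candidates), candidates[0])
```

The expected field is left empty on purpose: writing those answers is the manual part of building the set.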

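What we found in the founder's system

Six weeks after launch, we added instrumentation, and the traces showed the problem immediately:

  • One specific agent (the router) was hitting its exit four times as often as at launch
  • p95 latency had gone from 4 seconds to 12 seconds
  • Cost per task had gone from 4 cents to 13 cents
  • The change began exactly when the base model provider pushed an update, three weeks earlier

The fix was quick. The new model version interpreted one instruction differently from the old one, so the router agent kept asking for clarification on requests that the old model had answered directly. We added two examples to sharpen the router's prompt. The agent stopped looping, and cost and latency returned to baseline.

With traces in place, the whole investigation took four hours. Without traces, the team had gone six weeks without pinning down the fault.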

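Recommendation: install before launch

The argument for waiting: "We are moving fast; we will add it after launch." The argument for instrumenting before launch: "Adding it later is expensive (retrofitted instrumentation, lost data, incident panic); adding it now costs almost nothing."

The minimal stack:

  1. Tracing (Langfuse / Opik / LangSmith): one day, propagating request IDs through the code.
  2. A dashboard with 8 to 12 metrics: one day, built on top of the traces.
  3. Five alert rules: one day, wired into your alerting system.
  4. Evaluation pipeline (Ragas + a schedule): two days to set up with an evaluation set, run weekly.

One week of work. That is less time than it takes to debug a single production mystery.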

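If your agent is already in production without traces

If your team has shipped an AI agent and you cannot answer "why were responses slower last week?" with data, the gap is observability.

Sapota offers an observability engagement that can be delivered within a week: installing Langfuse (or your preferred stack), instrumenting the existing agent code, building production dashboards, configuring alert rules, and setting up the evaluation pipeline. We have done this for six or seven agent systems, mostly for teams that shipped without instrumentation and later ran into mysterious production problems.

Contact us through the AI Engineering page. Tell us the current state of your agent stack and the problems you are seeing. The first conversation usually reveals which layer is missing and what to install first.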