Cutting Agent Latency from 30 Seconds to 8 Without Switching Models
SapotaCorp · 2026-05-24 · via DEV Community

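A founder came to us with an engineering question that was hiding a UX problem. Their team had shipped an AI chat product, and users were abandoning conversations before the agent finished answering. The team had measured p95 response latency at 31 seconds. Their conclusion: switch to a faster model.

The model itself accounted for about 35% of total latency. The other 65% came from sequential tool calls, unnecessary intermediate LLM steps, and a missing streaming layer. Switching to a smaller model might have cut about 5 seconds off the worst case, but answer quality would have dropped with it.

We made four changes without touching the model. P95 latency dropped from 31 seconds to 8 seconds. User abandonment fell by 70%. The model never changed.

This is the latency stack that most teams either do not measure or do not know how to optimize. Here is the pattern.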

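Where the time goes

Before optimizing, instrument the agent and trace where the time is spent. A typical breakdown for a multi-step agent looks like this:

  • LLM calls (cumulative): 30-50% of total latency
  • Tool calls (cumulative): 30-40% of total latency
  • Network and serialization overhead: 5-10%
  • Application logic between steps: 5-10%

The LLM portion cannot be optimized much without changing the model. The remaining 50-70% is where the wins are, yet most teams overlook it because the LLM looks like the obvious bottleneck.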
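A breakdown like the one above can be collected with very little code. The following is a minimal sketch, not the instrumentation used in the original audit: `LatencyTrace` and its category names are hypothetical, and `time.sleep` stands in for real LLM and tool calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyTrace:
    """Accumulates wall-clock time per category for one agent request."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def span(self, category):
        # Time the enclosed block and add it to the category's running total.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[category] += time.perf_counter() - start

    def shares(self):
        # Fraction of total traced time spent in each category.
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}


trace = LatencyTrace()
with trace.span("llm"):
    time.sleep(0.04)  # stand-in for an LLM call
with trace.span("tool"):
    time.sleep(0.01)  # stand-in for a tool call
with trace.span("llm"):
    time.sleep(0.02)  # a second LLM call lands in the same bucket

for name, share in sorted(trace.shares().items()):
    print(f"{name}: {share:.0%}")
```

Wrapping every LLM call and tool call in a span like this is usually enough to tell whether the model or the structure dominates a slow request.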

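The founder's agent:

  • Made 8 LLM calls per response (a planner, 4 tool dispatchers, a critic, and 2 retries)
  • Made 6 tool calls (4 distinct tools plus 2 retries)
  • Ran everything sequentially
  • Streamed nothing back to the user

Total p95 was 31 seconds. The model accounted for 11 seconds. The other 20 seconds were structural.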

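Change 1: Parallelize independent tool calls

The biggest win in most agent systems is running tool calls in parallel.

The founder's agent had a step that needed to look up the user's account, the customer's order history, and the related support tickets. The original code did these one after another:

  1. Look up the account (1.5 seconds)
  2. Look up the orders (2.0 seconds)
  3. Look up the tickets (1.8 seconds)

Total: 5.3 seconds, all of it spent waiting.

The fix was parallel dispatch with async-await:

  1. Dispatch all three lookups at the same time.
  2. Wait for all three to finish.
  3. Total time: 2.0 seconds (the slowest of the three).

This single change saved 3.3 seconds per request. The actual code change was only 8 lines.

The pattern: whenever an agent calls several tools whose results do not depend on each other, those calls can run in parallel. Use asyncio.gather in Python and Promise.all in JavaScript. Modern agent frameworks (LangGraph, CrewAI Flows) support this natively.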
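The sequential and parallel versions can be compared side by side. This is a sketch under stated assumptions: the three lookup functions and their return values are hypothetical, `asyncio.sleep` stands in for network I/O, and the delays are the article's figures scaled down by 10 so the demo finishes quickly.

```python
import asyncio
import time


# Hypothetical tool calls; asyncio.sleep stands in for network I/O.
async def lookup_account(user_id):
    await asyncio.sleep(0.15)
    return {"user_id": user_id}


async def lookup_orders(user_id):
    await asyncio.sleep(0.20)
    return [{"order_id": 1}]


async def lookup_tickets(user_id):
    await asyncio.sleep(0.18)
    return []


async def gather_context_sequential(user_id):
    # Each await finishes before the next lookup starts.
    account = await lookup_account(user_id)
    orders = await lookup_orders(user_id)
    tickets = await lookup_tickets(user_id)
    return [account, orders, tickets]


async def gather_context_parallel(user_id):
    # The lookups are independent, so dispatch all three and wait once.
    return await asyncio.gather(
        lookup_account(user_id),
        lookup_orders(user_id),
        lookup_tickets(user_id),
    )


async def timed(coro):
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    seq_result, seq = await timed(gather_context_sequential("u1"))
    par_result, par = await timed(gather_context_parallel("u1"))
    print(f"sequential: {seq:.2f}s, parallel: {par:.2f}s")
    return seq_result, par_result, seq, par


seq_result, par_result, seq_seconds, par_seconds = asyncio.run(main())
```

`asyncio.gather` returns results in the order the awaitables were passed, so the calling code can unpack them exactly as it did in the sequential version.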

Audit your agent for sequences of tool calls that could run in parallel. There are almost always one or two.

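Change 2: Remove unnecessary intermediate steps

The original agent had a planner LLM that generated a plan, a critic LLM that reviewed the plan, and finally an executor.

The critic rejected the plan about 8% of the time, so in the other 92% of requests it added an LLM call (3 seconds) for no benefit. In the 8% of cases it did reject, the planner produced a similar plan on the next iteration, so the critic was not adding real value there either.

We removed the critic, kept the planner, and added validation rules. (Do the referenced tools actually exist? Are there loops? Does the number of steps exceed the limit?) Validation runs in milliseconds rather than seconds and catches the same problems the critic did.

That saved 3 seconds per request on average. Quality on the eval set was unchanged.

The pattern: every LLM call in the chain should earn its place. If a node rejects only a small fraction of its inputs and the alternative is a deterministic check, the deterministic check is faster and usually better.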
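A deterministic validator covering the three checks named above might look like the sketch below. The plan shape (`Step` with `tool` and `depends_on`), the tool names in `KNOWN_TOOLS`, and the `MAX_STEPS` value are all assumptions for illustration; the original article does not show its plan format.

```python
from dataclasses import dataclass, field


# Hypothetical plan shape: ordered steps, each naming a tool and the
# indices of other steps whose output it consumes.
@dataclass
class Step:
    tool: str
    depends_on: list = field(default_factory=list)


KNOWN_TOOLS = {"lookup_account", "lookup_orders", "lookup_tickets", "policy_lookup"}
MAX_STEPS = 8  # assumed limit, not taken from the article


def has_cycle(steps):
    # Depth-first search over the dependency graph; revisiting a step
    # that is still being explored means the plan loops.
    state = {}  # step index -> "visiting" or "done"

    def visit(i):
        if state.get(i) == "done":
            return False
        if state.get(i) == "visiting":
            return True
        state[i] = "visiting"
        if any(visit(d) for d in steps[i].depends_on):
            return True
        state[i] = "done"
        return False

    return any(visit(i) for i in range(len(steps)))


def validate_plan(steps):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    indices_ok = True
    if len(steps) > MAX_STEPS:
        problems.append(f"plan has {len(steps)} steps, limit is {MAX_STEPS}")
    for i, step in enumerate(steps):
        if step.tool not in KNOWN_TOOLS:
            problems.append(f"step {i} references unknown tool {step.tool!r}")
        for d in step.depends_on:
            if not 0 <= d < len(steps):
                indices_ok = False
                problems.append(f"step {i} depends on missing step {d}")
    # Only search for loops once every dependency index is known to be valid.
    if indices_ok and has_cycle(steps):
        problems.append("plan contains a dependency loop")
    return problems


print(validate_plan([Step("lookup_account"), Step("lookup_orders", [0])]))
```

A failed validation can be fed back to the planner as a plain error message, which keeps the retry path without paying for a critic call on every request.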

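Change 3: Stream the response back to the user

The original agent waited until the full response had been generated before showing anything. Users stared at a spinner for 31 seconds, and then the whole response appeared at once.

Streaming changes perceived latency dramatically. Users see tokens within 1 to 2 seconds, even if the full response takes 8 seconds. The agent is not actually faster, but it feels faster, and abandonment drops because users get visible feedback.

The implementation is only a few lines of code. The OpenAI and Anthropic APIs both support streaming natively. The frontend has to handle server-sent events (SSE) or WebSockets. The UX improvement is large and immediate.

Multi-step agents can also stream intermediate progress: "Checking your account... Looking up order history... Generating response..." This is qualitatively different from a spinner. People will wait 8 seconds watching progress; they will not wait 8 seconds watching a spinner.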
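The progress-plus-tokens idea can be sketched as an async generator that emits server-sent events. This is an illustration only: `run_agent`, `fake_llm_tokens`, the event names, and the sample tokens are all hypothetical, and the `asyncio.sleep` calls stand in for the real lookups and the real streaming model call.

```python
import asyncio
import json


def sse(event, data):
    # SSE wire format: an "event:" line, a "data:" line, then a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_llm_tokens(prompt):
    # Stand-in for a streaming model call that yields tokens as they arrive.
    for token in ["Your ", "order ", "has ", "shipped."]:
        await asyncio.sleep(0.01)
        yield token


async def run_agent(question):
    # Emit a progress event before each slow step so the user sees movement.
    yield sse("progress", {"message": "Checking your account..."})
    await asyncio.sleep(0.01)  # stand-in for the account lookup
    yield sse("progress", {"message": "Looking up order history..."})
    await asyncio.sleep(0.01)  # stand-in for the order lookup
    yield sse("progress", {"message": "Generating response..."})
    async for token in fake_llm_tokens(question):
        yield sse("token", {"text": token})
    yield sse("done", {})


async def collect(question):
    return [chunk async for chunk in run_agent(question)]


events = asyncio.run(collect("Where is my order?"))
print("".join(events))
```

In a web server, the generator would be handed to the framework's streaming response with the `text/event-stream` content type instead of being collected into a list.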

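Change 4: Cache aggressively

Some queries repeat. Some intermediate steps repeat across queries. Both can be cached.

The founder's agent had a "policy lookup" tool that fetched rules from a small company knowledge base. Those rules changed about once a month, yet every lookup hit the knowledge base and ran a vector search. We added a simple in-memory cache with a 5-minute TTL. The cache hit rate reached 35%, and each cached request saved 0.8 seconds.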
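An in-memory cache with a 5-minute TTL needs only a dictionary and a clock. The sketch below assumes the exact query string is the cache key; `TTLCache`, `search_policy_kb`, and `policy_lookup` are hypothetical names, and the injectable `clock` parameter exists only to make expiry testable.

```python
import time


class TTLCache:
    """Small in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: recompute and store with a fresh expiry.
        self.misses += 1
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value


policy_cache = TTLCache(ttl_seconds=300)


def search_policy_kb(query):
    # Hypothetical stand-in for the vector search against the knowledge base.
    return f"policy text for {query!r}"


def policy_lookup(query):
    return policy_cache.get_or_compute(query, lambda: search_policy_kb(query))


policy_lookup("refund window")
policy_lookup("refund window")  # served from the cache
print(policy_cache.hits, policy_cache.misses)
```

Tracking hits and misses on the cache itself makes it easy to report a hit rate like the 35% figure above.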

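Semantic caching is a more advanced technique: cache LLM responses keyed by the embedding of the query. If a new query is semantically close to a recent one, return the cached response. We use this carefully (only for queries with high confidence and low staleness risk), but a cache hit saves 5 seconds or more.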
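The mechanics of a semantic cache can be shown with a toy embedding. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words counter over a fixed vocabulary standing in for a real embedding model, the 0.95 threshold is arbitrary, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
import time


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Caches responses keyed by query embedding, with a TTL per entry."""

    def __init__(self, embed, threshold=0.95, ttl_seconds=300, clock=time.monotonic):
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = []  # (expires_at, embedding, response)

    def get(self, query):
        now = self.clock()
        # Drop expired entries to limit the risk of stale answers.
        self._entries = [e for e in self._entries if e[0] > now]
        vector = self.embed(query)
        best = max(self._entries, key=lambda e: cosine(vector, e[1]), default=None)
        if best is not None and cosine(vector, best[1]) >= self.threshold:
            return best[2]
        return None

    def put(self, query, response):
        self._entries.append((self.clock() + self.ttl, self.embed(query), response))


# Toy embedding: word counts over a tiny fixed vocabulary.
VOCAB = ["refund", "policy", "shipping", "time", "what", "is", "the"]


def toy_embed(text):
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("What is the refund policy?", "cached answer A")
print(cache.get("what is the REFUND policy"))   # close enough: cache hit
print(cache.get("What is the shipping time?"))  # too different: cache miss
```

A high threshold and a short TTL are the two knobs that keep this conservative, matching the "high confidence, low staleness risk" constraint described above.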

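The aggressive caching pattern: check every tool call and every LLM call for cacheability. Most of them are cacheable, yet few teams ever ask. The latency savings compound over time.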

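What the founder ended up with

After the four changes:

  • Parallel tool calls: -3 seconds
  • Removing the critic LLM call: -3 seconds
  • Streaming: time to first token from 31 seconds down to 1-2 seconds
  • Caching: -2 seconds on average
  • Total p95: 31 seconds → 8 seconds

In the first two weeks, user abandonment dropped by 70%. Customer satisfaction went up. The model and the agent's core logic did not change.

The team's original idea (switching to a smaller model) might have saved 5 seconds at the cost of quality. The structural changes saved 23 seconds with quality intact.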

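Latency optimization checklist

Before accepting "switch to a faster model" as the fix for latency, go through this list:

  • Are tool calls running sequentially when they could run in parallel? Almost without exception, there is at least one group that can be parallelized.
  • Does every LLM step earn its place? Critics, validators, and refiners that fire on every request can often be replaced with deterministic checks.
  • Does the user see progress, or only a spinner? Streaming alone often improves perceived latency more than a model switch does.
  • Can repeated work be cached? Tool calls, LLM calls, and retrieved chunks can mostly be cached with a TTL.
  • Can non-critical steps use a smaller model? Use GPT-4o for the final answer, where quality matters; use GPT-4o-mini for planning, routing, and validation.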
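The last checklist item can be as small as a lookup. In this sketch the step names (`plan`, `route`, `validate`) and the `pick_model` helper are hypothetical; only the two model names come from the article.

```python
FINAL_MODEL = "gpt-4o"       # quality-critical final answer
LIGHT_MODEL = "gpt-4o-mini"  # planning, routing, validation

# Hypothetical step kinds that tolerate a smaller model.
LIGHT_STEPS = {"plan", "route", "validate"}


def pick_model(step_kind):
    # Anything not explicitly marked as lightweight gets the stronger model.
    return LIGHT_MODEL if step_kind in LIGHT_STEPS else FINAL_MODEL


for kind in ["plan", "route", "validate", "final_answer"]:
    print(kind, "->", pick_model(kind))
```

Defaulting unknown step kinds to the stronger model is a deliberate choice here: a forgotten entry costs some latency rather than answer quality.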

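Most agent systems can improve latency by 50-70% without touching the base model. Audit before you switch.

Switching models is the last resort

Switching to a smaller, faster model is the right move for some agents. But it is a last resort, not a first choice. Its cost is answer quality, and that cost is permanent. Structural improvements, by contrast, cost nothing in quality.

The pattern SapotaCorp recommends: optimize the structure first, measure the new latency, and only then decide whether a model switch is still needed. Often it is not. If it is, the remaining gap is much smaller and the trade-off is easier to judge.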

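If your agent's latency is hurting the experience

If your team has shipped an AI agent and users are abandoning conversations before it responds, the problem is not always the model. More often it is the structure.

SapotaCorp offers a one-week latency audit that traces where the agent's time actually goes, identifies tool calls that can be parallelized, removes unnecessary LLM steps, finds missed caching opportunities, and delivers working, optimized code. We have done this for chat products, customer support tools, and research assistants. The typical improvement is a 60-75% reduction in latency without changing the model.

Get in touch through the AI engineering page. Tell us your current p95 latency and share sample traces if you have them. If you have no traces, we set up observability first and then run the audit.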