Stop Monitoring AI Systems Like Web Services
Aurimas Griciūnas · 2026-06-14 · via Hacker News - Newest: "AI"

👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in AI Engineering, Data Engineering, Machine Learning and overall Data space.

Many AI systems are still monitored like the web services they sit next to. The API gateway emits uptime, error rates, and latency percentiles, and the dashboards come free with the infrastructure. Unfortunately, none of those numbers can tell you that users stare at a blank screen for four seconds before the first token is displayed, that token spend per task has doubled since the last prompt update, or that the model has started inventing answers around the retrieved context instead of drawing them from it.

The gap exists because an LLM system breaks the assumptions web monitoring was built on. Responses are generated token by token, so “latency” is at least three different numbers depending on where on the timeline you stand. Cost scales with tokens rather than requests. Also, the most damaging failures are silent: a quality regression does not throw a 500, it returns confident text with a 200 status code.

For me personally it helps to group metrics by the question they answer. Five questions cover most of what goes wrong in production: is it fast, can it scale, is it correct, does it hold up, and when there is an agent in the loop, how does it behave. This article walks through each group, what the metrics mean mechanically, and which ones you have to build yourself because nothing emits them by default.

The metrics map

Before moving to the questions, next week on Thursday I will be running a workshop - From AI Demo to Deployed App.

AI engineers can build the backend, but most stop when it’s time to put a usable frontend in front of real users.

Join me and learn how to:

  • Port a Streamlit prototype or a vibecoded frontend app to v0

  • Ship the app to production on Vercel

  • Ship a new UI feature on top of an existing backend


June 18th, 15:00 GMT

An LLM request has two phases, and they produce different latency numbers. During prefill the model ingests the entire prompt and builds its internal state, with nothing visible to the user while it happens. During decode the model generates output one token at a time. Every latency metric worth tracking is a position on that timeline.

Time to first token (TTFT) is the amount of time from sending the request to the first token arriving, which is queueing time plus prefill. In a streaming UI this is the number users feel (perceived latency), because it is exactly how long they look at a blank screen. TTFT grows with prompt length, which is why RAG systems that pack large contexts into the prompt pay for it in perceived speed.

Inter-token latency (ITL, also reported as time per output token, TPOT) is the gap between consecutive tokens once streaming starts, and it determines whether the output reads as smoothly flowing text. Users tolerate a slow but steady stream far better than a fast one that freezes.
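As a rough illustration, the timeline metrics can be derived from arrival timestamps alone. This is a minimal sketch, assuming you record one clock reading when the request is sent and one per received token; the function name `stream_latency` and the example timestamps are hypothetical.

```python
from statistics import mean


def stream_latency(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Derive timeline metrics (in seconds) from token arrival timestamps.

    request_sent and token_times are assumed to come from the same clock,
    for example time.monotonic() on the client.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    # Gaps between consecutive tokens once streaming has started.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft": token_times[0] - request_sent,          # queueing + prefill
        "mean_itl": mean(gaps) if gaps else 0.0,        # average smoothness
        "max_itl": max(gaps, default=0.0),              # the worst freeze
        "end_to_end": token_times[-1] - request_sent,   # full span
    }


# Hypothetical stream: a long wait for the first token, then one visible stall.
metrics = stream_latency(0.0, [4.0, 4.5, 5.0, 7.0])
```

Tracking the maximum gap next to the mean is a deliberate choice here: a stream that is fast on average can still freeze, and the mean alone hides that.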

End-to-end latency at p50, p95, and p99 is the full span, and output length dominates it. That makes a single global percentile close to meaningless: it mixes 50-token classification calls with 2,000-token report generations in one distribution. Track end-to-end latency per use case, so each number has one workload behind it.
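A minimal sketch of the per-use-case split, assuming each logged request carries a use-case label and an end-to-end latency in seconds. The record shape and the function name `latency_percentiles` are assumptions for illustration, not something a serving stack gives you by default.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(records):
    """Compute p50/p95/p99 per use case.

    records: iterable of (use_case, latency_seconds) pairs (assumed log shape).
    """
    by_use_case = defaultdict(list)
    for use_case, latency in records:
        by_use_case[use_case].append(latency)

    report = {}
    for use_case, values in by_use_case.items():
        if len(values) < 2:
            continue  # not enough samples to talk about percentiles
        # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
        cuts = quantiles(values, n=100, method="inclusive")
        report[use_case] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report
```

With only a handful of samples the tail percentiles are interpolated and not very informative, so it is worth also logging the sample count behind each number.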

Agents add a compounding effect. A task that chains several sequential LLM calls pays the per-call latency at every step, so a tolerable per-call p95 can become an intolerable task-level latency. For agentic workloads, set the latency budget at the task level and let it constrain the steps.

Inference latency metrics

Tokens per second per user versus total system throughput. Serving systems batch concurrent requests on the same hardware, and the batch size is what you can adjust: larger batches improve total system tokens per second while each individual stream slows down. The two metrics trade off against each other on the same GPUs, so decide per workload which one you are protecting. Interactive chat should favour per-user speed, while offline batch processing should favour total throughput.

Input and output tokens per request. This is the core of unit economics, with output tokens typically priced at a multiple of input tokens. Log both per request, break them down by use case, and monitor them carefully.
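The arithmetic is simple enough to sketch. The prices below are placeholders, not any provider's real rates, and the `Pricing` type and `request_cost` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens. The values used below are placeholders."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of a single request from its logged token counts."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Hypothetical pricing where output tokens cost 4x input tokens.
example_pricing = Pricing(input_per_million=1.0, output_per_million=4.0)
cost = request_cost(input_tokens=2_000, output_tokens=500, pricing=example_pricing)
```

In this example the 500 output tokens cost as much as the 2,000 input tokens, which is why both counts need to be logged separately.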

Cache hit rate. Prompt caching makes repeated prompt prefixes dramatically cheaper to process. It is one of the largest single cost levers in systems with stable system prompts, tool definitions, or shared document context. It is also easy to get wrong, because caching matches the prompt prefix byte for byte: a timestamp interpolated near the front of the prompt or a reordered tool list invalidates the cache. Monitor for a falling cache hit rate, because it can signal that someone changed how prompts are assembled.
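A minimal sketch of both ideas, under stated assumptions: the usage field names `input_tokens` and `cached_tokens` are hypothetical (providers name these fields differently), and the prefix comparison works on characters as a stand-in for whatever unit the provider's cache actually matches on.

```python
def cache_hit_rate(usages):
    """Share of input tokens served from cache across a set of requests.

    usages: iterable of dicts with 'input_tokens' and 'cached_tokens'
    (assumed field names; map your provider's usage fields onto them).
    """
    usages = list(usages)
    total = sum(u["input_tokens"] for u in usages)
    cached = sum(u["cached_tokens"] for u in usages)
    return cached / total if total else 0.0


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two prompts, a proxy for what can be reused."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM_PROMPT = "You are a support assistant. Answer from the provided context. "

# Volatile value at the end: the whole system prompt stays a shared prefix.
stable_first = shared_prefix_len(SYSTEM_PROMPT + "Now: 10:01.", SYSTEM_PROMPT + "Now: 10:02.")
# Volatile value at the front: the shared prefix ends inside the timestamp.
timestamp_first = shared_prefix_len("Now: 10:01. " + SYSTEM_PROMPT, "Now: 10:02. " + SYSTEM_PROMPT)
```

Running a check like `shared_prefix_len` over consecutive production prompts is one cheap way to find out which part of prompt assembly broke the prefix when the hit rate drops.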

Cost per successful task, not cost per request. Cost per request makes systems that fail cheaply look good. A request that costs half as much but succeeds a third as often is 50% more expensive where it counts, and failed attempts trigger retries that multiply spend invisibly. This is also the connective metric across all five groups: you can push latency, cost, or quality in isolation, and cost per successful task is the number that tells you whether the system as a whole got better or worse.
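A minimal sketch of the metric, assuming every attempt is logged with its cost and a success flag. The numbers are made up: a baseline at 0.06 per attempt with 9 of 10 attempts succeeding, and a variant that costs half as much per attempt but succeeds a third as often.

```python
def cost_per_successful_task(attempts):
    """Total spend on all attempts divided by the number of successful ones.

    attempts: iterable of (cost, succeeded) pairs (assumed log shape).
    Failed attempts and retries still count toward the spend.
    """
    total_cost = 0.0
    successes = 0
    for cost, succeeded in attempts:
        total_cost += cost
        successes += 1 if succeeded else 0
    if successes == 0:
        return float("inf")  # money spent, nothing delivered
    return total_cost / successes


# Hypothetical logs for ten attempts each.
baseline = [(0.06, True)] * 9 + [(0.06, False)]
cheaper = [(0.03, True)] * 3 + [(0.03, False)] * 7

baseline_cost = cost_per_successful_task(baseline)  # about 0.067 per success
cheaper_cost = cost_per_successful_task(cheaper)    # about 0.10 per success
```

On cost per request the second variant looks twice as good. On cost per successful task it is about 50% worse.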

Cost per successful task

Nothing in the serving stack emits correctness. Every metric in this group you build yourself.

Task success rate on a labeled eval set. A few dozen to a few hundred examples with known good outcomes, re-run on every prompt change, model upgrade, and retrieval change. This is the regression suite of an AI system. The set does not need to be large to be useful, but it does need to be representative and maintained.
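The regression-suite idea can be sketched in a few lines. Everything here is a stand-in: `keyword_router` plays the role of the real system under test, the four examples are invented, and exact match is only one possible success criterion.

```python
def task_success_rate(eval_set, system, is_success=None):
    """Run the system over a labeled eval set and return the share of successes.

    eval_set: list of dicts with 'input' and 'expected' keys (assumed shape).
    system: callable that maps an input to an output.
    is_success: optional comparison; defaults to exact match.
    """
    if not eval_set:
        raise ValueError("eval set is empty")
    if is_success is None:
        is_success = lambda got, want: got == want
    passed = sum(1 for ex in eval_set if is_success(system(ex["input"]), ex["expected"]))
    return passed / len(eval_set)


def keyword_router(text: str) -> str:
    """Toy stand-in for the real system: routes a message by keyword."""
    if "refund" in text:
        return "refund"
    if "parcel" in text:
        return "shipping"
    return "other"


EVAL_SET = [
    {"input": "refund order 12", "expected": "refund"},
    {"input": "where is my parcel", "expected": "shipping"},
    {"input": "I want my money back", "expected": "refund"},  # the toy router misses this
    {"input": "hello", "expected": "other"},
]

rate = task_success_rate(EVAL_SET, keyword_router)
```

Wiring a function like this into CI, with a threshold that fails the build, is what turns the eval set into an actual regression suite.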

Groundedness for RAG. Is the answer supported by the retrieved context, or invented around it? A system can have excellent retrieval and still hallucinate between what was retrieved and what was written. This is becoming a lot less of a problem with powerful models, but in real production systems you eventually want to use the smallest models possible to reduce cost.

Retrieval precision and recall @ k. Of the k chunks retrieved, precision@k asks how many were actually relevant. Recall@k asks how much of the relevant material made it into those k in the first place. Generation cannot fix what retrieval never surfaced, so when answer quality drops, check the retrieval metrics before rewriting prompts. Retrieval is still the harder problem in AI systems.
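Both definitions translate directly into code. A minimal sketch, assuming retrieved chunks are identified by unique IDs in ranked order and that you have a labeled set of relevant IDs per query; the function name and the example IDs are hypothetical.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Compute precision@k and recall@k for a single query.

    retrieved: ranked list of chunk IDs returned by the retriever (unique IDs assumed).
    relevant: set of chunk IDs labeled as relevant for the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / k                                 # how much of the top k was relevant
    recall = hits / len(relevant) if relevant else 0.0   # how much relevant material surfaced
    return precision, recall


# Hypothetical query: two of the five retrieved chunks are relevant,
# and two other relevant chunks were never retrieved.
precision, recall = precision_recall_at_k(
    retrieved=["a", "b", "c", "d", "e"],
    relevant={"a", "c", "x", "y"},
    k=5,
)
```

Here precision@5 is 0.4 and recall@5 is 0.5: half of the relevant material never reached the model, and no prompt change can recover it.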

LLM-as-judge scores over time, calibrated against human labels. Judges make quality measurable at volume, but they drift: a judge model upgrade can change scores with no change in the system being tested. Calibrate against a sample of human labels at the start, and re-calibrate whenever the judge model changes or the score trend looks too good.
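One way to make calibration concrete is to compare judge verdicts with human labels on the same sample. The sketch below uses raw agreement and Cohen's kappa, which is my choice of statistic rather than something the article prescribes; the label lists are invented.

```python
from collections import Counter


def agreement_and_kappa(judge_labels, human_labels):
    """Raw agreement and Cohen's kappa between judge and human labels.

    Kappa corrects raw agreement for the agreement expected by chance.
    """
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label
    return observed, (observed - expected) / (1 - expected)


# Hypothetical pass/fail verdicts on the same ten examples.
judge = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
human = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
observed, kappa = agreement_and_kappa(judge, human)
```

Storing these two numbers alongside the judge model version gives you a baseline to compare against the next time the judge changes.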

User feedback signals. Explicit thumbs are rare and biased toward the extremes. Regeneration rate and how heavily users edit generated output before accepting it are more frequent and more honest, because they are recorded at the moment the user acts on the output rather than when they feel like rating it.

Quality measurement for RAG systems

Learn more about the lifecycle of AI System development utilising Quality metrics in my previous article:

Agentic AI Flywheels

Error, timeout, and rate-limit rates per provider. The per-provider split matters because aggregate availability hides which dependency is degrading, and rate limits in particular tend to arrive in bursts tied to one provider’s quota.
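A minimal sketch of the per-provider split, assuming each call is logged with a provider name and one outcome label. The outcome labels and provider names are assumptions for illustration.

```python
from collections import Counter, defaultdict

FAILURE_OUTCOMES = ("error", "timeout", "rate_limited")


def failure_rates_by_provider(calls):
    """Per-provider rate of each failure outcome.

    calls: iterable of (provider, outcome) pairs, where outcome is 'ok'
    or one of FAILURE_OUTCOMES (assumed log shape).
    """
    counts = defaultdict(Counter)
    for provider, outcome in calls:
        counts[provider][outcome] += 1
    return {
        provider: {
            outcome: outcome_counts[outcome] / sum(outcome_counts.values())
            for outcome in FAILURE_OUTCOMES
        }
        for provider, outcome_counts in counts.items()
    }


# Hypothetical traffic: all rate limiting comes from one provider.
calls = (
    [("provider_a", "ok")] * 8
    + [("provider_a", "rate_limited")] * 2
    + [("provider_b", "ok")] * 10
)
rates = failure_rates_by_provider(calls)
```

In aggregate this traffic shows a 10% rate-limit rate. Split by provider, it is 20% on one dependency and zero on the other.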

Retry and fallback rate. Falling back to a backup model keeps availability green while quality quietly changes. Track quality metrics per serving path.

Guardrail trigger and refusal rates. A spike in either means user behavior drifted, an injection attempt might be happening, or a model change moved the refusal boundary. In any case, both are worth tracking, and doing so is cheap once the guardrail actions are properly logged.

Guardrails and model routing

Learn more about how to measure your AI systems hands-on in my End-to-End AI Engineering Bootcamp.

Kicking off in a week.

Apply code LASTCHANCE15 for 15% off.


Agents fail in ways single LLM calls do not, because the model makes a sequence of decisions and each one can quietly degrade.

Tool-call error rate, split by cause. Schema and argument errors mean the model misunderstands the tool, and the fix is in the tool description. Execution errors mean the tool itself broke, and the fix is in your code.
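The split can be made at the point where the tool is invoked. This is a minimal sketch with a deliberately crude schema check (required and optional argument names only); `get_weather` is a made-up tool, and a real system would more likely validate against the tool's full schema.

```python
from collections import Counter


def classify_tool_call(tool, args, required, optional=frozenset()):
    """Run a tool call and classify it as 'ok', 'schema_error', or 'execution_error'."""
    keys = set(args)
    # Wrong or missing argument names: the model misunderstood the tool.
    if not required <= keys or not keys <= (required | optional):
        return "schema_error"
    try:
        tool(**args)
    except Exception:
        # Arguments were well-formed, so the tool itself failed.
        return "execution_error"
    return "ok"


def get_weather(city: str) -> str:
    """Hypothetical tool that fails for an unknown city."""
    if city == "Atlantis":
        raise LookupError("unknown city")
    return f"Sunny in {city}"


attempted_calls = [
    {"city": "Vilnius"},   # valid call
    {"town": "Vilnius"},   # model used the wrong argument name
    {"city": "Atlantis"},  # valid arguments, tool fails
]
breakdown = Counter(
    classify_tool_call(get_weather, args, required={"city"}) for args in attempted_calls
)
```

Logging the label with every call is enough to chart the two error rates separately.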

Steps and tokens per completed task. Success rate holding steady while steps per task increase means cost is rising while accuracy is not. Watch this after every model swap and prompt change.

Context window utilization. This is the early warning for compaction and truncation issues. Quality usually degrades as the context fills, so a utilization trend tells you to intervene while the failure is still invisible to users.

I wrote about the state of context engineering in 2026 in one of my latest articles:

State of Context Engineering in 2026

Loop detection. The rate at which an agent repeats an identical tool call with identical arguments. A loop is pure token burn, and it is detectable from the trajectory log with a few lines of code.
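A minimal sketch of such a detector, assuming the trajectory log is a list of tool names with JSON-serializable argument dicts. The function name and the example trajectory are hypothetical.

```python
import json
from collections import Counter


def repeated_call_rate(trajectory):
    """Share of tool calls that repeat an earlier call with identical arguments.

    trajectory: list of (tool_name, args_dict) pairs in call order (assumed log shape).
    """
    seen = Counter()
    repeats = 0
    for tool_name, args in trajectory:
        # Serialize with sorted keys so argument order does not hide a repeat.
        key = (tool_name, json.dumps(args, sort_keys=True))
        if seen[key]:
            repeats += 1
        seen[key] += 1
    return repeats / len(trajectory) if trajectory else 0.0


# Hypothetical trajectory: the same search is issued three times.
trajectory = [
    ("search", {"q": "refund policy"}),
    ("search", {"q": "refund policy"}),
    ("read", {"id": 1}),
    ("search", {"q": "refund policy"}),
]
loop_rate = repeated_call_rate(trajectory)
```

This counts any exact repeat within a task, not only consecutive ones. Whether a repeat after an intervening step is a loop or a legitimate re-check is a judgment call for your workload.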

Looking back across the five groups, a pattern is clear. Latency and reliability show up on day one, because the gateway, the load balancer, and the provider SDK emit them as a side effect of serving traffic. Quality, cost per task, and agent behavior metrics are not produced until you build the pipeline that measures them: token and cache fields logged from every response, an eval set that runs on every change, a judge with calibration, and trajectory logs for agents.

This is also why thorough monitoring and real blind spots coexist so often: the monitoring covers the dimensions the infrastructure reports, and the failures sit in the dimensions it does not. The consequence is a budgeting one: instrumentation for quality and cost per task belongs in the initial build estimate for an AI feature, not in a later review.

You should not slow down your development efforts by trying to implement all of the measures at once. An order that provides the most information per unit of effort:

  1. TTFT and end-to-end latency, per use case. If you stream, TTFT is the user's perceived latency. Both come from timestamps you already have.

  2. Tokens per request and cache hit rate. Both are fields on the API response, so logging them is easy, and they make unit economics clearer immediately.

  3. A labeled eval set with task success rate. Start with fifty examples and the discipline to re-run on every relevant change.

  4. Per-provider error and fallback rates. Cheap to add, and the fallback-quality interaction is a common silent degradation.

  5. Agent metrics, when you ship an agent. Tool-call errors split by cause, steps per task, and context utilization can all be found in the trajectory log.

This list is representative rather than exhaustive. It deliberately leaves out GPU-level serving metrics (KV cache utilization, batch occupancy) that matter when you host inference yourself (I will cover these in a separate article), embedding and data drift, and business-layer metrics such as containment rate or revenue per session.

  • Group metrics by the question they answer. Is it fast, can it scale, is it correct, does it hold up, how does the agent behave.

  • Track latency as positions on the request timeline. TTFT for perceived speed, ITL for stream smoothness, end-to-end per use case. A single global p95 describes none of your workloads.

  • Make cost per successful task the connective number. It is the one metric that joins latency, cost, and quality.

  • Budget in instrumentation work, not just dashboard work. Quality, cost per task, and agent behavior emit nothing by default.

Hope you found the article useful and see you in the next one!
