Real-time Monitoring of LLM-Based Applications
Tony Karrer · 2025-10-03 · via alerting systems Archives – TechEmpower

We’re starting to see a pattern with LLM apps in production: things are humming along… until suddenly they’re not. You start hearing:

  • “Why did our OpenAI bill spike this week?”
  • “Why is this flow taking 4x longer than last week?”
  • “Why didn’t anyone notice this earlier?”

It’s not always obvious what to track when you’re dealing with probabilistic systems like LLMs. But if you don’t set up real-time monitoring and alerting early, especially for cost and latency, you might miss a small issue that quietly escalates into a big cost overrun.

The good news: you don’t need a fancy toolset to get started. You can use OpenTelemetry for basic metrics, or keep it simple with custom request logging. The key is being intentional and catching the high-leverage signals.
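To make the "custom request logging" route concrete, here is a minimal sketch using only the Python standard library. It wraps a single LLM call and writes one JSON log line with token counts, latency, and estimated cost. The prices in `PRICE_PER_1K`, the `log_llm_call` helper, and the shape of the `call` callback are all assumptions for illustration and are not part of any provider's SDK; swap in your real rates and client.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass
from typing import Callable, Tuple

# Hypothetical per-1K-token prices in USD; replace with your provider's real rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

logger = logging.getLogger("llm.requests")


@dataclass
class LLMRequestRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (
        input_tokens / 1000 * PRICE_PER_1K["input"]
        + output_tokens / 1000 * PRICE_PER_1K["output"]
    )


def log_llm_call(
    model: str, call: Callable[[], Tuple[str, int, int]]
) -> Tuple[str, LLMRequestRecord]:
    """Run `call`, which must return (text, input_tokens, output_tokens),
    then log one JSON line describing the request."""
    start = time.perf_counter()
    text, input_tokens, output_tokens = call()
    latency_s = time.perf_counter() - start
    record = LLMRequestRecord(
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        latency_s=round(latency_s, 4),
        cost_usd=round(estimate_cost(input_tokens, output_tokens), 6),
    )
    logger.info(json.dumps(asdict(record)))
    return text, record


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in for a real client call; the token counts are made up.
    log_llm_call("example-model", lambda: ("hello", 1000, 2000))
```

Once every request produces a line like this, cost and latency dashboards are mostly a matter of aggregating the log. The same fields could instead be emitted as OpenTelemetry metrics if you already run a collector.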

Here are some top reads that will help you get your arms around it.

Top Articles

  • A crisp primer that defines token count, latency, and cost as the pillars of observability. It’s tool-agnostic and shows how to wire up Prometheus dashboards via OpenTelemetry.

  • This one gets into the weeds but in a good way. It walks through tagging each request with a prompt ID and user ID so you can trace token spikes back to real root causes. Comes with useful alert rule examples.

  • Useful latency benchmarks per use case: chat, search, RAG. Suggests setting alert thresholds at 20% over your p95 SLOs to catch slippage early.

  • Starts broad, then gets practical. Has a great checklist for real-time dashboards, latency and token gauges, plus rituals like weekly reviews to refine thresholds. Also dives into pros/cons of current tools.

  • A broader take on the space, but solid advice. Introduces a three-layer stack (telemetry → dashboards → alerts) and gives sample PagerDuty rules for token or latency anomalies.
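Two ideas from these reads are easy to sketch in code: tagging each request with a prompt ID and user ID so token spikes can be traced back, and alerting when observed p95 latency runs 20% over the p95 SLO. The example below is a rough illustration under a few assumptions: the `TaggedRequest` fields and helper names are hypothetical, p95 is computed with the simple nearest-rank method, and "20% over" is read as a threshold of 1.2 times the SLO.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TaggedRequest:
    prompt_id: str      # which prompt template produced the request
    user_id: str        # who triggered it
    total_tokens: int
    latency_ms: float


def tokens_by_tag(requests: Iterable[TaggedRequest]) -> Dict[Tuple[str, str], int]:
    """Sum token usage per (prompt_id, user_id) pair to locate the source of a spike."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for r in requests:
        totals[(r.prompt_id, r.user_id)] += r.total_tokens
    return dict(totals)


def p95(values: List[float]) -> float:
    """95th percentile using the nearest-rank method (requires a non-empty list)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_alert(
    requests: List[TaggedRequest], slo_p95_ms: float, margin: float = 0.20
) -> Tuple[bool, float, float]:
    """Return (breached, observed_p95, threshold), where the threshold is
    the SLO plus the given margin (20% by default)."""
    observed = p95([r.latency_ms for r in requests])
    threshold = slo_p95_ms * (1 + margin)
    return observed > threshold, observed, threshold


if __name__ == "__main__":
    # Synthetic window of 20 requests with latencies 100 ms .. 2000 ms.
    window = [
        TaggedRequest("summarize-v2", f"user-{i % 3}", 500 + 10 * i, 100.0 * i)
        for i in range(1, 21)
    ]
    print(tokens_by_tag(window))
    print(latency_alert(window, slo_p95_ms=1000.0))
```

In practice the window would be a rolling slice of recent requests, and a breach would feed whatever paging or chat alert you already use. Revisit the margin during your regular threshold reviews instead of treating 20% as fixed.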