Silent failures in production - why conventional tools miss them and how NotiLens catches them
Stephen Souza · 2026-05-20 · via DEV Community

Your server is up. Your API is responding. Your error rate is zero.

And your business has been quietly bleeding for six hours.

This is the silent failure problem, and it's more expensive than any crash you've ever had.


What is a silent failure?

A silent failure is a failure that produces no error signal.

No exception thrown. No status code changed. No alert fired. No dashboard turned red. The system is technically operating while something in the business logic has quietly stopped working.

Three characteristics define a silent failure:

No error signal — the failure doesn't produce an exception, a non-2xx response, or a log entry that reads like a problem. Everything looks clean.

Invisible onset — silent failures start at a specific moment but aren't noticed until hours or days later, when the damage has already compounded.

Business impact before technical detection — the first signal is usually a user complaint, a dropped metric in a weekly report, or an end-of-day revenue number that looks wrong.


Why silent failures cost more than crashes

A crash is visible. Something turns red. An alert fires. The fix starts within minutes.

The damage window of a crash is bounded — it ends when detection happens.

A silent failure is invisible. The damage window is unbounded — it ends when someone happens to check the right screen, a user complains, or a weekly report surfaces something unexpected.

The math:

$80 average order value
4 orders per hour

Crash detected in 5 minutes  → ~$27 lost
Silent failure for 6 hours   → $1,920 lost

Same root cause. Completely different damage profile.

And that's before refund processing, support tickets, trust damage, and engineering hours spent reconstructing what happened from logs that weren't designed to answer that question.

Fix time is bounded. Detection time isn't. That's where the real cost lives.


Where silent failures hide in production

Payment flows

payment.initiated fires. Stripe delivers the webhook. Your endpoint returns 200. Somewhere between delivery and database write, business logic fails silently. Stripe marks it delivered. Dashboard shows green. Revenue isn't recording.

Or the subtler version — payment.initiated fires and payment.completed never arrives. Each webhook delivers correctly. The sequence never finishes. No error. No alert.
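The unfinished-sequence case can be checked with very little code. Below is a minimal sketch, not how Stripe or NotiLens do it: it assumes you already log each event as a (payment_id, event_name, timestamp) tuple, and the function name `find_stuck_payments` and the 3-minute window are illustrative choices.

```python
from datetime import datetime, timedelta


def find_stuck_payments(events, now, max_gap=timedelta(minutes=3)):
    """Return payment ids that were initiated but not completed within max_gap.

    `events` is an iterable of (payment_id, event_name, timestamp) tuples.
    """
    initiated = {}
    completed = set()
    for payment_id, name, ts in events:
        if name == "payment.initiated":
            initiated[payment_id] = ts
        elif name == "payment.completed":
            completed.add(payment_id)
    return sorted(
        pid for pid, ts in initiated.items()
        if pid not in completed and now - ts > max_gap
    )


now = datetime(2026, 5, 20, 12, 0)
events = [
    ("pay_1", "payment.initiated", now - timedelta(minutes=10)),
    ("pay_1", "payment.completed", now - timedelta(minutes=9)),
    ("pay_2", "payment.initiated", now - timedelta(minutes=8)),  # never completes
    ("pay_3", "payment.initiated", now - timedelta(minutes=1)),  # still inside the window
]
print(find_stuck_payments(events, now))  # ['pay_2']
```

Every individual webhook in this sample was delivered successfully. Only the comparison across events shows that pay_2 never finished.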

Cron jobs

Two variants:

Variant 1: the job stops running entirely. The nightly invoice sync stopped running Thursday. You found out Monday. Four days of unsynced records, zero errors thrown.

Variant 2: the job runs but processes nothing. It starts at midnight and exits clean, having processed zero records. From the outside, a healthy run. From the inside, nothing happened.

Variant 2 is harder to catch because nothing failed by definition. The job completed successfully against zero records.
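A rough way to catch Variant 2 yourself is to compare each run's record count with recent history rather than with its exit code. This is a sketch under assumptions: the function name, the 10% cutoff, and the sample history are all made up for illustration.

```python
from statistics import median


def looks_like_empty_run(records_processed: int, recent_counts: list[int],
                         min_fraction: float = 0.1) -> bool:
    """True if this run handled far fewer records than recent runs did."""
    if not recent_counts:
        return False  # no history yet, nothing to compare against
    typical = median(recent_counts)
    return typical > 0 and records_processed < typical * min_fraction


history = [480, 512, 495, 530, 501]        # made-up counts from earlier nightly runs
print(looks_like_empty_run(0, history))    # True: clean exit, zero records
print(looks_like_empty_run(461, history))  # False: an ordinary night
```

Using the median keeps one unusual night in the history from skewing what counts as typical.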

AI agent loops and ghost runs

AI agents don't crash — they drift, loop, stall, and consume resources while producing nothing.

A looping agent looks identical to a healthy agent doing legitimate multi-step research. Process running. Tool calls firing. Tokens accumulating. What's actually happening — the agent called the same tool 47 times, produced nothing, ran up $4.80 in tokens.

A ghost run is the agent equivalent of the cron job that processed zero records. The agent ran. Completed. Reported task done. Produced nothing meaningful.
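One simple signal for a looping agent is the same tool being called with the same arguments over and over. The sketch below assumes you record each call as a (tool_name, args) pair with hashable args; `find_repeated_calls` and the limit of 5 are hypothetical, and a real agent may legitimately repeat some calls.

```python
from collections import Counter


def find_repeated_calls(tool_calls, max_repeats=5):
    """Return {(tool, args): count} for calls repeated more than max_repeats times.

    `tool_calls` is a list of (tool_name, args) pairs; args must be hashable.
    """
    counts = Counter(tool_calls)
    return {call: n for call, n in counts.items() if n > max_repeats}


# A run that searched once, then fetched the same URL 47 times.
calls = [("search", "silent failures")] + [("fetch", "https://example.com")] * 47
print(find_repeated_calls(calls))  # {('fetch', 'https://example.com'): 47}
```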

Automation workflows

Your Zapier zap hasn't fired since Tuesday. Your n8n workflow silently stopped three days ago. Your Make scenario failed on one step and the entire workflow halted.

Automation failures live outside your main application stack. They don't throw exceptions. They don't affect uptime. They just stop.

Signup and onboarding flows

Your signup flow broke at 2am. Form submits. Confirmation email never sends. User lands on a broken onboarding step. Every component looks healthy. The sequence never completes.

No new signups in 4 hours on a Wednesday when your baseline is 12 per hour. Your monitoring has no alert for that, and no vocabulary for absence.
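To put a number on how abnormal that quiet period is, one option is to ask how likely it would be at the normal rate. The sketch below assumes signups arrive roughly as a Poisson process, which is a modelling assumption of this example and not something the tools discussed here are known to use. The function names and the 0.001 cutoff are illustrative.

```python
import math


def probability_of_silence(baseline_per_hour: float, quiet_hours: float) -> float:
    """P(no events in quiet_hours) if events arrive as a Poisson process."""
    return math.exp(-baseline_per_hour * quiet_hours)


def is_suspicious_silence(baseline_per_hour: float, quiet_hours: float,
                          max_probability: float = 0.001) -> bool:
    """Flag a gap that would almost never happen at the normal rate."""
    return probability_of_silence(baseline_per_hour, quiet_hours) < max_probability


# Baseline from the text: 12 signups per hour.
print(is_suspicious_silence(12, 4))        # True: four quiet hours
print(is_suspicious_silence(12, 10 / 60))  # False: ten quiet minutes
```

Under this assumption, a gap longer than about 35 minutes already falls below the 0.001 cutoff at 12 signups per hour, so an absence check could fire hours before anyone looks at a report.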


Why conventional monitoring misses silent failures

Every tool in your observability stack was built to watch infrastructure. Silent failures are a business logic problem.

Uptime monitoring — watches whether your service responds. A silent failure doesn't affect uptime. The server is up. That's part of what makes it silent.

Error rate monitoring — watches for exceptions and non-2xx responses. Silent failures don't throw exceptions. Error rate stays clean.

APM tools — watch latency, throughput, error rates at the service level. No coverage for business logic correctness or event frequency.

Log monitoring — can surface silent failures but requires you to know what to look for before you can find it. Reactive, not proactive.

Threshold alerts — require you to define abnormal before the failure happens. You can't threshold what you haven't seen yet.

The fundamental gap: every conventional monitoring tool watches for something going wrong. Silent failures are defined by the absence of something going right.


How to detect silent failures with NotiLens

NotiLens is built specifically around the silent failure problem — business pulse monitoring that watches the things that should be happening and alerts the moment they aren't.

Before you start

  1. Create a free account at notilens.com
  2. Create a source in the dashboard — you'll get a token and secret
  3. Install the SDK

Python

pip install notilens

Node.js

npm install @notilens/notilens

Initialize

Python

import notilens

nl = notilens.init(
    name="my-app",
    token="YOUR_TOKEN",
    secret="YOUR_SECRET"
)

# Or use environment variables: NOTILENS_TOKEN / NOTILENS_SECRET
nl = notilens.init(name="my-app")

Node.js

import { NotiLens } from '@notilens/notilens';

const nl = NotiLens.init('my-app', {
  token: 'YOUR_TOKEN',
  secret: 'YOUR_SECRET'
});

// Or, instead of the call above, rely on environment variables
// (NOTILENS_TOKEN / NOTILENS_SECRET):
// const nl = NotiLens.init('my-app');


1. Business event monitoring — silence detection for payments, signups, orders

Track events with nl.track(). NotiLens watches the frequency and fires a silence alert when they stop arriving.

Python

# Track payment events
nl.track("payment.completed", "Payment received", meta={"amount": 149.99})

# Track signups
nl.track("user.signup", "New user registered")

# Track orders
nl.track("order.placed", "Order #1234", meta={"amount": 89.00})

Node.js

// Track payment events
nl.track('payment.completed', 'Payment received', { meta: { amount: 149.99 } });

// Track signups
nl.track('user.signup', 'New user registered');

// Track orders
nl.track('order.placed', 'Order #1234', { meta: { amount: 89.00 } });

When payment.completed stops arriving for longer than your baseline window — silence alert fires. NotiLens learns your normal frequency automatically. No threshold to configure.


2. Cron job monitoring — records processed, time taken, missed runs

The most common silent failure — job runs, processes nothing, exits clean, looks healthy.

NotiLens tracks records processed and time taken alongside the heartbeat. Zero records on a job that normally touches 500 is the alert. Job that didn't ping at all — also the alert.

Python

import time

import notilens

nl  = notilens.init(name="invoice-sync", token="TOKEN", secret="SECRET")
run = nl.task("nightly-sync")
run.start()
started = time.monotonic()

try:
    run.progress("Fetching invoices")
    records = process_invoices()            # your own job logic; returns a record count

    duration_ms = int((time.monotonic() - started) * 1000)
    run.metric("records", records)          # track records processed
    run.metric("duration_ms", duration_ms)  # track time taken (measured, not hardcoded)

    run.complete(f"Processed {records} invoices")
except Exception as e:
    run.fail(str(e))

Node.js

import { NotiLens } from '@notilens/notilens';

const nl  = NotiLens.init('invoice-sync', { token: 'TOKEN', secret: 'SECRET' });
const run = nl.task('nightly-sync');
run.start();
const started = Date.now();

try {
  run.progress('Fetching invoices');
  const records = await processInvoices();   // your own job logic; resolves to a record count

  run.metric('records', records);                   // track records processed
  run.metric('duration_ms', Date.now() - started);  // track time taken (measured, not hardcoded)

  run.complete(`Processed ${records} invoices`);
} catch (err) {
  run.fail(err.message);
}

NotiLens ML learns your expected record volume and duration. A run that took 4x longer than usual — even with the same record count — is the alert.


3. AI agent loop detection — iteration count, token burns, duration anomalies

Call run.loop() on every agent iteration. NotiLens ML learns how many iterations your agent normally runs and alerts when a run's iteration count climbs anomalously high without a run.complete() arriving.

Python

import notilens

nl  = notilens.init(
    name="research-agent",
    token="TOKEN",
    secret="SECRET",
    patch=True  # auto-instruments OpenAI, Anthropic, LangChain calls
)

run = nl.task("research")
run.start()

try:
    run.progress("Starting research")

    for i, step in enumerate(steps):
        run.loop(f"[{i+1}] Tool: {step.tool_name}")  # every iteration
        run.metric("tool_calls", 1)                   # accumulates

        result = agent.execute(step)
        run.metric("tokens", result.usage.total_tokens)
        run.metric("cost_usd", result.usage.cost)

    run.output_generated("Research complete")
    run.complete(f"Processed {len(steps)} steps")
except Exception as e:
    run.fail(str(e))

Node.js

import { NotiLens } from '@notilens/notilens';

const nl  = NotiLens.init('research-agent', { token: 'TOKEN', secret: 'SECRET' });
const run = nl.task('research');
run.start();

try {
  run.progress('Starting research');

  for (const [i, step] of steps.entries()) {
    run.loop(`[${i+1}] Tool: ${step.toolName}`);  // every iteration
    run.metric('tool_calls', 1);                   // accumulates

    const result = await agent.execute(step);
    run.metric('tokens', result.usage.totalTokens);
    run.metric('cost_usd', result.usage.cost);
  }

  run.outputGenerated('Research complete');
  run.complete(`Processed ${steps.length} steps`);
} catch (err) {
  run.fail(err.message);
}

When the alert fires you see: which agent, which task, current loop count vs baseline, token consumption vs baseline, last progress message before deviation started.


4. Agent stall detection — when agents pause on slow tools

For agents pausing on slow external APIs or tools:

Python

run.wait("Awaiting API response")
result = call_slow_external_api()   # if this stalls, Smart Silence Detection alerts
run.progress("API response received")

run.wait() is non-terminal — the run continues. NotiLens learns how long your agent normally spends between events and fires if the gap becomes anomalous.
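The idea of an anomalous gap can be illustrated without any SDK. The following is a generic sketch, not NotiLens's actual model: it compares the time since the last event with the median gap between earlier events. The name `is_stalled` and the factor of 5 are assumptions made for the example.

```python
from statistics import median


def is_stalled(event_times, now, factor=5.0):
    """True if the time since the last event dwarfs the typical gap between events.

    `event_times` are seconds (for example from time.monotonic()), oldest first.
    """
    if len(event_times) < 3:
        return False  # too little history to know what a normal gap looks like
    gaps = [later - earlier for earlier, later in zip(event_times, event_times[1:])]
    return (now - event_times[-1]) > factor * median(gaps)


times = [0.0, 2.0, 4.5, 6.0, 8.0]   # events arrived roughly every 2 seconds
print(is_stalled(times, now=9.0))   # False: 1 s since the last event
print(is_stalled(times, now=40.0))  # True: 32 s since the last event
```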


5. Quick alerts — no task context needed

For simple one-off alerts — disk space, deployment events, server health:

Python

nl.notify("disk.space.full", "Only 1GB left", level="warning")
nl.notify("deploy.done", "Deployed to production",
    open_url="https://dashboard.example.com/deploys"
)

Node.js

nl.notify('disk.space.full', 'Only 1GB left', { level: 'warning' });
nl.notify('deploy.done', 'Deployed to production', {
  openUrl: 'https://dashboard.example.com/deploys'
});


6. CLI — for shell scripts and bash pipelines

No code changes needed. Register once, use anywhere:

notilens init --name my-app --token YOUR_TOKEN --secret YOUR_SECRET

# Simple notifications
notilens notify order.placed "Order #1234" --name my-app
notilens notify disk.space.full "Only 1GB left" --name my-app --type warning

# Full task lifecycle
notilens start    --name my-app --task nightly-sync
notilens progress "Fetching records" --name my-app --task nightly-sync
notilens metric   records=461 --name my-app --task nightly-sync
notilens complete "Done" --name my-app --task nightly-sync


ML anomaly detection — no thresholds to configure

Most monitoring requires you to know what's wrong before you can alert on it. NotiLens ML anomaly detection learns your baseline automatically.

Your Wednesday signup rate. Your typical payment volume. Your agent's normal run count. Your cron's expected record output. No configuration needed.

What it catches that threshold alerts miss:

Spike detection — payment volume 3x above your Wednesday baseline at 2pm. Could be fraud. Could be a campaign. Either way, worth knowing immediately.

Drop detection — signups down 80% from your Tuesday morning baseline. Server up. No errors. Just unusually quiet.

Drift detection — API response time slowly climbing from 120ms to 250ms over two weeks. No single data point crosses a threshold. The trend is the signal.

Broken flow detection: payment.initiated normally reaches payment.completed within 3 minutes. When the gap extends, the alert fires before the window closes on recovery.
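Drift is the least intuitive of these, so here is a small illustration of why a trend can be a signal when no single point is. It is a plain least-squares slope over synthetic daily averages, offered as a sketch of the concept and not as the method NotiLens uses. The 300 ms threshold is hypothetical.

```python
def slope_per_day(daily_values):
    """Least-squares slope of evenly spaced daily measurements."""
    n = len(daily_values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Daily average latency climbing from 120 ms to 250 ms over 14 days (synthetic).
latencies = [120 + 10 * day for day in range(14)]
threshold_ms = 300                    # hypothetical static alert threshold
print(max(latencies) < threshold_ms)  # True: no single day trips the threshold
print(slope_per_day(latencies))       # 10.0: latency grows by 10 ms per day
```

A static threshold stays silent for the whole two weeks here, while the slope makes the steady climb obvious after only a few days of data.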

No YAML. No threshold tuning. Cold-start mode shows calibration progress so you know when the model is ready.


The detection gap calculation

Before you close this — run this for your own stack.

Take your most important revenue event. How many per hour during peak? What's your average transaction value? How long would a silent failure run before you'd notice?

events_per_hour × average_value × hours_until_detection = detection_gap_cost
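The same formula as a few lines of Python, using the figures from the earlier example (4 orders per hour at $80 each). The function name is just a label for this sketch; substitute your own numbers.

```python
def detection_gap_cost(events_per_hour: float, average_value: float,
                       hours_until_detection: float) -> float:
    """Revenue exposed while a failure goes unnoticed."""
    return events_per_hour * average_value * hours_until_detection


crash = detection_gap_cost(4, 80, 5 / 60)  # crash detected in 5 minutes
silent = detection_gap_cost(4, 80, 6)      # silent failure unnoticed for 6 hours

print(f"crash:  ${crash:,.2f}")   # crash:  $26.67
print(f"silent: ${silent:,.2f}")  # silent: $1,920.00
```

The only input that differs between the two calls is the time until detection, which is the point of the calculation.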

Most teams, when they run that number for the first time, immediately understand why detection time matters more than fix time.

The fix is always bounded. The detection gap is where revenue bleeds.


Full silent failure monitoring checklist

  • Business events tracked with nl.track() — payments, signups, orders
  • run.start() fires when any task begins
  • run.complete() fires on successful completion
  • run.fail() fires on any unhandled exception
  • run.loop() called on every agent iteration
  • run.metric("records", n) tracks output volume on cron jobs
  • run.metric("tokens", n) tracks token usage on AI agents
  • run.metric("duration_ms", n) tracks time taken
  • run.wait() fires when agent pauses on slow external calls
  • ML anomaly detection active — no thresholds needed
  • On-call routing configured for silence alerts
  • Tested — confirmed NotiLens detects when expected events stop arriving

Summary

Silent failures are the expensive failures that conventional monitoring was never built to catch. They don't throw errors. They don't affect uptime. They just quietly stop working while every dashboard stays green.

Detection time is where revenue bleeds. Shrinking the detection gap from 6 hours to 60 seconds doesn't change how fast you fix things. It changes how much there is to fix.

notilens.com