Detecting API anomalies behind a 200 OK — with statistics, not AI

Most uptime monitors answer one question: is it up or down? But some of the worst incidents I've dealt with returned a perfectly happy 200 OK:

an endpoint that started serving a cached error page
a JSON API returning {"error": ...} with status 200
a response that quietly got 10× slower
a payload that dropped from 14 KB to 800 bytes because a backend started returning empty results

A plain up/down check sails straight past all of these. I wanted my monitor to notice "it's up, but it's wrong." Here's how I built that, and why I deliberately didn't reach for machine learning (or the word "AI").

THE TEMPTATION, AND WHY I SKIPPED IT

The buzzword move is "AI-powered anomaly detection." But for per-endpoint metrics, ML is mostly overkill: you need training data, the model is opaque, and it's hard to explain why something fired. Plain statistics are simpler, cheaper, deterministic, and — importantly — explainable. So that's what I used.

ONE BASELINE PER ENDPOINT

The key decision: every endpoint is its own baseline. A CDN-cached 2 KB JSON response and a 500 KB HTML page have nothing in common, so a global threshold is meaningless. I track two signals per endpoint:

response size (bytes)
response time (ms)

For each, I keep a rolling baseline and ask: is the latest value weird for this endpoint?

THE MATH: ROLLING MEAN, STD, AND 3Σ

Standard stuff — flag a value when it's more than three standard deviations from the mean:

|value − mean| > 3 · σ

The trick is computing it cheaply. I don't want to load an endpoint's entire history on every check. Mean and variance only need three running aggregates — count, sum, and sum of squares — which is a single SQL query:

SELECT COUNT(*) AS n,
       SUM(value) AS s,
       SUM(value * value) AS q
FROM signal
WHERE endpoint_id = ?
  AND created_at >= ?;  -- rolling window

Then:

mean = s / n
# population variance = E[x²] − (E[x])²
variance = max(0.0, q / n - mean * mean)
std = variance ** 0.5

No history transfer, no model, just three numbers.
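To make the aggregate trick concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (endpoint_id, created_at, value), the integer timestamps, and the helper name rolling_stats are assumptions for illustration, not the schema this post's monitor actually uses.

```python
import math
import sqlite3


def rolling_stats(conn, endpoint_id, window_start):
    """Return (n, mean, std) for one endpoint from three SQL aggregates."""
    n, s, q = conn.execute(
        "SELECT COUNT(*), SUM(value), SUM(value * value) "
        "FROM signal WHERE endpoint_id = ? AND created_at >= ?",
        (endpoint_id, window_start),
    ).fetchone()
    if not n:  # empty window: SUM() returns NULL, so there is no baseline
        return 0, None, None
    mean = s / n
    # Population variance E[x^2] - (E[x])^2, clamped at zero because
    # floating-point rounding can push the difference slightly negative.
    variance = max(0.0, q / n - mean * mean)
    return n, mean, math.sqrt(variance)


# Tiny in-memory demo with made-up response times in ms (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signal (endpoint_id INTEGER, created_at INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO signal VALUES (?, ?, ?)",
    [(1, t, v) for t, v in enumerate([100.0, 110.0, 90.0, 100.0])],
)
n, mean, std = rolling_stats(conn, 1, 0)
print(n, mean, round(std, 3))
```

One caveat on the design: the sum-of-squares formula loses precision when the mean is very large relative to the spread, which is part of why the clamp to zero is there.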

THE GUARDRAILS (WHERE MOST OF THE REAL WORK IS)

Raw 3σ is noisy. The interesting part is stopping false positives:

Floors for stable endpoints. A very stable endpoint has σ ≈ 0, so any tiny wobble is "> 3σ" and you'd page yourself constantly. So the threshold is the max of three things:

threshold = max(3 * std,           # statistical
                rel_floor * mean,  # e.g. +50% for size
                abs_floor)         # e.g. +500 ms, an absolute minimum

A change has to be statistically and practically significant.
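A quick worked example shows which term wins in practice. The numbers below are invented for illustration, and pairing a 50% relative floor with a 500 ms absolute floor on the same signal is an assumption; the helper name threshold is hypothetical too.

```python
def threshold(mean, std, rel_floor, abs_floor):
    """Largest of the statistical bound and the two practical floors."""
    return max(3 * std, rel_floor * mean, abs_floor)


# Very stable endpoint: mean 120 ms, std 0.4 ms.
# 3 * std is only about 1.2 ms, so the absolute floor takes over.
stable = threshold(mean=120.0, std=0.4, rel_floor=0.5, abs_floor=500.0)

# Noisy endpoint: mean 900 ms, std 400 ms.
# Here 3 * std = 1200 ms is the largest term, so the statistics decide.
noisy = threshold(mean=900.0, std=400.0, rel_floor=0.5, abs_floor=500.0)

print(stable, noisy)
```

On the stable endpoint a 5 ms wobble is far beyond 3σ but nowhere near the 500 ms floor, so it stays quiet.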

Direction matters. For size, I flag both directions — a bloated response and a truncated/empty one are both bad. For latency, only slower counts.

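def is_anomalous(value, mean, std, *, rel_floor, abs_floor, both_directions):
    # The deviation must clear the statistical bound and both practical floors.
    threshold = max(3 * std, rel_floor * mean, abs_floor)
    delta = value - mean
    # Size is checked in both directions; latency only when it gets slower.
    flagged = abs(delta) > threshold if both_directions else delta > threshold
    return flagged, ("up" if delta > 0 else "down")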

No flapping. One weird check isn't an incident. I require two anomalous checks in a row, in the same direction, before alerting.

Warm-up guard. Below a minimum sample count (I use ~50), I don't alert at all: there's no trustworthy baseline yet.
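The last two guards can be sketched as a small piece of per-endpoint state. This is one possible shape under stated assumptions, not the monitor's actual code: the names StreakState and should_alert are hypothetical, and treating a warm-up check as a streak reset is my own choice.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 50      # warm-up guard (the text uses ~50)
REQUIRED_STREAK = 2   # anomalous checks in a row, same direction


@dataclass
class StreakState:
    direction: Optional[str] = None
    count: int = 0


def should_alert(state, n_samples, flagged, direction):
    """Update the streak for one check and report whether to alert."""
    if n_samples < MIN_SAMPLES or not flagged:
        # No trustworthy baseline yet, or a normal check: reset the streak.
        state.direction, state.count = None, 0
        return False
    if direction == state.direction:
        state.count += 1
    else:
        # A flip in direction starts a new streak instead of extending it.
        state.direction, state.count = direction, 1
    return state.count >= REQUIRED_STREAK


state = StreakState()
checks = [(True, "down"), (True, "up"), (True, "up"), (False, "up"), (True, "up")]
results = [should_alert(state, 200, f, d) for f, d in checks]
print(results)
```

As written, the function keeps returning True for every further anomalous check in the same streak; suppressing repeat notifications would be a separate step.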

Together these turn a noisy 3σ trigger into something that only fires when an endpoint genuinely behaves unlike itself.

SO WHERE DOES THE "AI" COME IN?

Here's the line I care about: detection is statistics; AI only explains.

Once the math flags something, I hand the numbers to a small LLM call to turn this:

payload size dropped from ~14 KB to ~800 B (−94%), 2 checks in a row

into this:

"This endpoint is likely returning an error or empty payload instead of its usual response — the body shrank by ~94% while still answering 200 OK."

The model writes the human sentence. It does not decide what's anomalous. I refuse to market a 3σ threshold as machine learning. Detection = math, explanation = language. Calling the whole thing "AI anomaly detection" would be a lie about which part is which.
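A minimal sketch of that hand-off, assuming the numbers are first rendered into a plain fact line and then embedded in a prompt. The helper names, the decimal KB rounding, and the prompt wording are all illustrative assumptions; the actual LLM call is deliberately left out because the post does not say which provider or API is used.

```python
def human_bytes(n):
    """Rough size label, e.g. 14000 -> '~14 KB' (decimal units assumed)."""
    if n >= 1000:
        return f"~{n / 1000:.0f} KB"
    return f"~{n:.0f} B"


def size_finding(baseline, latest, streak):
    """Render a size anomaly as one plain fact line; all numbers come from the math."""
    change = (latest - baseline) / baseline * 100
    verb = "dropped" if latest < baseline else "grew"
    return (
        f"payload size {verb} from {human_bytes(baseline)} to "
        f"{human_bytes(latest)} ({change:+.0f}%), {streak} checks in a row"
    )


finding = size_finding(14000, 800, 2)

# The model only sees a decision that has already been made.
prompt = (
    "Explain this monitoring finding in one sentence for an on-call engineer. "
    "The endpoint still answered 200 OK. Finding: " + finding
)
print(finding)
```

Keeping the fact line deterministic means the alert is still readable if the explanation step fails or is switched off.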

TAKEAWAY

You don't need ML to catch "it's up but it's broken." A per-endpoint rolling baseline, a max(3σ, relative floor, absolute floor) threshold, a direction rule, and a two-in-a-row guard get you surprisingly far, and every alert stays fully explainable, which beats a black box when you're staring at it at 3 a.m.

This runs in PingMon (pingmon.de), the uptime monitor I'm building, but the technique is general — you can bolt it onto anything with a metric history. Happy to go deeper on the windowing or the per-tier cost controls in the comments.

— Dario
