Is AI Getting Quietly Dumber? A 24/7 Benchmark That Catches LLM Degradation

You've probably hit this before — yesterday the AI felt sharp, fixed your bug without you even asking, and threw in a few extra cleanups along the way. Then today, same kind of problem, and suddenly it refuses to touch anything you didn't explicitly point at, or starts going in circles. And you start wondering "wait, is AI getting quietly dumber?"

A lot of people have this feeling but it's hard to pin down. Is the provider messing with the model? Is today's problem genuinely harder? Or is it all in your head? Without a number to point at, it just stays an urban legend.

So this post is about a site called AIStupidLevel — the name pretty much spells it out. It runs round-the-clock tests against the major AI models and turns "is AI getting dumber" into an actual curve you can look at.

Why would an AI "get dumber" in the first place?

A reasonable pushback: doesn't the model get trained once and then frozen? A few things are actually going on behind the scenes:

Silent updates. If you've worked with the API, you've probably seen version strings like gpt-4.0-20240924 — that trailing date hints at a specific build. But when you pick a model inside Codex or similar tools, you don't see that level of detail. Some models don't even have versioned IDs, just a generic gpt-4. So you can't actually tell which build you're hitting, and a silent swap is really hard to notice.
Quantization. To handle global peak traffic and save compute, providers sometimes compress the model.
Throttled compute. Once usage crosses a threshold, providers may cap how much compute each user can pull, which makes outputs feel worse.
Compute migration. When a new model is about to launch, providers sometimes shift compute away from the older model. From my own time hitting the API, this is the one I see most — the same prompt suddenly degrades overnight, then a few days later a stronger version drops, and shortly after the old version goes back to normal.

I covered quantization more thoroughly in Want to Run an LLM Yourself? Understanding Model Parameters and Quantization So You Stop Picking the Wrong Model — short version, it's a compression technique that saves resources but loses some detail.

So an AI model isn't actually a frozen thing. Providers can silently update it, quantize it, throttle it, or migrate compute away from it — and any of those will change what you experience. The hard part is, you usually can't tell.

So what is AIStupidLevel exactly?

AIStupidLevel is a third-party benchmark platform (open source, MIT) that continuously monitors whether AI models are regressing. Think of it like a system health check, except it's checking the "health" of AI models. It runs 24/7 against 21 production models from 7 providers — OpenAI, Anthropic, Google, xAI, DeepSeek, Kimi, GLM — and turns each model's current performance into a score on a dashboard.

It's not run by any AI company, which matters here. You don't want the people grading the models to also be selling them.

How does it actually test?

The core idea is simple:

Fixed question bank, run repeatedly. It maintains a fixed set of tasks, throws them at each model on a schedule, and logs the scores.
Same task N times. Because model outputs are stochastic, it runs each task 5 times and takes the median, plus a 95% confidence interval.

Four suites on a rotation

It doesn't just run one kind of test — it has four suites taking turns, each watching for different things:

Test Suite	Frequency	What it tests
Speed (coding)	Every 4 hours	147 coding problems, overall coding ability
Deep reasoning	Daily	5–7 turn multi-turn dialogues, checks long-conversation logic
Tool calling	Daily	Spins up a real Docker sandbox so the AI can actually run multi-step `execute-command` / `read-file` / `write-file` flows
Drift detection (canary)	Hourly	12 lightweight quick checks, first line of defense

The hourly canary plays sentinel — if something starts looking off, it sounds the alarm. The daily deep reasoning and tool-calling runs are the heavier full-body checkup.

The tool-calling suite is the one I find especially interesting. It actually spins up a Docker sandbox (think of it as an isolated mini-computer) and has the AI run real commands inside it, instead of just "verbally" claiming it can use tools. The results end up much closer to what you actually feel when using AI to write code.

Scoring isn't just right vs wrong

A single coding task gets scored across 9 weighted dimensions:

Dimension	Weight
Correctness	40%
Complexity	20%
Code Quality	15%
Stability	10%
Efficiency	5%
Edge Cases	3%
Debugging	3%
Format	2%
Safety	2%

Correctness clearly dominates, but even if the answer runs, you'll still lose points if the code is a mess, misses edge cases, or spews garbage formatting.

How does it catch the moment a model starts getting dumber?

Just having a score isn't enough — the score naturally bounces around because AI is non-deterministic. The real question is: is this drop a real regression, or just noise?

This is where the project gets technically interesting. It uses an algorithm called CUSUM — short for Cumulative Sum Control Chart. CUSUM didn't come from AI research; it's an old quality-control method from manufacturing. The idea is to keep accumulating the gap between observed performance and the baseline. Once the accumulated gap crosses a threshold, you call it: this isn't noise, something actually changed.

On top of CUSUM, it also runs statistical significance testing (checking whether the difference is statistically meaningful, p-value below 0.05) as a second pass, to keep false alarms down.

The real win: with this statistical machinery, a degradation can be detected within hours of starting, rather than waiting until people are venting on social media that "AI got dumber." The system has 29 warning categories built in for different anomaly patterns.

Reading the Stupid Meter

Each model has a live 0–100 score next to it — higher is better. Next to the score is a status tag telling you which of four states the model is in:

STABLE — performing normally
VOLATILE — jittering
DEGRADED — already worse
RECOVERING — climbing back up

Beyond the current score, it lays out each model's historical curve, so you can compare scenarios like "this model was rock-solid last week, why is it jumping around this week."

In practice it feels more like a stock-trading dashboard — you're not looking at a static ranking, you're checking "right at this moment, which model is worth using." If you're still wrestling with which AI tool to pick in the first place, my earlier post Which AI Coding Tool Should You Pick in 2026? pairs well with this one.

Smart Router: route around degraded models automatically

Beyond monitoring, AIStupidLevel also ships a feature called Smart Router, which is a pretty interesting extension of the project.

It's an OpenAI-compatible API endpoint, meaning code you wrote against OpenAI barely needs to change. You drop each provider's API key into it (stored with AES-256 encryption), point your base URL at it, and it routes each request to whichever model is currently in the best shape based on the live monitoring.

It offers six routing strategies — just set the model field in your API call to one of these and it picks dynamically:

auto-best — overall pick, whichever model has the best combined score right now
auto-coding — best at coding right now
auto-reasoning — strongest at reasoning
auto-creative — leans toward creative output
auto-cheapest — cheapest model above the quality bar
auto-fastest — fastest model above the quality bar

If you just pass auto, it uses whichever strategy you've set as your default. And if you pass a specific model name (like claude-opus-4-7), it pins that model directly and skips the router. So when a model quietly starts slipping and Smart Router catches it, traffic auto-routes to a sibling that's still healthy. Pretty useful if you actually want to wire AI into a product.

Wrap-up

So what is AIStupidLevel?

An independent, open source (MIT) third-party benchmark site monitoring 21 production AI models across 7 providers (OpenAI, Anthropic, Google, xAI, DeepSeek, Kimi, GLM), 24 hours a day
Method: fixed question bank, run repeatedly — each task runs 5 times, median plus 95% confidence interval, across four test suites on a rotation
Uses CUSUM change-point detection plus statistical testing to catch a model quietly degrading within hours
Ships Smart Router that auto-routes API traffic to whichever model is in the best shape based on live monitoring

Next time you feel like the AI suddenly got dumber, don't jump straight to blaming yourself — pop the dashboard open and you might actually find the evidence.

推荐订阅源

DEV Community