惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

F
Full Disclosure
WordPress大学
WordPress大学
小众软件
小众软件
Cloudbric
Cloudbric
AWS News Blog
AWS News Blog
腾讯CDC
量子位
人人都是产品经理
人人都是产品经理
大猫的无限游戏
大猫的无限游戏
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
V
Vulnerabilities – Threatpost
Scott Helme
Scott Helme
Hugging Face - Blog
Hugging Face - Blog
博客园_首页
C
CXSECURITY Database RSS Feed - CXSecurity.com
The Hacker News
The Hacker News
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
IT之家
IT之家
Jina AI
Jina AI
Attack and Defense Labs
Attack and Defense Labs
S
SegmentFault 最新的问题
Simon Willison's Weblog
Simon Willison's Weblog
The Cloudflare Blog
阮一峰的网络日志
阮一峰的网络日志
T
Tailwind CSS Blog
Last Week in AI
Last Week in AI
博客园 - 【当耐特】
Google Online Security Blog
Google Online Security Blog
美团技术团队
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
V
Visual Studio Blog
罗磊的独立博客
L
LINUX DO - 最新话题
博客园 - Franky
博客园 - 叶小钗
Apple Machine Learning Research
Apple Machine Learning Research
The Last Watchdog
The Last Watchdog
J
Java Code Geeks
AI
AI
C
Cisco Blogs
酷 壳 – CoolShell
酷 壳 – CoolShell
C
Cyber Attacks, Cyber Crime and Cyber Security
Cisco Talos Blog
Cisco Talos Blog
博客园 - 三生石上(FineUI控件)
雷峰网
雷峰网
Help Net Security
Help Net Security
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
云风的 BLOG
云风的 BLOG
I
Intezer
S
Securelist

Anže's Blog

The 15-Year-Old iptables Rule That Broke My DNS Letting Claude Upgrade My Raspberry Pi Agents Day Lisbon DjangoCon Europe 2026 How to Safely Update Your Dependencies Speeding Up Django Startup Times with Lazy Imports Typing Your Django Project in 2026 Claude Fixes User Bug Jekyll to Hugo Migration Advent of Code 2025 🎄 Django bulk_update Memory Issue Migrating Gunicorn to Granian Disable Network Requests When Running Pytest Disable Runserver Warning in Django 5.2 Autogenerating og:images with Jekyll Power Outages and Gunicorn PID Files UV with Django Go-like Error Handling Makes No Sense in JavaScript or Python Packages Do Not Match the Hashes Pip Error Gotchas with SQLite in Production Fedidevs Dev Update #2 Django SQLite Production Config Django Streaming HTTP Responses Deploying a Django Project to My Raspberry Pi (Video) Thoughts on Code Reviews Django SQLite Benchmark Django, SQLite, and the Database Is Locked Error No Downtime Deployments with Gunicorn SQLite Write-Ahead Logging Writing a Pytest Plugin Fedidevs Dev Update #1 Django-TUI: A Text User Interface for Django Commands Automate Hatch Publish with GitHub Actions Words TUI: App for Daily Writing Textual App Auto Reload RDS Blue/Green Deployments Fly.io Certificate Renewal Using Testing Library with Selenium in Python The Fastest Way to Build a Read-only JSON API import __hello__ Enum with `str` or `int` Mixin Breaking Change in Python 3.11 Your Code Doesn't Have to Be Perfect Fixing _SixMetaPathImporter.find_spec() Not Found Warnings in Python 3.10 Upgrading Django App to Python 3.10 Integer Overflow Error in a Python Application Python Dependency Management MySQL Performance Degradation in Django 3.1 New Features in Python 3.8 and 3.9 The Code Review Batch Size The Code Review Bottleneck
Fedidevs 9h Outage Postmortem
Anže Pečar · 2026-05-13 · via Anže's Blog
13 May 2026

Today I had almost 9 hours of downtime on fedidevs.com and some of my other sites that I run on a Raspberry Pi at home. The alert came in just as I was heading to bed and I didn’t see it until I woke up this morning 🫣

Since Jake on Mastodon asked for a Cloudflare-style postmortem, here it is:

Incident report: ~9h loss of upstream connectivity on the raspberrypi host

Date: 2026-05-12 / 2026-05-13

Duration: 8h 53m

Impact window: 2026-05-12 23:00 UTC → 2026-05-13 07:53 UTC

Severity: SEV-3 (single-host, no public traffic served during the window because outbound DNS was the failure mode)

We’d like to acknowledge the disruption this caused — for everyone who depends on a Raspberry Pi in a flat in Lisbon, this was an unacceptable outage. We are sorry, and we are taking steps to make sure this does not recur in the same form.

What happened

At 23:00:09 UTC on 2026-05-12 (00:00 local time), the raspberrypi host lost the ability to send packets beyond its default gateway (192.168.86.1). DNS queries against the gateway began timing out immediately. The WiFi radio remained associated to the access point throughout the entire incident — cfg80211 reported no deauthentication, reassociation, or carrier-loss events. From the kernel’s perspective, the link was healthy. From every userspace service’s perspective, the internet had ceased to exist.

The host stayed in this state for 8 hours and 53 minutes, until a human operator (a single human, who was asleep) walked over and pulled power at 07:53 UTC.

Background

The raspberrypi host is a Raspberry Pi 5 (Debian Bookworm, kernel 6.12.34) connected over WiFi to a home router. It runs a small fleet of services, including:

  • A gunicorn-served applications (fedidevs and others), with a Celery worker.
  • PostgreSQL 15, Redis, nginx.
  • A New Relic infrastructure agent for observability.

The previous boot had been running for 349 days continuously. We mention this because it is relevant to the size of the gap between “we have monitoring” and “we have monitoring that would have noticed.”

Timeline (UTC)

TimeEvent
2026-05-12 23:00:09First DNS timeout observed: ipster fails to resolve api.cloudflare.com against 192.168.86.1:53 (i/o timeout).
2026-05-12 23:00:09 → 07:53:00Continuous, identical failure mode across every service that initiates outbound traffic. New Relic accumulates a 530,000+ event backlog. Gunicorn’s OTLP exporter queues ~129 retries. WiFi stays associated.
2026-05-13 07:53:00Operator initiates hard reboot.
2026-05-13 07:53:50Host comes back up. Connectivity restored on first attempt.
2026-05-13 08:42:00Operator opens a session and asks the on-call AI what to do about it.
2026-05-13 08:48:43Recovery automation (net-watchdog.timer) deployed and enabled.

Root cause

We cannot prove the upstream trigger from the host’s logs alone — the journal volume from the affected window contains no kernel WiFi events — but the signature is consistent and well-known. At exactly 00:00 local time, the gateway almost certainly performed a scheduled action (reboot, firmware update, or DHCP lease housekeeping). When it returned to service, the Pi’s WiFi stack remained associated to the BSSID at the radio layer but did not re-establish a working data path. NetworkManager did not observe a carrier event and therefore had no signal to act on. The connection’s connectivity check ran in the background and may well have flipped to limited, but no action is taken on that signal by default.

In short: the link was up, the route was installed, the radio was happy, and no packets came back.

Detection

There was none. The incident was detected by a human noticing the next morning that pages did not load. There was no alert, no automated check, and no log-based watchdog that would have escalated.

Remediation

We have shipped a connectivity watchdog (net-watchdog, on a 2-minute systemd timer) that performs an active reachability check against three upstream anycast addresses (1.1.1.1, 8.8.8.8, 9.9.9.9) bound to wlan0. On consecutive failures it executes an escalation ladder:

  1. Observe (one failure can be a packet drop).
  2. nmcli device reapply wlan0.
  3. nmcli device disconnect && connect wlan0.
  4. Restart NetworkManager.
  5. Bounce the wlan0 interface at the link layer.
  6. Reboot, as a last resort, after ~24 minutes of confirmed unreachability.

All actions log to the journal under the net-watchdog tag.

Worst-case time-to-recover under this design is ~24 minutes, down from “however long until a human notices.” We consider this acceptable for the operating environment (a home server, one human, no SLA), but it is not zero, and we will continue to look for ways to reduce it.

What we’re still not doing

  • We are not yet alerting off-host when the watchdog escalates. If the recovery itself fails (e.g., the WiFi driver is wedged at the kernel level and an ip link bounce doesn’t help), the only signal will be that the host stops responding entirely — which is, again, “human notices the next morning.”
  • We have not addressed the upstream cause. The router presumably will do whatever it did again. The Pi must remain resilient to it.
  • We have no second physical link (no ethernet). The watchdog cannot route around a failed radio.

We do not, at this time, plan a global incident review.

— The Raspberry Pi reliability team (n=1, including the AI)