Failure numbers every programmer should know
Jonathon Belotti · 2026-05-14 · via Jonathon Belotti [thundergolfer]

Peter Norvig’s “Latency Numbers Every Programmer Should Know” are a classic in software engineering training. The original sixteen numbers represent, for programmers, the hard constraints of our hardware. In the early 2000s, if you cared about writing fast code, you knew that a disk seek cost about 10 milliseconds.

Latency numbers are for programmers who want their systems to be fast.

Failure numbers are for programmers who want their systems to be reliable.

| Thing | Type | MTTF (years) | AFR | Notes |
| --- | --- | --- | --- | --- |
| CPU failure | Hardware | ~1,700 | ~0.06% | Server CPUs very rarely fail outright. Intel IT measured a 0.06% CPU AFR across 223,050 CPUs in 207,956 HPC servers, which converts to an MTTF of roughly 1,700 years by the simple reciprocal math used here. [1] |
| Motherboard failure | Hardware | ~260 | ~0.38% | Motherboards are still rare failures, but less rare than CPUs. In the same Intel IT dataset, motherboards had a 0.38% AFR, or roughly 260 years MTTF by the same conversion. [1] |
| SSD failure | Hardware | ~100 | ~1% | Enterprise SSD field data is usually around or below 1% AFR at the headline level, with model, age, and write workload hiding underneath. Backblaze’s SSD boot-drive data is in this ballpark, though it is a much smaller SSD sample than its HDD fleet. [2] |
| HDD failure | Hardware | ~60 | ~1.5% | Backblaze’s 2025 fleet snapshot reports 1.36% annual AFR and 1.30% lifetime AFR across hundreds of thousands of drives. [3] Use 1-2% unless you know the specific drive model and age. |
| RAM uncorrectable error | Hardware | ~75 | ~1-4% | In Google’s DRAM study, 1.29% of machines per year had at least one uncorrectable error, with individual platforms reaching 4.15%. [4] One uncorrectable error typically means a machine shutdown and DIMM replacement. |
| AWS regional outage, non-us-east-1 | Service | ~4 | ~25% | Here a failure means a region-scale incident big enough to require application-level mitigation, not every status page blip. |
| AWS regional outage, us-east-1 | Service | ~2 | ~50% | us-east-1 deserves its own row because it is old, huge, and entangled with many AWS control planes. See the October 2025 AWS outage for the shape of one such event. |
| ElastiCache 50-node cluster failover rate | Service | ~0.2 (73 days) | ~500% | AWS documents node replacement and failover as normal ElastiCache operating behavior. [5] I use a 10-year per-node MTTF here, based on observations of our clusters at Modal; across 50 nodes that works out to a failover roughly every 73 days. This is a cluster-level operational rate, not a per-node failure rate. |
| NVIDIA A100 critical error [6] | Hardware | ~0.18 (65 days) | ~560% | Internal Modal fleet measurements. At this rate, a fleet of 1,000 A100s should expect about 15 critical GPU errors per day. |
| NVIDIA H100 critical error | Hardware | ~0.14 (50 days) | ~730% | Internal Modal fleet measurements. |
| Cloud VM unavailability | Service | ~20-100 | ~1-5% | Cloud providers publish availability SLAs, not clean per-VM failure rates. [7] For a single cloud VM, I use 1-5% as a rough annual chance that the VM needs recovery or replacement because the underlying host, network, or power failed underneath it. |
| Cloud VM disk loss | Service | ~500-1,000 | ~0.1-0.2% | AWS EBS gp2, gp3, io1, st1, and sc1 volumes are documented at 99.8-99.9% durability, which AWS also states as 0.1-0.2% annual failure rate. [8] io2 Block Express is a different class at 99.999% durability, or 0.001% AFR. |
| Production bug or defect | Software | ~0.001-0.005 (12h-2d) | ~20k-100k% | The most frequent failure mode is us. For active services deploying many times per day, DORA’s change fail rate and deployment rework rate turn into a daily rhythm of defects, hotfixes, and regressions. [9] |
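
The conversions behind the table can be sketched in a few lines of Python. This is a minimal sketch, not the author's code: the function names are hypothetical, the MTTF/AFR conversion is the simple reciprocal the notes describe, and the fleet and cluster helpers assume failures are independent across units. The last helper goes one step beyond the table by assuming a constant failure rate (an exponential model), which is my assumption and not something the source states.

```python
import math

DAYS_PER_YEAR = 365


def mttf_from_afr(afr: float) -> float:
    """MTTF in years from an annualized failure rate given as a fraction (0.01 == 1%)."""
    return 1.0 / afr


def afr_from_mttf(mttf_years: float) -> float:
    """Annualized failure rate as a fraction; values above 1.0 mean several failures per year."""
    return 1.0 / mttf_years


def group_mttf_years(unit_mttf_years: float, units: int) -> float:
    """Time between failures anywhere in a group of identical, independent units."""
    return unit_mttf_years / units


def fleet_failures_per_day(unit_mttf_years: float, units: int) -> float:
    """Expected failures per day across a fleet of identical, independent units."""
    return units / (unit_mttf_years * DAYS_PER_YEAR)


def prob_at_least_one_failure(mttf_years: float, years: float = 1.0) -> float:
    """Chance of one or more failures in a window, ASSUMING a constant failure rate."""
    return 1.0 - math.exp(-years / mttf_years)


if __name__ == "__main__":
    # CPU row: 0.06% AFR is about 1,667 years, which the table rounds to ~1,700.
    print(mttf_from_afr(0.0006))
    # ElastiCache row: 10-year per-node MTTF over 50 nodes is about 73 days.
    print(group_mttf_years(10, 50) * DAYS_PER_YEAR)
    # A100 row: 65-day MTTF across 1,000 GPUs is about 15.4 critical errors per day.
    print(fleet_failures_per_day(65 / DAYS_PER_YEAR, 1000))
    # us-east-1 row under the exponential assumption: about 0.39, not 0.50.
    print(prob_at_least_one_failure(2))
```

The last line is the reason to keep the two ideas separate: an AFR is an expected count of failures per year, so it can exceed 100%, while the probability of seeing at least one failure in a year never can. For small rates such as the 1% SSD row the two are nearly identical.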