
Lessons in Trust From us-east-1
Corey Quinn · 2021-12-15 · via Comments for Last Week in AWS


AWS published its analysis of last week’s us-east-1 outage, and it raises more questions than it answers. I understand that they wanted to get it out when they did (late on a Friday during one of the worst cybersecurity flaps in years) to avoid excessive attention. But having read it, I’m not convinced that AWS itself truly understands the outage, or what we as customers should do about it.

Some of the concerns are relatively banal. Services that were mentioned in the write-up as being impacted were green throughout on the status page during and after the incident. AWS doesn’t appear to understand every aspect of what went wrong, but they know it was triggered by an internal networking autoscaling event. They explicitly state as a result that they’re going to stop autoscaling their internal network until that’s corrected. So on the one hand, good for them on being responsible. On the other … damn.

As we unfortunately discovered last week, you cannot have a multi-region failover strategy on AWS that features AWS’s us-east-1 region. Too many things apparently single-track through that region for you to be able to count on anything other than total control-plane failure when that region experiences a significant event. A clear example of this is Route 53’s impairment: “Route 53 APIs were impaired from 7:30 AM PST until 2:30 PM PST preventing customers from making changes to their DNS entries, but existing DNS entries and answers to DNS queries were not impacted during this event.” Read another way, “we didn’t violate our SLA, but if you were using Route 53 for DNS, you could make no changes to where traffic was directed for seven hours.” As of this writing, Amazon.com itself doesn’t use Route 53 for its public DNS, choosing instead to use both UltraDNS and Oracle’s Dyn. Yes, the same Oracle they castigate on stage from time to time. I’ve yet to hear of a single disaster recovery plan that would survive intact if you could make no DNS changes during an event.

“Don’t use Route 53 for public records” is an unfortunate takeaway we’re left with from this experience.

Let’s also address the giant problem in the room that exists in the form of AWS SSO, or “Single Sign On.” The “Single” in the name is a heck of a clue; you can configure it in exactly one region. From their documentation comes this gem: “AWS Organizations only supports one AWS SSO Region at a time. If you want to make AWS SSO available in a different Region, you must first delete your current AWS SSO configuration. Switching to a different Region also changes the URL for the user portal.”

To frame that slightly differently, if there’s an outage in the region that contains your SSO configuration, you’d better have another way into the account if you’d like to do anything in your cloud environment.

Liz Fong-Jones from Honeycomb reported significant issues with KMS and SSM during the event; the AWS analysis makes no mention of these services being impacted. Liz is far from the only person who noticed degradation of these services (she just happens to be one of the folks I trust implicitly when it comes to understanding what’s broken!), so I don’t believe that this is some sort of fever-dream or a weird expression of one company’s software architecture. I’m left with the unfortunate reality that AWS either does not know about or does not disclose all of its various service interdependencies.

DynamoDB and S3 gateway endpoints in subnets were impacted; some folks had to resort to using the dreaded Managed NAT Gateways with their 4.5 cents per gigabyte data processing fee. For some folks, this resulted in significant cost. If you were one of those customers, reach out to your AWS Account Manager for a concession for these charges. You shouldn’t have to eat fees that you paid to work around a service degradation. If they say no, please let me know; I’d be very interested to hear how customer obsession plays out in the wake of this mess.
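
To put a rough number on that workaround, here is a minimal sketch of the arithmetic. The only figure taken from above is the 4.5 cents per gigabyte data processing fee; the function name and the 50 TB traffic volume are hypothetical, chosen purely for illustration, and the sketch ignores hourly gateway charges and any other line items on the bill.

```python
# Rate quoted in the article: 4.5 cents per gigabyte processed by a Managed NAT Gateway.
NAT_PROCESSING_USD_PER_GB = 0.045


def nat_processing_cost(gigabytes: float,
                        rate_usd_per_gb: float = NAT_PROCESSING_USD_PER_GB) -> float:
    """Return the data processing charge in USD for traffic sent through a NAT gateway."""
    if gigabytes < 0:
        raise ValueError("gigabytes must be non-negative")
    return gigabytes * rate_usd_per_gb


if __name__ == "__main__":
    # Hypothetical volume: 50 TB of DynamoDB/S3 traffic rerouted during the incident.
    rerouted_gb = 50 * 1000
    print(f"Estimated NAT processing charge: ${nat_processing_cost(rerouted_gb):,.2f}")
```

Swap in your own rerouted volume from your billing data before you talk to your account manager; the point is only that the fee scales linearly with every gigabyte you were forced to push through the gateway.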

I’ve previously said that before you go multi-cloud you should go multi-region. I stand by that advice. However, it sure would be swell if AWS didn’t soak customers with ridiculous data transfer fees to move data between regions as well as between availability zones within the same region. To review: data from the internet into AWS is free; moving data between availability zones and regions starts at 2 cents per gigabyte and increases significantly from there. Data to the internet from AWS is significantly more expensive. Viewed through this lens, AWS’s exhortations to build applications across regions and availability zones are less an encouragement to ensure application durability and more of a ham-fisted sales pitch. Unfortunately there’s really no lesson to take from this; we’re stuck with the understanding that there is always a trade-off between cost and durability, and AWS is going to milk customers like cows to achieve significant reliability.
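
The trade-off is easy to sketch as a lower bound. The snippet below uses only the “starts at 2 cents per gigabyte” figure from above as a floor; the function name, the 500 GB per day replication volume, and the 30-day month are hypothetical assumptions, and actual rates depend on which regions or availability zones are involved.

```python
# Floor rate quoted in the article for moving data between AZs or regions.
# Treat this as a lower bound; the article notes rates increase from here.
TRANSFER_FLOOR_USD_PER_GB = 0.02


def monthly_replication_cost_floor(gigabytes_per_day: float,
                                   days: int = 30,
                                   rate_usd_per_gb: float = TRANSFER_FLOOR_USD_PER_GB) -> float:
    """Return a lower-bound monthly charge in USD for replicating data across regions or AZs."""
    if gigabytes_per_day < 0 or days < 0:
        raise ValueError("volume and days must be non-negative")
    return gigabytes_per_day * days * rate_usd_per_gb


if __name__ == "__main__":
    # Hypothetical workload: 500 GB of changes replicated to a second region every day.
    daily_gb = 500
    floor = monthly_replication_cost_floor(daily_gb)
    print(f"Replication cost floor: ${floor:,.2f} per month")
```

This is the cost side of the cost-versus-durability trade-off in its most optimistic form: the bill for staying in sync with a second region before any compute or storage in that region is counted.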

To be very clear on my position here: AWS does a hell of a better job than you or I will running our own infrastructure. They’re fanatical about reliability and protecting it. But there’s something about this outage and its analysis that really, really rubs me the wrong way. Trust is everything when it comes to cloud providers, and there’s frankly enough wrong with Amazon’s public outage analysis to make me question exactly how far it can be trusted.


by Corey Quinn

Corey is the Chief Cloud Economist at Duckbill, where he specializes in helping companies improve their AWS bills by making them smaller and less horrifying. He also hosts the "Screaming in the Cloud" and "AWS Morning Brief" podcasts; and curates "Last Week in AWS," a weekly newsletter summarizing the latest in AWS news, blogs, and tools, sprinkled with snark and thoughtful analysis in roughly equal measure.
