

The causes of cloud outages are changing
David Linthicum · 2026-06-12 · via InfoWorld

The latest outage data show that the cloud’s operational complexity, process failures, and control-plane errors are overshadowing infrastructure failures.

For years, the cloud market has made a simple promise: Move workloads to large-scale platforms, gain better resilience, and worry less about downtime. That promise was never entirely wrong, but it is becoming less complete. The latest findings from Uptime Institute’s seventh Annual Outage Analysis suggest that the outage landscape is changing in ways that should concern both cloud providers and cloud customers. The biggest risks are no longer limited to broken physical infrastructure. They are increasingly tied to the complexity of the systems used to run, coordinate, update, and recover that infrastructure.

The most alarming number in the report is that IT and networking issues accounted for 23% of impactful outages in 2024. Uptime Institute attributes the rise in these outages to growing IT and network complexity; the long-term shift toward colocation, cloud, and third-party digital services; and the resulting increase in change-management failures and misconfigurations. That number is more than a statistical footnote. It points to a structural change in how outages happen and why cloud outages are becoming such a stubborn problem.

Hardware redundancy can protect against component failures, but it doesn’t help much when the outage stems from a bad configuration, an automation error, a faulty network change, or an underappreciated control-plane dependency. In those cases, the infrastructure itself may remain intact while the system that governs it breaks down. The industry is learning that resiliency is less about duplicating equipment and more about managing complexity. Today’s increasingly distributed and software-defined environments cannot operate safely at scale without that discipline.

Failures at the operational level

Uptime’s findings show that power remains the leading cause of major outages, underscoring that traditional infrastructure engineering still matters a great deal. But even as providers continue to improve physical resilience, outages can still arise from the digital and procedural layers above it. Cloud platforms are now dense stacks of services, APIs, orchestration systems, software-defined networks, identity controls, failover logic, and third-party dependencies. That complexity creates more possible points of interaction and more opportunities for an error in one layer to cascade into several others.

This helps explain why outages can feel more surprising today than they did a decade ago. In older data center models, an outage often had a more apparent root cause, such as a power event, a cooling failure, or a hardware fault. In cloud environments, the trigger may be a small configuration change that propagates across regions, a policy update that unintentionally blocks service communication, or a network control failure that affects seemingly unrelated services. These are not failures of raw infrastructure capacity. They are failures of complexity management.

The report’s language around change management and misconfiguration is especially important because it challenges one of the most common assumptions in the cloud market: that scale automatically produces better operational outcomes. The reality? Scale can magnify both strengths and weaknesses. Large cloud providers have more engineering talent, more sophisticated tools, and more redundancy than almost any enterprise customer. But they also run far more interconnected systems at far greater speeds with far more automation. A single process failure can have a wider blast radius.

Another important lesson from the Uptime analysis is that automation has not removed the human factor. If anything, it has changed its form. Even in highly automated environments, human error remains central to the problem. The report notes that in 2025, the share of outages caused by human failure to follow procedures rose by 10 percentage points compared with 2024. A related industry summary of the report notes that 58% of human error-related outages were caused by staff failing to follow established procedures.

That matters because cloud providers often position automation as the answer to reliability. Automation is essential, but it only works as well as the operational model that surrounds it. If teams deploy changes too quickly, rollback paths are weak, approval chains are bypassed, or procedures are incomplete, automation can accelerate failure rather than prevent it. In a modern cloud environment, a human mistake is rarely just a single keystroke. It is more often a design weakness in process, governance, testing, or accountability.
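To make the point about speed and rollback paths concrete, here is a minimal Python sketch of a staged rollout that checks health after every stage and reverts everything it has touched on the first failure. It is an illustration under stated assumptions, not a description of any provider's tooling: the function `staged_rollout`, the callbacks, the target names, and the simulated failure in `region-a` are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    rollback_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Apply a change one stage at a time.

    After each stage, every target in that stage is health-checked.
    On the first unhealthy result, all targets changed so far are
    rolled back in reverse order and the rollout stops.
    """
    applied: List[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            applied.append(target)
        if not all(is_healthy(target) for target in stage):
            for target in reversed(applied):
                rollback_change(target)
            return False
    return True


# Hypothetical environment: three targets, all running config "v1".
config = {"canary": "v1", "region-a": "v1", "region-b": "v1"}
touched: List[str] = []


def apply(target: str) -> None:
    config[target] = "v2"
    touched.append(target)


def rollback(target: str) -> None:
    config[target] = "v1"


def healthy(target: str) -> bool:
    # Assumption for the demo: the change breaks region-a only.
    return target != "region-a"


ok = staged_rollout([["canary"], ["region-a"], ["region-b"]], apply, rollback, healthy)
```

In this run the rollout stops at the second stage, so `region-b` is never changed and the two targets that were changed are restored. Without the per-stage gate, the same automation would have pushed the faulty change everywhere at once.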

This is also why customers should resist the comforting notion that outages are somebody else’s problem once workloads move to the cloud. Provider-side mistakes remain real, but customer architectures are increasingly entangled with provider networking, identity, observability, and platform services. When an outage occurs, the customer may not have caused it, but they still bear the business impact. The shared responsibility model does not end with security. It extends to resilience planning as well.

Better change management

The Uptime data points to a clear conclusion: Cloud providers need to treat operational discipline as a first-class design requirement. That starts with better change management. High-risk changes should be tested more aggressively, staged more gradually, and accompanied by stronger rollback mechanisms. Providers also need better dependency mapping to understand how a change in one control layer can affect services far beyond its immediate scope. If the system is too complex to clearly explain, it is too complex to operate.
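Dependency mapping can start small. The Python sketch below assumes a hand-maintained map from each service to the services it depends on, then walks that map in reverse to list everything a change could reach. The service names and the `blast_radius` helper are hypothetical, and a real platform would derive the map from its own service catalog rather than a literal dictionary.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout": ["api-gateway", "identity"],
    "api-gateway": ["identity", "sdn-control"],
    "identity": ["config-store"],
    "sdn-control": ["config-store"],
    "reporting": ["object-storage"],
    "config-store": [],
    "object-storage": [],
}


def blast_radius(depends_on: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the map: dependency -> services that rely on it.
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted map; `seen` also guards against cycles.
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


config_store_impact = blast_radius(DEPENDS_ON, "config-store")
object_storage_impact = blast_radius(DEPENDS_ON, "object-storage")
```

Here a change to the hypothetical `config-store` reaches four services through two intermediate layers, while a change to `object-storage` reaches only `reporting`. That difference is the kind of information a change-approval process needs before deciding how gradually to stage a rollout.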

Providers also need to improve procedural quality. The rise in outages caused by failing to follow procedures suggests that procedures are being ignored under operational pressure or that they are too cumbersome, outdated, or unclear for real production conditions. Neither explanation is comforting. Stronger runbooks, better training, more realistic failure drills, and tighter operational guardrails are not glamorous investments, but they are increasingly central to resilience.

Another pressure point is visibility. Uptime notes that software-based and distributed resiliency tools can improve availability, but they also introduce new risks and complicate root-cause analysis. Cloud providers need more transparent and faster incident diagnosis, not just more layers of abstraction. Customers cannot build trust in resilience if every major incident becomes a long exercise in reconstructing opaque service dependencies after the fact.

Design with outages in mind

What’s the financial impact of these problems? Uptime’s 2024 analysis found that 54% of respondents reported that their most recent significant outage cost more than $100,000, and 20% said it cost more than $1 million. These are not edge-case losses. They show that outages remain costly even if they are less frequent than in earlier years.

Customers need to stop evaluating cloud resilience through uptime promises and start evaluating it through failure behavior. How does a provider isolate faults? How transparent is incident communication? How portable are workloads if a major service degrades? How dependent is the architecture on a single region, network path, identity service, or control plane? These are not just technical questions; they are now critical business questions.
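The last of those questions can be checked mechanically on the customer side. The Python sketch below assumes a simple workload inventory and flags any dimension in which every workload shares the same value. The inventory, the field names, and the `single_points_of_dependency` helper are hypothetical and stand in for whatever asset data a team actually keeps.

```python
from typing import Dict, List, Sequence

# Hypothetical inventory of workloads and the platform pieces each one relies on.
WORKLOADS: List[Dict[str, str]] = [
    {"name": "web", "region": "us-east", "identity": "idp-1", "control_plane": "cp-a"},
    {"name": "api", "region": "eu-west", "identity": "idp-1", "control_plane": "cp-a"},
    {"name": "batch", "region": "us-east", "identity": "idp-1", "control_plane": "cp-b"},
]


def single_points_of_dependency(
    workloads: Sequence[Dict[str, str]], dimensions: Sequence[str]
) -> Dict[str, str]:
    """Return each dimension where all workloads depend on one shared value."""
    findings: Dict[str, str] = {}
    for dimension in dimensions:
        values = {workload[dimension] for workload in workloads}
        if len(values) == 1:
            findings[dimension] = next(iter(values))
    return findings


findings = single_points_of_dependency(
    WORKLOADS, ["region", "identity", "control_plane"]
)
```

In this made-up inventory the workloads are spread across two regions and two control planes, but all of them authenticate through the same identity service, so that is the dependency an outage would hit hardest.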

The core lesson from Uptime’s data is simple. Outages are becoming a bigger problem for cloud providers and customers because the cloud’s biggest vulnerabilities are increasingly tied to complexity, process failures, and control-plane mistakes, not just broken infrastructure. In addition to adding redundancy, the next phase of cloud improvement will focus on building systems that are easier to understand, safer to change, and more disciplined to operate.