























Cloud outages drove headlines in 2025, with disruptions across major providers and hundreds of millions in estimated losses. But the havoc did not stem only from the causes many enterprise and industrial IT leaders expected. In several high-profile incidents, the underlying infrastructure remained fully functional.
Power systems were stable. Compute and storage capacity was available. Networks were up. Yet critical services still went down.
Across multiple industry analyses, a pattern has emerged: Failures increasingly originate not in the data plane — where workloads run — but in the control and management layers that coordinate, authenticate, configure and orchestrate systems at scale.
According to Uptime Institute's 7th Annual Outage Analysis, IT and networking outages increased in 2024, accounting for 23% of impactful outages, reflecting increased IT and network complexity that led to issues with change management and misconfigurations. This represents a fundamental shift in the outage landscape, one that hardware redundancy cannot address: Infrastructure didn't fail, control did.
Industry analysts are drawing the same conclusion. The 2024 Gartner report "9 Principles for Improving Cloud Resilience" noted that control plane failures can prevent operators from executing remedial actions even when data-plane traffic is still flowing, blocking provisioning, configuration changes and automated recovery actions at the very moment they are needed most. In these scenarios, resilience depends less on redundant infrastructure and more on prebuilt contingency plans and tested operational procedures.
Modern cloud and distributed environments depend on control planes. These are centralized or semi-centralized systems that handle orchestration, policy enforcement, identity, routing and lifecycle management. These layers act as the operational "brain" of digital infrastructure.
Over time, these control systems have become more automated, more feature-rich and more centralized. That improves efficiency, but it also increases risk. When a control plane misconfigures resources or becomes unavailable, the impact can extend across regions, sites and services simultaneously.
For years, resilience strategy focused on redundancy: duplicate servers, replicated storage and distributed clusters. These measures protect execution capacity. However, they do not guarantee operational continuity when orchestration and management layers fail.
When control systems are impaired, organizations may encounter the following:
Applications may continue running, but they cannot be reached.
Systems remain healthy, but they cannot be reconfigured.
Identity and access services are online but unusable.
Automation pipelines propagate errors faster than teams can respond.
For industrial and enterprise operators, this creates a dangerous illusion of availability without operability. It's comparable to a production facility with fully functional machinery but no control system to coordinate operations.
The stakes will only rise as environments become increasingly software-defined, more complex and more automated, while still depending heavily on humans to avoid mistakes. Outage analyses across the industry continue to show that process breakdowns and human error remain major contributors, especially during change events. It's no wonder; operational teams now manage hybrid estates spanning cloud, edge, on-premises and third-party platforms, which are often connected through layered automation and policy engines. Each added integration point increases coupling and reduces transparency. At the same time, enterprises are pushing faster release cycles, more infrastructure as code and broader automation. These are all positive trends, but they require stronger guardrails and validation.
The result is a risk multiplier: higher system complexity, combined with faster change velocity and centralized control authority.
For industrial and enterprise operators, outages are not just digital events; they are operational events. Downtime can halt production lines, interrupt field operations, delay logistics, disrupt communications or affect safety systems.
These environments cannot rely solely on remote or centralized recovery. They require architectures that can sustain safe, predictable operation even when upstream control systems are degraded.
That requires designing for operational independence, not just availability.
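One way to picture operational independence is a site-level agent that keeps running on its last known good configuration when the upstream control plane cannot be reached. The sketch below is illustrative only: `SiteAgent`, `ControlPlaneUnavailable` and the `fetch_config` callable are hypothetical names, not part of any real product or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) fetcher when the upstream control plane is unreachable."""


@dataclass
class SiteAgent:
    """Hypothetical site-level agent that survives loss of the upstream control plane."""

    fetch_config: Callable[[], dict]        # assumed call into the central control plane
    last_known_good: Optional[dict] = None  # locally cached copy of the last valid config
    degraded: bool = False                  # True while running on cached config

    def refresh(self) -> dict:
        """Return the config to run with, falling back to the local cache on failure."""
        try:
            config = self.fetch_config()
        except ControlPlaneUnavailable:
            if self.last_known_good is None:
                # Nothing cached yet, so there is no safe local state to fall back to.
                raise
            self.degraded = True
            return self.last_known_good
        # Upstream answered: update the cache and clear the degraded flag.
        self.last_known_good = config
        self.degraded = False
        return config
```

The `degraded` flag matters as much as the fallback itself: the site keeps operating, but operators can see that it is doing so without upstream control.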
Key architectural priorities increasingly include:
Distributed control with site-level autonomy.
Local survivability during WAN or cloud control loss.
Fault domains that limit orchestration blast radius.
Deterministic behavior under degraded connectivity.
Change validation and staged rollout controls.
Operational guardrails that constrain automation risk.
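Several of these priorities, notably fault domains, change validation and staged rollout, can be illustrated together. The following is a minimal sketch, assuming the caller supplies hypothetical `apply_change`, `is_healthy` and `rollback` callables; it is not how any particular orchestrator implements rollouts.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RolloutResult:
    applied: List[str] = field(default_factory=list)      # sites that kept the change
    rolled_back: List[str] = field(default_factory=list)  # sites reverted after a failed check
    halted: bool = False                                  # True if the rollout stopped early


def staged_rollout(
    fault_domains: List[List[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Apply a change one fault domain at a time and stop at the first unhealthy domain."""
    result = RolloutResult()
    for domain in fault_domains:
        for site in domain:
            apply_change(site)
        # Validate the whole domain before the change is allowed to spread further.
        if not all(is_healthy(site) for site in domain):
            for site in domain:
                rollback(site)
            result.rolled_back = list(domain)
            result.halted = True
            return result  # later domains are never touched
        result.applied.extend(domain)
    return result
```

Ordering the domains from smallest to largest (a canary site first) keeps the blast radius of a bad change to a single domain.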
Traditional resilience metrics emphasize uptime, focusing on whether infrastructure is reachable and powered. But for industrial and enterprise systems, the more meaningful measure is operational continuity: ensuring systems remain controllable, observable and safe under stress.
A system that is technically "up" but cannot be managed, authenticated or reconfigured is not operationally available.
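The distinction between "up" and operationally available can be made explicit in a health check. The sketch below assumes four hypothetical probe results; the field names and the three status labels are illustrative choices, not an industry standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResults:
    """Hypothetical probe outcomes for one service or site."""

    reachable: bool         # data plane: does the service answer requests?
    can_authenticate: bool  # identity: can an operator obtain working credentials?
    can_reconfigure: bool   # control plane: are configuration changes accepted?
    observable: bool        # telemetry: are metrics and logs still flowing?


def classify(probes: ProbeResults) -> str:
    """Separate plain uptime from operational availability."""
    if not probes.reachable:
        return "down"
    if probes.can_authenticate and probes.can_reconfigure and probes.observable:
        return "operational"
    # Reachable, but at least one control, identity or telemetry function is missing.
    return "up-but-not-operable"
```

A traditional uptime metric would count the third state as available; an operational-continuity metric would not.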
As enterprises expand edge deployments, adopt AI-driven workloads, and increase automation across infrastructure, the control plane becomes a primary risk surface.
Resilience strategies must evolve, extending beyond redundant hardware and multi-region failover to include distributed control design, process discipline and failure-containment architecture. This is a new architectural mindset, one that extends resilience to all the pieces that collectively determine how a cloud operates under pressure.
In an era defined by digital dependence, the real measure of cloud resilience is the ability to continue functioning when the unexpected happens. The lesson from outage trends is clear: Resilience is no longer defined only by what keeps running, but by what remains in control.
Wind River
Warren Bayek leads the CTO Office for Wind River's cloud strategy, including 5G cloud and virtualization, working with telco providers to accelerate virtual and Open RAN adoption. He has 30 years of experience designing and delivering complex, high-availability software in the telecom sector across large enterprises and startups.