Takeaways from the CrowdStrike outage: third-parties can pose risk
2024-07-24 · via Catchpoint Blog

“You can’t start a fire without a spark,” so sings The Boss. You can’t run an organization these days without digital dependencies. Nor, perhaps because of this, can we outrun digital failure.

That failure can come on the global scale of CrowdStrike crashing 8.5 million devices, bringing entire industries to a halt with it, or as a smaller yet still costly issue, like the cart on your eCommerce site breaking because your cloud provider went down (we have shared our perspective on both of Friday’s outages separately).

Our Head of Operations reminded internal teams today that there is no easy fix for these kinds of issues, such as simply deciding to remove single points of failure. “True multi-vendor always sounds easier than it actually is,” he counseled, whether that means running Linux and Windows for the same functions or attempting to split endpoint security between two different vendors. Even less easy are the issues being hotly discussed in the news around concentration and consolidation of large tech solutions.

One thing that is painfully clear, and easy to agree on, is that this outage will have caused sleepless nights in many, many IT departments. Those teams merit our thanks!

Major IT Outage hits banks, airlines, businesses worldwide

BSOD screens at an airport in New Delhi, India (Photo by Amarjeet Kumar Singh/Anadolu via Getty Images)

Still in progress… 5 takeaways from the CrowdStrike outage

The lessons from the CrowdStrike outage are still unfolding in real time as recovery efforts, which could take weeks, continue. And yes, both CrowdStrike and Microsoft have quickly put out guidance for recovery. Nonetheless, several general takeaways are already evident. Here are some of ours:

#1 - Everything is digital

As a result of CrowdStrike’s faulty software update, airlines were grounded (some still are), and banks, schools, governments, and businesses around the world have all been impacted. Furthermore, when societal functions as important as healthcare and the emergency services are underpinned by digital systems, and fail when those systems fail, the need for digital resilience is made startlingly clear.

#2 - We are all digitally interdependent (and can’t escape failure)

In Catchpoint’s Internet Resilience Report 2024, we interviewed over 300 global digital leaders about digital and Internet Resilience. One of the questions we asked was about their reliance on third-party providers. All but 1% of respondents said they had some reliance on third-party platform technology providers, and 77% said this was extremely or highly critical to their digital or Internet Resilience success.

We can’t remove these dependencies. They are too numerous and too intertwined, and they are what enable our sites and applications to function and our machines to stay secure. We also know they will fail at some point since, as Amazon CTO Werner Vogels famously stated, “Everything fails all the time.” However, your IT teams can chart your dependencies and monitor them as thoroughly as possible. Catchpoint’s Internet Stack Map can help you do that for your applications and services.

Seemingly small issues can cause big failures. This happened to us a few years ago with an unforeseen issue caused by an update to our Let’s Encrypt certificates. It could just as easily be a BGP, CDN, or DNS issue you are unaware of.
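Charting dependencies can start very simply. The sketch below is a minimal illustration, not Catchpoint's Internet Stack Map: it assumes a hand-written map of services to the things they rely on (all names are hypothetical) and walks it to find everything that would be affected, directly or transitively, if one provider failed.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it relies on directly.
DEPENDENCIES = {
    "checkout": ["payments-api", "cdn"],
    "payments-api": ["cloud-provider", "dns"],
    "cdn": ["dns"],
    "status-page": ["cloud-provider"],
}


def impacted_by(failed, dependencies):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: provider -> services that rely on it directly.
    dependents = {}
    for service, providers in dependencies.items():
        for provider in providers:
            dependents.setdefault(provider, set()).add(service)

    # Breadth-first walk up the inverted map from the failed component.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(impacted_by("dns", DEPENDENCIES)))
```

In this made-up map, a DNS failure reaches the checkout flow even though checkout never lists DNS as a direct dependency, which is the kind of hidden exposure a dependency chart is meant to surface.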

#3 - Be prepared (as best you can) for failure

If you know that a key provider is planning a major system update in a week or two’s time, you should aim to ensure that you will be able to stay resilient and bounce back quickly if there are problems. To do this, ready your teams with:

  1. A crisis call plan (who will be on the call, what steps should likely be taken, who should be contacted about specific issues, etc.)
  2. A clear understanding of the consequences of any failure
  3. A monitoring and observability plan that covers all bases
  4. A communications plan and easy-to-populate templates for how to share difficult information with customers, users and the wider public
  5. A process in place for effective postmortems

“There will be a lot of conversations around "trust but verify" on allowing auto-updating of tools and OSes. I know security teams would love it, but in terms of preventing this type of outage in particular, a staged rollout with verification before continuing should be the modus operandi goal.”

Tony Ferrelli, VP, Operations, Catchpoint

#4 - Test, baby, test… and monitor continuously!

When making software updates or changing configurations, test extensively on a variety of systems before deployment. Run those tests in environments that emulate real-world scenarios, including older systems that clients might still use. In fact, Southwest, whose planes stayed aloft through the global outage, was reportedly unaffected because it still runs Windows 3.1. That claim is unconfirmed and may equally be a joke, especially considering Windows 3.1 dates from 1992.

At any rate, the outage highlights a painful gap in CrowdStrike’s testing and validation processes.  

Put in place monitoring tools that can help detect issues early. Test before, during and after deployment, and monitor what happens so that you have clear benchmarks for understanding exactly what is going on. Early detection is crucial to reducing MTTR, which brings down the cost to systems, people and business. Catchpoint solutions support this process across the DevOps lifecycle.
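To make MTTR concrete, here is a small sketch that computes it from incident records. It assumes MTTR means the average time from the start of an incident to its resolution (definitions vary between teams), and the timestamps are invented sample data.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - started) across incidents, as a timedelta."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents as (started, resolved) pairs.
incidents = [
    (datetime(2024, 7, 1, 9, 0), datetime(2024, 7, 1, 9, 45)),    # 45 minutes
    (datetime(2024, 7, 9, 14, 0), datetime(2024, 7, 9, 15, 15)),  # 75 minutes
]
print(mean_time_to_recovery(incidents))  # 1:00:00
```

Because the clock starts when the incident begins rather than when someone notices it, every minute shaved off detection comes straight out of this number.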

Don’t forget to run experiments. After you’ve QA’ed everything you can think of, one practice that is extremely helpful is to do a slow push in production. Can you push the change to <1% of your fleet? Then slowly ramp up to 10% then 50% then 100%? Make sure to put error checking and validation in place to see how well your experiment is doing.  
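The slow push described above can be sketched as a loop over growing slices of the fleet, with a health check gating each step. This is an illustration under stated assumptions, not a production deployer: `deploy` and `is_healthy` are hypothetical callbacks you would supply, and the 1% error threshold is an arbitrary example value.

```python
import math

STAGES = (0.005, 0.10, 0.50, 1.00)  # under 1%, then 10%, 50%, 100% of the fleet
MAX_ERROR_RATE = 0.01               # assumed threshold; tune for your own systems


def staged_rollout(fleet, deploy, is_healthy,
                   stages=STAGES, max_error_rate=MAX_ERROR_RATE):
    """Deploy to growing slices of `fleet`, halting when a stage looks unhealthy.

    Returns (completed, hosts_updated).
    """
    updated = 0
    for fraction in stages:
        # Cumulative number of hosts that should have the change after this stage.
        target = min(len(fleet), max(1, math.ceil(len(fleet) * fraction)))
        batch = fleet[updated:target]
        for host in batch:
            deploy(host)
        updated = target
        # Validate the batch just deployed before widening the blast radius.
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_error_rate:
            return False, updated
    return True, updated


# Hypothetical fleet in which only the first 50 hosts tolerate the change.
fleet = [f"host-{i}" for i in range(1000)]
deployed = []


def is_healthy(host):
    return int(host.split("-")[1]) < 50


print(staged_rollout(fleet, deployed.append, is_healthy))
```

In this example the first stage (5 hosts) passes, the second stage surfaces the failures, and the rollout halts with 100 of 1,000 hosts touched instead of the whole fleet.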

#5 - Organizations must prioritize resilience

As organizations increasingly depend on complex, interdependent IT systems, it is essential that they prioritize resilience. That means having a plan for digital resilience at all levels of the org, along with a way to ensure the plan is actually implemented.

One way of doing this for large and medium-sized orgs, which we’ve written about elsewhere, is to appoint a Chief Resilience Officer, or CRO. Another, for smaller orgs to consider, is to develop a team inside your company responsible for resilience and, alongside this, institute a C-suite sponsor. Unless resilience is given executive-level importance and continuous focus, the risk of a poorly run change and incident management process will grow. Make the effort now (before failure happens) to ensure that resilience is embedded into every aspect of your organization. The goal: to make sure that your organization can withstand disruptions and recover from them quickly, safeguarding its reputation, its bottom line, and its ability to restore service to users as quickly as possible.

As well as fostering IT resilience in your infrastructure, applications and services, you also need to actively foster cultural resilience within your teams.

Employees need regular training on best practices in change management and incident response. Clear, well-documented processes need to be in place, and followed, for handling updates and changes. A just, blameless culture is also important. As we saw in The SRE Report 2023, this is not just a matter of engendering a just culture for its own sake: our data revealed that the majority of Elite organizations (per DORA) are “very” or “extremely” blameless.

The scale of this outage is going to cause a lot of internal debate, and IT/business pendulums will very likely swing to extremes in an effort to prevent similar ramifications at scale. It will be useful to have a focused team of tech and business leaders in place to keep internal teams from swinging too far in any direction and to focus jointly on what makes sense for your specific business and/or area of operations.

Don’t take digital resilience for granted

As we said, the ramifications of this outage are still unfolding. Nonetheless, we all know one thing: there will be another not far behind. Digital resilience cannot be taken for granted.

Assess the current state of digital and Internet Resilience in Catchpoint’s inaugural report: https://www.catchpoint.com/asset/internet-resilience-report-2024 (No registration required)

Featured image source: Smishra1, CC BY-SA 4.0, via Wikimedia Commons
