July 19th global IT outage reminds us of digital complexity
2024-07-19 · via Catchpoint Blog

As we write, on Friday July 19th, a massive global IT outage is continuing to take down critical services around the world that depend on Microsoft-based computers.

In what appears to be one of the biggest outages ever, daily life is being impacted around the world, from the micro scale (in the UK, for instance, local doctors are seeing only very ill patients and writing up their notes by hand) to the macro: major airlines grounded, emergency services taken offline, and major banks and enterprises unable to do business.

Yesterday, CrowdStrike released an update that began impacting IT systems globally. We are aware of this issue and are working closely with CrowdStrike and across the industry to provide customers technical guidance and support to safely bring their systems back online.

— Satya Nadella (@satyanadella) July 19, 2024

Cybersecurity firm CrowdStrike has taken responsibility for the issue, blaming a faulty automatic software update that knocked affected Microsoft PCs and servers offline and forced them into a recovery boot loop, preventing the machines from starting properly.

“The scale of today's global IT outage is unparalleled in recent history. It serves as a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient,” said Mehdi Daoudi, CEO and co-founder of Catchpoint. “It is also a reminder you need to manage and control change: Don't blindly update software or change configuration.

At any moment, even the smallest oversight or piece of unpreparedness can bring systems—and consequently businesses—down.

Preparation and visibility are key, not just to prevent such outages but to mitigate the vast financial risks they pose. The fallout from today’s event will likely be measured not just in the disruption of services but in exponential financial losses worldwide, potentially amounting to millions or even billions in lost revenue. It highlights a critical vulnerability: our increasing dependency on digital infrastructure can translate into staggering costs when that infrastructure fails. Kudos to all the IT professionals and teams who are working tirelessly to resolve this issue and restore services.”  

A second major outage within the last 24 hours

While everyone is looking at CrowdStrike (the exact scale, impact and ramifications of which are harder to detect from outside – since it’s caused by faulty software and not a service), Catchpoint caught a separate significant outage within the last 24 hours, which very likely impacted some companies twice.

This has caused widespread confusion in the media, as various news sites have linked the two issues when, in fact, they are independent of one another.

Any Internet-based service that relied on Azure’s Central US region during the July 18th incident, and did not have a multi-region or multi-cloud strategy in place, would have been impacted. That includes knock-on dependencies, such as APIs used by eCommerce sites, whose failure affected site functionality. Let’s take a deeper look.

Catchpoint’s Internet Sonar detects initial set of issues impacting Azure

On Thursday, July 18th, 2024, Catchpoint’s Internet Sonar detected an outage in Azure services that disrupted critical services across the Central US region. The outage lasted from 18:37 to 22:17 EDT and led to numerous sites returning HTTP 503 responses, particularly those using Azure Functions. Catchpoint data quickly isolated the issue and confirmed it was not network-related, saving network teams time and resources on unnecessary triage and network troubleshooting.
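The "network or service?" triage described above can be pictured as a small classifier over the outcome of a synthetic probe. This is only a minimal sketch, not a description of how Internet Sonar works: the `ProbeResult` fields and the `classify` function are hypothetical names introduced for illustration. The reasoning it encodes is that an HTTP 503 can only come back once DNS resolution and the TCP connection have succeeded, which points away from the network path and toward the service (or something in front of it).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProbeResult:
    """Outcome of one synthetic check (hypothetical structure)."""
    dns_ok: bool                 # hostname resolved
    tcp_ok: bool                 # TCP connection established
    http_status: Optional[int]   # None if no HTTP response came back


def classify(result: ProbeResult) -> Tuple[str, str]:
    """Return (layer, detail); layer is 'network', 'service', 'unknown' or 'ok'."""
    if not result.dns_ok:
        return ("network", "DNS resolution failed")
    if not result.tcp_ok:
        return ("network", "TCP connect failed")
    if result.http_status is None:
        # Connected, but nothing came back: could be either side.
        return ("unknown", "no HTTP response")
    if 500 <= result.http_status <= 599:
        # The server answered, so the network path worked end to end.
        return ("service", f"HTTP {result.http_status}")
    return ("ok", f"HTTP {result.http_status}")
```

For a probe that resolved and connected but received a 503, `classify(ProbeResult(True, True, 503))` returns `("service", "HTTP 503")`, which is the signal that lets a network team stand down.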

Internet Sonar shows Azure Services outages impacting critical services across the Central US region (Internet Sonar/Catchpoint)

Major impact on Microsoft services

During this period, Microsoft 365 services were also impacted. Users encountered difficulties accessing a range of business-critical services, including SharePoint Online, OneDrive, Teams and other Microsoft services. 

During the outage, assets stored on OneDrive were significantly affected, with users experiencing HTTP 503 responses when attempting to access these files.   

Microsoft Teams also faced disruptions during the outage. Users encountered HTTP 503 responses when accessing Teams in the browser.

Impact to eCommerce providers

We also observed failures on API requests for some major e-commerce providers, which caused issues when users attempted to add products to their carts.   

Major outages reveal a complex digital world

In The Internet Resilience Report 2024, when we asked how critical third-party platform providers were to digital or Internet Resilience success, only 1% of respondents said there was no criticality at all. 77%, meanwhile, said third-party providers were extremely or highly critical to their Internet Resilience success.

The two major IT outages within the last 24 hours demonstrate once again how interdependent we are in today’s highly complex digital world. There are so many different operating systems in use, so many services. You never know when someone might bring you down, and you need to be ready for when they do. As these outages show, multiple things can fail, and the ramifications can be enormous when they do.  

3 key takeaways

#1 - Prepare for failure

It’s crucial to prepare ahead of time. The faster an outage is detected, the faster remediation efforts can begin to minimize the impact to the bottom line. Our customers tell us repeatedly that one of the primary reasons they work with Catchpoint is proactive detection of outages and service degradations, which we can often highlight ahead of the vendor’s own announcements.

#2 - Know your dependencies and monitor them

Chart your dependencies - you can use Catchpoint’s Internet Stack Map to do exactly that. CNN’s homepage alone, for example, relies on over 600 dependencies to load. As this latest outage proves at a massive scale, the Internet is not infallible. From security software to cloud services, we are clearly hugely reliant on third parties.

One of the ways for your sysadmins and operations teams to get the sleep they deserve – and to mitigate the impact of incidents proactively – is to remove any monitoring gaps. Achieve Internet resilience by monitoring the output and performance of every component - external and internal.  
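As a very rough first pass at charting dependencies, you could list the third-party hosts a page references in its static HTML. The sketch below is an illustration only and is not how Internet Stack Map works; the function name `third_party_hosts` and the example hostnames are made up. It uses Python's standard `html.parser` and `urllib.parse`, and it will miss anything loaded dynamically by scripts, so treat its output as a lower bound.

```python
from html.parser import HTMLParser
from typing import List, Set
from urllib.parse import urlparse

# Tags whose src/href attributes usually cause the browser to fetch something.
_RESOURCE_TAGS = {"script", "img", "iframe", "link"}


class _ResourceCollector(HTMLParser):
    """Collects src/href URLs from resource-loading tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in _RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def third_party_hosts(html: str, first_party_host: str) -> Set[str]:
    """Return hosts referenced by the page that are not the site's own."""
    collector = _ResourceCollector()
    collector.feed(html)
    hosts: Set[str] = set()
    for url in collector.urls:
        host = urlparse(url).hostname  # None for relative URLs
        if host is None:
            continue
        if host == first_party_host or host.endswith("." + first_party_host):
            continue  # same site or one of its subdomains
        hosts.add(host)
    return hosts
```

Each host in the result is a candidate for its own monitor, since a failure at any one of them can degrade the page even when your own servers are healthy.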

#3 - Trust and verify changes

These outages are a reminder you need to manage and control change. Perhaps the biggest takeaway of all: don't blindly update software or change configuration. Control software changes and always test before you globally deploy.
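"Test before you globally deploy" is often implemented as a staged (ring-based) rollout with a health gate between stages. The following is a minimal sketch of that idea under stated assumptions: `staged_rollout`, the ring names, and the `deploy` and `is_healthy` callables are all hypothetical, and a real pipeline would add soak time, metrics, and rollback.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def staged_rollout(
    rings: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy ring by ring, halting at the first ring that fails its check.

    Returns (rings that passed, ring where the rollout halted or None).
    """
    completed: List[str] = []
    for ring in rings:
        deploy(ring)
        if not is_healthy(ring):
            # Stop here: later (larger) rings never receive the change.
            return completed, ring
        completed.append(ring)
    return completed, None
```

With rings ordered from smallest to largest, for example `["canary", "early", "global"]`, a bad change that fails its check in `"early"` never reaches `"global"`.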

Ultimately, developing failover strategies for all your important services – across the spectrum, from security services to web performance – is essential in today’s complex interdependent digital world.  
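At its simplest, a failover strategy is an ordered list of endpoints plus a rule for skipping unhealthy ones. The sketch below assumes a hypothetical `pick_endpoint` helper and placeholder `example.com` URLs; the health check is passed in as a function so the selection logic stays independent of how health is actually measured.

```python
from typing import Callable, Iterable, Optional


def pick_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in preference order, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None  # every region is down: time for a static fallback or error page


# Placeholder URLs, listed in order of preference.
REGION_ENDPOINTS = [
    "https://centralus.example.com",
    "https://eastus.example.com",
]
```

If the health check reports the first region as down, `pick_endpoint(REGION_ENDPOINTS, check)` returns the second URL, so traffic moves without waiting for the provider to recover.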
