Learnings from ServiceNow’s Proactive Response to a Network Breakdown
2024-08-23 · via Catchpoint Blog

ServiceNow is undoubtedly one of the leading players in the fields of IT service management (ITSM), IT operations management (ITOM), and IT business management (ITBM). When they experience an outage or service interruption, it impacts thousands.

The indirect and induced impacts have a multiplier effect on the larger IT ecosystem.

Think about it. If an outage disrupts a workflow, the ripple effects are wide. For example:

  • IT teams are not able to deliver resilient services and experience levels to their workforce. This risks breaching experience level agreements, and the employees who were relying on IT have their day disrupted as well.
  • Security teams are not able to respond to threats and vulnerabilities. This risks increased exposure and a failure to adhere to strict governance and compliance mandates.
  • Application owners’ and developers’ automation tasks screech to a halt. This risks missing an important release, which in turn may mean additional meetings or eroded trust among customers who were expecting a product update.
  • Finance and procurement are not able to engage suppliers for products and services, or pay for them. This risks, for example, a manufacturing plant’s expensive assembly lines, well, not being able to assemble.

The list goes on.  

Unfortunately, ServiceNow recently experienced just such an incident.

We analyzed it using Catchpoint’s Internet Performance Monitoring (IPM) data. The data shows that ServiceNow took proactive steps to lessen the duration and impact of what could have been a much larger, more impactful incident.

Let’s dissect.

What happened?

On 15th Aug 2024 at 14:15 ET, ServiceNow’s core services went down, with reports showing intermittent success depending on connectivity with the upstream providers. Failures were reported until 16:18 ET, a window of 2 hours 3 minutes. The outage impacted not only ServiceNow’s portal resources but also their client integrations.
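
This analysis mixes Eastern Time (for the synthetic test data) with UTC (for the RIPEstat BGP data), so it can help to normalize the timestamps. The following is a minimal sketch that assumes Eastern Daylight Time (UTC-4) was in effect on 15th Aug 2024; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Daylight Time (UTC-4) applied on 15 Aug 2024.
EDT = timezone(timedelta(hours=-4), name="EDT")

# Outage window as reported in this article (Eastern Time).
outage_start = datetime(2024, 8, 15, 14, 15, tzinfo=EDT)
outage_end = datetime(2024, 8, 15, 16, 18, tzinfo=EDT)

# Length of the outage and the same window expressed in UTC.
duration = outage_end - outage_start
start_utc = outage_start.astimezone(timezone.utc)
end_utc = outage_end.astimezone(timezone.utc)

print(f"Duration: {duration}")                              # Duration: 2:03:00
print(f"UTC window: {start_utc:%H:%M} to {end_utc:%H:%M}")  # UTC window: 18:15 to 20:18
```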

Scatterplot highlights intermittent failures due to Network Outage

Catchpoint’s Internet Sonar began triggering alerts as results from the existing synthetic tests crossed their thresholds. The Internet Sonar dashboard populated dynamically with outage data showing Response and Connection Timeout errors from major geographic locations. Looking at the outage trend, we found resources to be intermittently reachable, while the majority of requests faced high connect times.

Catchpoint Internet Sonar Dashboard of ServiceNow incident

Waterfall data highlighting High Connection Time and Internet Sonar correlation with ServiceNow’s regional outage

This outage resulted from instability in ServiceNow's connectivity with its upstream providers, particularly AS 6461 | Zayo.

We observed this behavior in the Catchpoint portal with our Traceroute monitor.  

15th Aug 11:00 ET to 14:00 ET, before the outage:

15th Aug 14:15 ET to 16:20 ET, during the outage:

15th Aug 16:20 ET to 18:00 ET, after the outage:

ServiceNow uses multiple ISPs for its datacenter locations, as listed in this article: https://support.servicenow.com/kb?id=kb_article_view&sysparm_article=KB0547560. Of the listed ISPs, the core ISPs that are direct BGP neighbors of ServiceNow are Lumen (3356), Cogent (174), Zayo (6461), Level3 (3356), AT&T (7018) and Verizon (6167).

Before the outage, AS 6461 | Zayo was one of the most preferred paths for traffic inbound to ServiceNow:

RIPEstat: IPV4/6 ASN Neighbours of ServiceNow
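
For readers who want to reproduce a view like this, RIPEstat offers a public data API. The sketch below assumes the `asn-neighbours` data call and a response containing a `neighbours` list with `asn`, `type`, and `power` fields, where `left` neighbours are those seen on the upstream side of AS paths; check the RIPEstat documentation before relying on these names. The sample payload and its power values are invented for illustration and are not real measurements.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed RIPEstat endpoint for the neighbours of an ASN.
RIPESTAT_BASE = "https://stat.ripe.net/data/asn-neighbours/data.json"

def neighbours_url(asn: int) -> str:
    """Build the query URL for one ASN."""
    return f"{RIPESTAT_BASE}?{urlencode({'resource': f'AS{asn}'})}"

def fetch_neighbours(asn: int) -> dict:
    """Fetch the raw response (performs a network request; not called below)."""
    with urlopen(neighbours_url(asn), timeout=30) as resp:
        return json.load(resp)

def upstream_neighbours(payload: dict) -> list:
    """Return (asn, power) pairs for 'left' neighbours, strongest first."""
    rows = payload.get("data", {}).get("neighbours", [])
    left = [(n["asn"], n.get("power", 0)) for n in rows if n.get("type") == "left"]
    return sorted(left, key=lambda pair: pair[1], reverse=True)

# Invented sample shaped like the assumed response (64500 is a documentation ASN).
sample = {"data": {"neighbours": [
    {"asn": 3356, "type": "left", "power": 40},
    {"asn": 6461, "type": "left", "power": 90},
    {"asn": 64500, "type": "right", "power": 5},
]}}

print(neighbours_url(16839))
print(upstream_neighbours(sample))  # [(6461, 90), (3356, 40)]
```

To query live data, call `upstream_neighbours(fetch_neighbours(16839))`, keeping in mind that the response shape is an assumption here.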

But as soon as Zayo started having major issues, path fluctuations followed, which ultimately led the ServiceNow team to go through multiple BGP events (announcements, re-announcements, and withdrawals) during the process.

RIPEstat BGP activity: Spike in # of announcements and withdrawals

Let’s break down the BGP activity from RIPEstat highlighted above into three phases:

  • Before the outage
  • During the outage
  • After the outage

Before the outage (14th Aug, 00:00 to 23:59 UTC), we observed a total of 178 events for AS 16839. These events give us a view of BGP activity for the ServiceNow ASN, including neighbor changes:

During the outage on 15th Aug, the number of events increased drastically compared with the day before, to 491, with a lot of route withdrawals and re-announcements.

This abnormality highlights the network volatility: essentially, the changes the ServiceNow team made, manually or through automation, to keep their services reachable from the Internet. While those changes were being made, the ServiceNow portal and partner integrations continued to have connectivity issues.
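
The day-over-day comparison above can be expressed as a simple ratio check. This is a minimal sketch: the event counts come from this article, while the 2x alerting threshold is a hypothetical value chosen for illustration, not a Catchpoint or RIPEstat default.

```python
def bgp_event_change(baseline: int, observed: int) -> tuple:
    """Return (ratio, percent_increase) of observed events versus a baseline day."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = observed / baseline
    return ratio, (ratio - 1) * 100

# Event counts for AS 16839 reported in this article.
BASELINE_EVENTS = 178  # 14th Aug
OUTAGE_EVENTS = 491    # 15th Aug

# Hypothetical alerting threshold; tune it against your own history.
SPIKE_RATIO_THRESHOLD = 2.0

ratio, pct = bgp_event_change(BASELINE_EVENTS, OUTAGE_EVENTS)
is_spike = ratio >= SPIKE_RATIO_THRESHOLD
print(f"{ratio:.2f}x baseline (+{pct:.0f}%), spike={is_spike}")
# 2.76x baseline (+176%), spike=True
```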

As we kept a close eye on the affected network, we observed that after the outage the ServiceNow ASN was no longer connected to, or receiving traffic directly through, Zayo, which points to issues specifically on the Zayo-ServiceNow link during the incident. BGP did its job, and traffic found a reliable way to its destination via ServiceNow’s other providers.

Eventually, the issue with the Zayo-ServiceNow link was resolved about 10 hours (15th Aug, 20:25 UTC to 16th Aug, 06:28 UTC) after the first hints of the incident, and traffic started to be routed through the originally preferred links again:

Viewing the network outage in the Catchpoint platform

In addition to the Internet Sonar and synthetic test alerts, we saw this sequence in the Catchpoint portal:

  1. Initial Re-announcements & Prepending
  2. Path Fluctuations & Community Tags
  3. Route Withdrawals & Re-announcements
  4. Path Changes Involving Prepending & Path Optimization
  5. Final Stabilization
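
Two of the patterns listed above, prepending and a change of upstream, can be read directly off a BGP AS path. The sketch below illustrates the idea with made-up AS paths; only the AS numbers for ServiceNow (16839), Zayo (6461), and Lumen (3356) come from this article, and the helper names are hypothetical rather than part of any Catchpoint API.

```python
from itertools import groupby
from typing import Optional

ZAYO_ASN = 6461

def prepend_counts(as_path: list) -> dict:
    """Map each ASN repeated consecutively in the path to its repeat count."""
    counts = {}
    for asn, run in groupby(as_path):
        length = len(list(run))
        if length > 1:
            counts[asn] = length
    return counts

def upstream_of_origin(as_path: list) -> Optional[int]:
    """Return the ASN directly before the origin AS, ignoring prepends."""
    collapsed = [asn for asn, _ in groupby(as_path)]
    return collapsed[-2] if len(collapsed) >= 2 else None

# Made-up paths as seen from a hypothetical collector in AS 64500 (documentation ASN).
before = [64500, 6461, 16839]
during = [64500, 6461, 16839, 16839, 16839]  # origin prepended on the Zayo path
after = [64500, 3356, 16839]                 # Zayo no longer the upstream

for label, path in (("before", before), ("during", during), ("after", after)):
    via_zayo = upstream_of_origin(path) == ZAYO_ASN
    print(label, "via Zayo:", via_zayo, "prepends:", prepend_counts(path))
```

Prepending the origin AS makes a path look longer, so other networks tend to prefer alternatives; that is one common way to deprioritize a troubled upstream without withdrawing the route entirely.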

At 15:51 ET, we saw services begin returning to normal, though at a reduced level, as the ServiceNow team, responding to the BGP events, rolled back the routes and redirected traffic to alternate IPs:

As the snippet above highlights, we can observe the traffic routing before, during, and after the issue. We can see the redirection in real time as requests are routed to a new IP, 149.95.29.217 (part of subnet 149.95.16.0/20), instead of the original IP, 149.95.45.217. This was part of the mitigation, which deprioritized traffic via Zayo using BGP updates.
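
As a quick sanity check on the addresses above, Python's standard `ipaddress` module can show which of the two falls inside 149.95.16.0/20. This only illustrates subnet membership; it is not part of Catchpoint's tooling.

```python
import ipaddress

# Subnet and addresses quoted in this article.
subnet = ipaddress.ip_network("149.95.16.0/20")
original_ip = ipaddress.ip_address("149.95.45.217")
new_ip = ipaddress.ip_address("149.95.29.217")

# A /20 spans 4096 addresses: third octet 16 through 31 here.
print(subnet[0], "-", subnet[-1])  # 149.95.16.0 - 149.95.31.255
print(new_ip in subnet)            # True
print(original_ip in subnet)       # False
```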

Lessons learned

Even though we have little control over the Internet, there are lessons to take from this outage. When BGP events take place, failing to take proactive, necessary action can lead to large, long outages. The ServiceNow team took the necessary steps based on the network fluctuation they observed, and restored connectivity to core resources and client integrations.

This outage can serve as a great learning experience for a lot of organizations to:

  • Review their monitoring strategy for everything in between their end users and their content.
  • Identify fallback mechanisms.
  • Monitor their vendors and hold them accountable.
  • Review their Incident Management practices and those of their vendors.
  • Remember to test mitigation plans.  

In today's distributed environment, the application delivery chain is made up of numerous disparate but interdependent parts, and incidents like this demonstrate the impact a network outage can have on your infrastructure - DNS, load balancers, CDN, cloud infrastructure, datacenters, and so on - but most importantly on your end-user experience and overall business.

Summary

  • Organizations must make sure they are monitoring each service as well as the network that underpins it. Major ISPs will inevitably experience downtime, which costs businesses globally millions of dollars in lost revenue, lost productivity, and decreased service dependability each time it occurs.
  • An outage, be it micro or major, could be tied to a microservice or to the failure of the infrastructure.
  • It is imperative for these service providers to maintain SLAs. Customers cannot rely on a vendor's assurance of service levels alone. These kinds of outages lead to SLA breaches, and you may not be able to hold the vendor liable for penalties if you don't have data to substantiate the outage.
  • Proactive monitoring, combined with end-to-end incident management, significantly decreases mean time to detect (MTTD).
  • To help teams process incidents quickly and smoothly, Catchpoint provides a comprehensive view of key asset data, indicators, historical data, and more.
  • Strategic proactive monitoring increases the effectiveness of Ops, SRE, and SOC / NOC teams by capturing and correlating metric data from multiple sources and pointing to the microservice, or any other moving part of the delivery chain, that is at fault, which helps minimize mean time to resolve (MTTR).
