Zendesk outage: A case for proactive monitoring and faster incident response
2025-03-21 · via Catchpoint Blog

On March 20, 2025, starting at 15:43 UTC, Zendesk users globally encountered 503 “Service Unavailable” errors and other 5xx server-side issues, disrupting access to critical support tools and communication channels. While immediate mitigations stabilized core services, intermittent issues continued for over 24 hours, underscoring the complexity of multi-pod infrastructure failures.

Timeline of Events

  • 15:29 UTC: Zendesk’s internal team confirmed user reports of access issues.
  • 15:50 UTC: Root cause identified as widespread 503 errors impacting multiple service pods.
  • March 21, 2025, 10:59 PM UTC (06:29 AM EDT): Zendesk’s status page confirmed recovery of “majority of issues,” with ongoing efforts to resolve lingering intermittent failures.

Error message on accessing Zendesk services

The 503 Service Unavailable responses were not immediately recognized as the root cause of the disruption. This delay hindered Zendesk’s ability to fully understand the scope and impact of the outage in a timely manner and slowed the process of restoring service and assisting customers.

When Zendesk went down, businesses felt the impact

The Zendesk outage crippled workflows for thousands of businesses relying on the platform for customer support, sales, and internal collaboration. With critical processes disrupted, teams struggled to deliver timely and effective customer service. Below are some of the key consequences:

#1 Access issues across multiple pods

Many users encountered 5xx errors, which indicated problems on the server side. Users across industries—from retail to healthcare—were abruptly locked out of Zendesk portals. Support teams couldn’t view tickets, update cases, or access customer history.
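To make the 5xx distinction concrete, here is a minimal sketch of how a monitoring script might label status codes. The function names `classify_status` and `describe` are illustrative assumptions, not part of any Zendesk or Catchpoint API; the only library used is Python's standard `http` module.

```python
from http import HTTPStatus


def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse class (hypothetical labels)."""
    if 500 <= code <= 599:
        return "server_error"   # the provider's side failed, e.g. 503
    if 400 <= code <= 499:
        return "client_error"   # the request itself was rejected
    if 200 <= code <= 399:
        return "ok"             # success or redirect
    return "other"


def describe(code: int) -> str:
    """Return a readable line such as '503 Service Unavailable: server_error'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in the standard library enum.
        phrase = "Unknown"
    return f"{code} {phrase}: {classify_status(code)}"


print(describe(503))  # 503 Service Unavailable: server_error
```

A 503 lands in the `server_error` class, which is the signal that the fault sits with the provider rather than with the user's request.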

#2 Service degradation

Some services needed additional time to restart, leading to intermittent errors—often at the worst possible moments for businesses trying to handle customer inquiries. As a result, support agents wrestled with inconsistent access and were forced to pause or redo tasks.

#3 Impact on communication channels

Zendesk’s core support tools went offline for large sections of the outage, limiting response times and workflow coordination. Web widgets used on company websites for direct customer engagement also went down, frustrating users who expected immediate assistance or quick self-service options.

#4 Prolonged resolution window

Even though Zendesk reported “majority of services” were restored by March 21, intermittent errors lingered for more than 24 hours. This may have forced businesses to switch to manual processes.  

How Internet Sonar caught the outage before Zendesk did

Catchpoint’s Internet Sonar flagged the outage at 15:22 UTC, 21 minutes before Zendesk’s internal alerts.
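The 21-minute figure can be checked with simple timestamp arithmetic. The sketch below assumes the internal alert time is 15:43 UTC, which is the outage start time given at the top of this post and is consistent with the stated lead; the helper name `lead_time_minutes` is hypothetical.

```python
from datetime import datetime, timezone


def lead_time_minutes(external_detect: datetime, internal_alert: datetime) -> float:
    """Minutes by which external detection preceded the internal alert."""
    return (internal_alert - external_detect).total_seconds() / 60


# Times taken from this post; 15:43 UTC is assumed to be the internal alert time.
sonar_flag = datetime(2025, 3, 20, 15, 22, tzinfo=timezone.utc)
zendesk_alert = datetime(2025, 3, 20, 15, 43, tzinfo=timezone.utc)

lead = lead_time_minutes(sonar_flag, zendesk_alert)
print(f"Detection lead: {lead:.0f} minutes")  # Detection lead: 21 minutes
```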

Internet Sonar dashboard

The Internet Sonar dashboard shows Zendesk’s outage affecting multiple global locations, with 100% downtime reported across several cities.  

Scatterplot showing multiple tests run against Zendesk domain failing

The scatterplot above from Internet Sonar visualizes the Zendesk outage, showing a surge in failed tests (red markers) starting around 15:22 UTC. The concentration of failures indicates a widespread service disruption.
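As a rough illustration of how a surge like this could be picked out of raw test results, the sketch below groups probes into time windows and reports the first window where failures are both frequent and spread across locations. This is not Internet Sonar's actual algorithm; the `Probe` type, the thresholds, the city names, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class Probe(NamedTuple):
    minute: int    # minutes since midnight UTC (15:22 -> 922)
    location: str  # hypothetical test location
    status: int    # HTTP status code returned to the probe


def first_surge(probes: Iterable[Probe], window: int = 5,
                min_failure_rate: float = 0.5,
                min_locations: int = 2) -> Optional[int]:
    """Return the minute of the earliest failure in the first window that
    looks like a widespread outage, or None if no window qualifies."""
    buckets = defaultdict(list)
    for p in probes:
        buckets[p.minute // window * window].append(p)

    for start in sorted(buckets):
        group = buckets[start]
        failed = [p for p in group if 500 <= p.status <= 599]
        rate = len(failed) / len(group)
        # Require failures from several locations to rule out a local blip.
        if rate >= min_failure_rate and len({p.location for p in failed}) >= min_locations:
            return min(p.minute for p in failed)
    return None


# Made-up sample data shaped like the scatterplot: healthy, then failing.
sample = [
    Probe(915, "New York", 200), Probe(915, "London", 200), Probe(915, "Tokyo", 200),
    Probe(921, "Sydney", 200),
    Probe(922, "New York", 503), Probe(922, "London", 503), Probe(923, "Tokyo", 503),
]

surge = first_surge(sample)
print(f"{surge // 60:02d}:{surge % 60:02d} UTC")  # 15:22 UTC
```

Requiring multiple locations is the design choice that separates a global provider outage from a single probe's network problem.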

503 Service Unavailable errors affecting Zendesk

Key takeaways from the Zendesk outage

The Zendesk outage underscores why real-time visibility, proactive monitoring, and a deep understanding of third-party dependencies are critical to maintaining Internet resilience.

1. The cost of delayed root cause analysis

Zendesk’s internal team took 21 minutes to correlate user reports with the 503 errors already flagged by Catchpoint’s Internet Sonar. While this might not seem like a long time, every minute of downtime means lost revenue, frustrated customers, and operational disruptions. The longer it takes to pinpoint the issue, the longer it takes to fix it.  

Without immediate visibility into where and why a problem is occurring, IT teams waste precious time in war rooms and finger-pointing exercises, trying to determine if the issue is internal or caused by a third-party provider.  

2. Independent proactive monitoring of your Internet Stack is essential

No organization can afford to operate without independent, proactive visibility into their digital ecosystem. That’s where Catchpoint’s suite of Internet Performance Monitoring (IPM) tools comes in.  

Catchpoint’s Internet Sonar

Internet Sonar is a powerful tool that eliminates guesswork by providing real-time, independent Internet health data. Using Internet Sonar, you’ll know whenever a third party has an outage, where they have it, how long it's been going on, and whether or not it's likely to affect you.  

Internet Sonar doesn’t just tell you a site is down after people tweet about it; it reports incidents from its own independent, real-time measurements. This means no finger-pointing, no war rooms, just straightforward, intelligent, and reliable Internet health information so you can get ahead of third-party incidents that affect productivity or user experience.

3. The hidden challenges of multi-pod infrastructure

The Zendesk outage also exposed vulnerabilities of multi-pod architectures, where failures in one pod cascaded into issues across multiple regions. While these architectures are designed for scalability and redundancy, they introduce complexities that can extend downtime when something goes wrong.

In this case, even after initial recovery, intermittent failures continued for over 24 hours, preventing full service restoration. For companies reliant on cloud-based applications like Zendesk, this reinforces the need for deep visibility into third-party infrastructure dependencies to understand:

  • Where failures are occurring
  • How they impact interconnected systems
  • How long the recovery process might take

Catchpoint’s Internet Stack Map can help with this by showing a live view of the health of your digital service and the services it depends on.  

Catchpoint’s Internet Stack Map

By automatically discovering third-party dependencies, it helps organizations understand the health of their digital ecosystem at a glance. When one component fails, it’s clearly highlighted, making root-cause analysis seamless.  
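The idea of tracing a failed component through a dependency map can be sketched in a few lines. This is not how Internet Stack Map is implemented; the service names, the `stack` dictionary, and the `impacted_services` function are hypothetical and exist only to show the graph traversal.

```python
from collections import deque
from typing import Dict, List, Set


def impacted_services(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each dependency, record who relies on it.
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:  # breadth-first walk up the dependency chain
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


# Hypothetical dependency map: service -> services it depends on.
stack = {
    "support-portal": ["zendesk", "cdn"],
    "help-widget": ["support-portal"],
    "checkout": ["payments-api", "cdn"],
    "zendesk": [],
}

print(sorted(impacted_services(stack, "zendesk")))
# ['help-widget', 'support-portal']
```

In this made-up map, a Zendesk failure reaches the help widget only through the support portal, which is the kind of indirect dependency that is easy to miss without a map.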

Learn more about preventing outages from our guide, or test drive Catchpoint for yourself in our guided product tour.
