The $1 Million Lesson: Building a Culture of Quality Through SLAs
2025-03-07 · via Catchpoint Blog

In the early days of DoubleClick, back when SaaS was still known as Application Service Provider (ASP), I was tasked with setting up the QoS (Quality of Service) Team. Our primary mission was to establish a monitoring system, but we quickly found ourselves managing Service Level Agreements (SLAs)—a task that became critical after we paid out over $1 million in penalties for SLA violations to a single customer. The reason? Someone had signed a contract promising 100% uptime, an impossible commitment.  

This is the story of how we took control of our SLAs, stopped the financial bleeding, and built a culture of quality around service metrics. Whether you’re managing SLAs today or just curious about how they work, this post will provide valuable insights into the challenges we faced, the solutions we implemented, and the lessons we learned along the way.

What are SLAs?

An SLA (Service Level Agreement) is a contractual agreement between a vendor and a customer that outlines the expected level of service. Under this legal umbrella, you’ll find Service Level Objectives (SLOs), which define specific metrics like uptime, speed, or transactions per second.

At DoubleClick, we defined SLAs with the following principles in mind:

  • Attainable: The goals should be realistic.
  • Repeatable: The metrics should be consistently measurable.
  • Measurable: The performance should be quantifiable.
  • Meaningful: The metrics should matter to the business.
  • Mutually Acceptable: Both parties should agree on the terms.

SLAs benefit both the customer and the vendor. For customers, they provide objective grading criteria and protection from poor service. For vendors, they set clear expectations and incentivize quality improvements.

Ground zero, discovery

When we first tackled the SLA problem, we were in crisis mode. The first step was to compile a list of all contracts, extract the SLAs and SLOs, and document the associated penalties. We stored this information in a database and began educating stakeholders—business leaders, legal teams, and executives—about the importance of SLAs.

From the beginning, we focused on end-user experience-based SLAs. This meant measuring performance from the user’s perspective, not just from the server’s perspective.

A universal challenge

Over the years, I’ve seen many companies face similar issues. Not all SRE and Dev teams fully grasp the SLAs their organization has with customers—they often focus heavily on internal SLOs while overlooking how those metrics tie directly to contractual commitments. For instance, after facing significant penalties, companies like Slack revised their SLA terms to better align internal goals with customer promises.

SLA Application Performance

Establishing an SLA takes more than putting a few sentences in the contract. We paid $1 million because there was no SLA management system in place. We then started building a Service Level Management (SLM) practice that relied on four pillars: Administration, Monitoring, Reporting, and Compliance (AMRC).

The SLM process

We sat down with business partners, customers, legal, and finance teams to create a process that would prevent costly mistakes in the future. This process, which we called the SLA lifecycle, was reviewed quarterly to ensure it remained effective and aligned with our business goals.

  1. Risk simulations with data science: One of the most critical steps in our SLM process was using our in-house data scientists to run simulations. These simulations analyzed historical data from our monitoring tools to assess the risk of breaching SLAs. The goal was to set realistic SLAs that wouldn’t be breached every day, while still meeting customer expectations.
  2. “What-if” scenarios: We also ran multiple “what-if” scenarios to understand the relationship between availability and revenue. These scenarios helped us evaluate the impact of downtime at different hours of the day and days of the week. For example, we could see how a 10-minute outage during peak traffic hours would affect revenue compared to the same outage during off-peak times.
  3. The SLA desk: To streamline the process, we created an online tool in 2001 (essentially an “SLA desk”) that allowed our sales team to request SLA portfolios for customers. These requests were reviewed and approved by our QoS team, ensuring that every SLA was realistic, measurable, and aligned with our capabilities.
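The risk simulation step can be illustrated with a small sketch. It replays a downtime history against several candidate uptime targets and reports how often each target would have been missed. The history values, the 30-day month, and the function names are assumptions made for illustration; they are not DoubleClick data or tooling.

```python
from typing import Sequence

# Hypothetical monthly downtime history, in minutes (illustrative values only).
MONTHLY_DOWNTIME_MIN = [3, 0, 12, 47, 5, 0, 22, 8, 1, 90, 4, 15]

MINUTES_PER_MONTH = 30 * 24 * 60  # simplifying assumption: every month has 30 days


def allowed_downtime(target_pct: float) -> float:
    """Minutes of downtime a month can absorb before the uptime target is missed."""
    return MINUTES_PER_MONTH * (1 - target_pct / 100)


def breach_rate(history: Sequence[float], target_pct: float) -> float:
    """Fraction of historical months that would have breached the target."""
    budget = allowed_downtime(target_pct)
    breaches = sum(1 for minutes in history if minutes > budget)
    return breaches / len(history)


for target in (100.0, 99.99, 99.9, 99.5):
    rate = breach_rate(MONTHLY_DOWNTIME_MIN, target)
    print(f"{target:6.2f}% uptime: budget {allowed_downtime(target):6.1f} min, "
          f"breached in {rate:.0%} of past months")
```

With this made-up history, a 100% target is missed in most months, while looser targets are missed progressively less often, which is the kind of trade-off such a simulation is meant to expose.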

Aligning external and internal SLAs

One of the biggest challenges we faced was the mismatch between external SLAs (what we promised customers) and internal SLAs (what we measured internally). For example, customers would ask for ad-serving uptime, while our tech team measured server availability.

To solve this, we aligned our external and internal SLOs and set the internal objectives (the targets) very high. This was a huge victory because it allowed us to rely on one set of metrics to understand our SLA risk position and drive operational excellence. Our tech group (Ops, Engineering, etc.) also became more sensitive to the notion of a business SLA and started to care a lot about not breaching one.

Monitoring – the key to SLA success

For availability and performance, we relied on three synthetic products. Internally, we ran Sitescope in 17 data centers, and we used two external synthetic products. We wanted as many data points as possible from as many tools as possible. The stakes were just too high not to invest in multiple tools. This entire SLM project was not cheap to implement and run on an annual basis, but I had learned the hard way what it costs not to do it right.

For monitoring, it became clear we needed to test as often as possible from as many vantage points as possible:

  • If you only check your SLO endpoints once an hour, you wait roughly 59 minutes between checks. With a gap that long, a single failed check can be reported as far more downtime than actually occurred.
  • You also need many data points to ensure statistical significance. Smaller datasets lower precision and power, while larger ones help manage false positives and false negatives.
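A quick calculation shows why check frequency and vantage points matter. The sketch below assumes availability is computed as passed checks divided by total checks over a 30-day month; that formula, the month length, and the function names are assumptions for illustration, not the method used at DoubleClick.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # simplifying assumption: 30-day month


def checks_per_month(interval_min: float, vantage_points: int = 1) -> int:
    """Number of measurements collected in a month."""
    return int(MINUTES_PER_MONTH / interval_min) * vantage_points


def availability_after_failures(total_checks: int, failed_checks: int) -> float:
    """Availability (%) when computed as passed checks / total checks."""
    return 100 * (total_checks - failed_checks) / total_checks


hourly = checks_per_month(interval_min=60)
dense = checks_per_month(interval_min=1, vantage_points=10)

print(f"hourly, 1 location:      {hourly} checks, "
      f"one failure -> {availability_after_failures(hourly, 1):.4f}%")
print(f"every minute, 10 places: {dense} checks, "
      f"one failure -> {availability_after_failures(dense, 1):.4f}%")
```

Under these assumptions, one failed hourly check is enough to drop the month below 99.9%, while the same single failure barely registers when checks run every minute from ten locations.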

Enter Differential Performance Measurement (DPM)

One of our biggest challenges was finding an effective way to measure ad delivery speed and capture it in our SLAs. Clients would notice spikes in their site performance and attribute them to our system, while our own performance telemetry showed no problems. We couldn’t correlate the two charts, so we couldn’t agree on whether the problem was ours or someone else’s.

To address this, we developed a methodology called Differential Performance Measurement (DPM). Our goal was to measure DoubleClick’s performance and availability with precision, and to understand how it affected our customers’ pages. We also wanted to be accountable for what we controlled, so we could avoid blame and finger-pointing.

The methodology added context to the measurements. DPM introduced clarity and comparison, removing absolute performance numbers from the SLAs.

Recipe for Differential Performance Measurement (example with an ad):

  1. Take two pages: one without ads and one with a single ad call.
     • Page A = No ads
     • Page B = One ad
  2. Make sure the pages do not contain any other third-party references (CDNs, etc.).
  3. Make sure the page sizes (in KB) are the same.
  4. “Bake”: measure response times for both pages to get the following metrics:
     • Differential Response (DR) = (Response Time of Page B) minus (Response Time of Page A)
     • Differential Response Percentage (DRP) = DR / (Response Time of Page A). For example, if Page A takes 2 seconds and Page B takes 2.1 seconds, DR is 0.1 second and DRP is 0.1 / 2 = 0.05, or 5%.
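The two metrics in the recipe can be written as a minimal sketch. The function and parameter names are hypothetical; only the formulas and the 2 second / 2.1 second example come from the recipe above.

```python
def differential_response(page_a_s: float, page_b_s: float) -> float:
    """DR: response time of Page B (one ad) minus Page A (no ads), in seconds."""
    return page_b_s - page_a_s


def differential_response_pct(page_a_s: float, page_b_s: float) -> float:
    """DRP: DR expressed as a percentage of Page A's response time."""
    if page_a_s <= 0:
        raise ValueError("Page A response time must be positive")
    return differential_response(page_a_s, page_b_s) / page_a_s * 100


# The worked example from the recipe: Page A = 2 s, Page B = 2.1 s.
dr = differential_response(2.0, 2.1)
drp = differential_response_pct(2.0, 2.1)
print(f"DR = {dr:.2f} s, DRP = {drp:.1f}%")
```

Because both pages are measured from the same agent over the same network path, anything that slows both equally largely cancels out in DR, leaving mostly the overhead of the ad call.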

This approach helped eliminate noise caused by:

  1. Internet-related issues beyond our control (e.g., fiber cuts).
  2. Monitoring agent inconsistencies (raising the need to monitor our monitoring tools).
  3. Other third-party dependencies.

To visualize the impact of Differential Performance Measurement (DPM), the chart below compares response times for two scenarios:

[Chart: response times over time for Scenario 1 and Scenario 2]

Scenario 1: The ad-serving company experienced performance issues, which negatively impacted the customer’s site. The vendor breached the SLA threshold between Time 4 and Time 8.

Scenario 2: The website itself encountered performance problems, unrelated to the ad-serving company.
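The two scenarios can be told apart programmatically by checking DRP against a threshold at each measurement. The sketch below uses made-up response times and a hypothetical 10% DRP threshold; the series are chosen so that Scenario 1 breaches from Time 4 through Time 8, as described above, but they are not read from the original chart.

```python
from typing import List, Sequence

DRP_THRESHOLD_PCT = 10.0  # hypothetical SLA threshold, for illustration only


def breached_times(page_a: Sequence[float], page_b: Sequence[float],
                   threshold_pct: float = DRP_THRESHOLD_PCT) -> List[int]:
    """Return the 1-based time indices where DRP exceeds the threshold."""
    breached = []
    for t, (a, b) in enumerate(zip(page_a, page_b), start=1):
        drp = (b - a) / a * 100
        if drp > threshold_pct:
            breached.append(t)
    return breached


# Scenario 1: the ad call slows down; the ad-free page stays steady.
s1_page_a = [2.0] * 10
s1_page_b = [2.1, 2.1, 2.1, 2.6, 2.9, 3.0, 2.8, 2.5, 2.1, 2.1]

# Scenario 2: the site itself slows down; the ad overhead stays at 0.1 s.
s2_page_a = [2.0, 2.0, 2.0, 2.5, 2.8, 2.9, 2.7, 2.4, 2.0, 2.0]
s2_page_b = [a + 0.1 for a in s2_page_a]

print("Scenario 1 breaches at times:", breached_times(s1_page_a, s1_page_b))
print("Scenario 2 breaches at times:", breached_times(s2_page_a, s2_page_b))
```

In the second scenario both pages get slower together, so the differential stays small and no breach is attributed to the ad vendor even though absolute response times spike.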

Reporting – Transparency and Accountability

After the $1 million penalty, SLA management became a top priority, with visibility extending all the way to the CEO. We reported monthly on compliance and breaches, using tools like DigitalFuel to detect issues in real time.

By the end of 2001, we were tracking over 100 Operational Level Agreements (OLAs), and a Culture of Quality had emerged at DoubleClick. Everyone—from engineers to executives—was aligned around business service metrics, and no one wanted to breach an SLA.

Lessons learned and the road ahead

Implementing a comprehensive SLM process at DoubleClick allowed us to:

  • Manage hundreds of contracts with up to five SLOs each.
  • Offer scalable SLAs that could adapt to new products.
  • Reduce financial risks by avoiding costly penalties.
  • Maintain our reputation by providing accurate and meaningful SLAs.
  • Detect breaches in real time, allowing us to take proactive measures.

One of the biggest advantages was knowing in advance when an SLA was at risk. For example, we could predict that adding four minutes of downtime would breach 12 contracts and result in $X in penalties. This insight helped our Ops team act—pausing releases or preventing any changes that could impact uptime.
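A prediction of this kind can be sketched as a lookup over the contract database. Everything specific below is hypothetical: the customers, targets, flat per-breach penalties, the 30-day month, and the `contracts_at_risk` helper are illustrative assumptions, not the actual DoubleClick system.

```python
from dataclasses import dataclass
from typing import List

MINUTES_PER_MONTH = 30 * 24 * 60  # simplifying assumption: 30-day month


@dataclass
class Contract:
    customer: str             # hypothetical customer label
    uptime_target_pct: float  # monthly uptime SLO, e.g. 99.9
    penalty_usd: float        # assumed flat penalty per breached month


def contracts_at_risk(contracts: List[Contract], downtime_so_far_min: float,
                      extra_downtime_min: float) -> List[Contract]:
    """Contracts still within budget now that would breach after the extra downtime."""
    projected = downtime_so_far_min + extra_downtime_min
    at_risk = []
    for c in contracts:
        budget = MINUTES_PER_MONTH * (1 - c.uptime_target_pct / 100)
        if downtime_so_far_min <= budget < projected:
            at_risk.append(c)
    return at_risk


contracts = [
    Contract("A", 99.99, 25_000),
    Contract("B", 99.95, 10_000),
    Contract("C", 99.9, 5_000),
    Contract("D", 99.5, 1_000),
]

# 20 minutes of downtime so far this month; what would 4 more minutes cost?
newly_breached = contracts_at_risk(contracts, downtime_so_far_min=20,
                                   extra_downtime_min=4)
print([c.customer for c in newly_breached],
      sum(c.penalty_usd for c in newly_breached))
```

Running the same query before a risky release gives Ops a concrete number to weigh against the value of shipping now.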

Some people dismiss SLAs, and in many cases, that skepticism is justified. Bad SLAs—those with unrealistic guarantees, no real penalties, or vague measurement criteria—undermine trust. I often see SLAs promising 0% packet loss, but when you ask how it’s measured, you quickly realize it’s meaningless. These kinds of SLAs give the entire concept a bad reputation.

However, when done right, SLAs are essential. They align customers and vendors, reduce friction, and eliminate blame games. That said, customers need to demand useful SLAs—not just ones that sound good on paper. The goal isn’t to drive vendors out of business but to hold them accountable. If they fail to deliver, they should feel the impact.

The evolution of SLAs

Back in 2001, we knew SLA management was critical, but could we have predicted how integral it would become in today’s cloud-driven world? SLAs have evolved from simple uptime guarantees to complex agreements that cover everything from latency to data residency. XLOs (Experience Level Objectives) have emerged: metrics that focus on the customer’s experience, not just the server’s performance. This shift in focus, from internal metrics to customer outcomes, is the future of performance management.

Stay tuned for Part 2, where we’ll explore how businesses can align their internal metrics with what truly matters: the customer’s experience.

Learn more

New to SLAs, SLOs, and SLIs? Read this post to learn the fundamentals, best practices, and how they impact service reliability.
