Monitoring and QA are the Same Thing? (Part 3)
2022-11-23 · via Comments for zhengziying.com

In Part 2, I pointed out that live site monitoring is about answering two questions, “is everything working?” and “is everything healthy?”, and that invariants, metrics/logs and synthetic transactions are the three ways to find the answers.

Besides knowing how to find the answers, those who build, use or improve live site monitoring also need to consider the following four aspects:

I. Knowing what has changed in production helps monitoring answer “is everything working” and “is everything healthy” more effectively.

Some changes are triggered by humans, for example a code change or a configuration change. Empirical data shows that about two thirds of live site incidents were triggered by code changes and configuration changes (including pre-existing bugs that stay dormant until a new version rollout or a configuration change surfaces them). So after rolling out a new version or flipping a configuration setting on or off, we want the answers as soon as possible (it is fine and reasonable to make a quick preliminary assessment first, then take a bit more time for a fuller one). Similarly, when a live site issue is fixed, we want the answer ASAP to confirm the fix. With human-triggered changes we know the delta (in code or in configuration), so we can look first at the areas related to the delta when trying to answer “is everything working” and “is everything healthy”.

Some changes are not triggered by humans. They happen naturally as time goes by, or under the influence of the external environment. For example: the message queue grows as incoming traffic picks up; the data store grows over time; a system slowly leaks resources; passwords and certificates expire; customer accounts are deleted by an automatic backend job after 90 days of non-payment. Some of these changes can build up very quickly, such as the surge in simultaneous connections during the men’s ice hockey semifinal between the USA and Canada at the 2014 Winter Olympics. Knowing what is changing even when no one is touching the system helps target the monitoring more precisely.

II. Understand the confidence level of the answer.

Anyone can give an answer to the question “is everything working”. I could answer “yes” out of ignorance (“I haven’t heard about any incidents”). That’s a legitimate answer, but a very low-confidence one. When live site monitoring answers the questions “is everything working” and “is everything healthy”, we need the answers to carry higher confidence, to reduce false alarms (false positives).

The bar for “high confidence” may vary. For example, we may tune the monitoring system to be a bit more conservative at night, so that we don’t wake people up too easily or prematurely. We could be more aggressive (lowering the bar) during working hours or on special occasions (e.g. the Winter Olympics).
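As a rough illustration of a varying bar, the sketch below picks how many consecutive bad readings are required before paging someone, depending on the hour and on whether a special event is running. The function name, the hour boundaries and the counts are all assumptions made up for this example, not values from any real system.

```python
def required_confirmations(hour: int, special_event: bool = False) -> int:
    """Return how many consecutive bad readings are needed before paging.

    All numbers here are illustrative assumptions, not recommendations.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if special_event:
        return 1  # most aggressive: page on the first bad reading
    if 9 <= hour < 18:
        return 2  # working hours: lower bar, people are already awake
    return 5      # night: higher bar, avoid waking people prematurely
```

With these assumed values, a breach at 3 a.m. needs five consecutive bad readings before anyone is woken up, while the same breach during a special event pages on the first reading.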

Time is the key factor for a monitoring system to gain confidence in the answer (either positive or negative). A slow-building issue usually takes hours or days to confirm. To tell a jump from a single spike, the system needs to collect data for a bit longer. In live site monitoring, we often do a quick sanity check to give a preliminary answer (low confidence), then spend more time to be sure (higher confidence).
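Here is a minimal sketch of the jump-versus-spike distinction, assuming a fixed threshold and a hypothetical `confirm` parameter (the number of consecutive samples above the threshold needed to call something a jump). Real systems use more elaborate statistics; this only shows why more confidence costs more time.

```python
from typing import Sequence


def classify_shift(samples: Sequence[float], threshold: float, confirm: int) -> str:
    """Classify a metric series as 'normal', 'spike', 'pending' or 'jump'.

    'pending' is the low-confidence preliminary answer: the metric is above
    the threshold right now, but not yet for long enough to rule out a spike.
    """
    # Count how many of the most recent samples in a row exceed the threshold.
    run = 0
    for value in reversed(samples):
        if value > threshold:
            run += 1
        else:
            break
    if run >= confirm:
        return "jump"      # sustained long enough: high confidence
    if run > 0:
        return "pending"   # too early to tell a jump from a spike
    if any(value > threshold for value in samples):
        return "spike"     # went above the threshold, then came back
    return "normal"
```

A larger `confirm` gives fewer false alarms but delays the "jump" verdict by that many sampling intervals.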

In short, it takes time to get higher confidence. That’s why a shorter MTTD (mean time to detect) and a low noise ratio generally pull against each other. That seems pretty obvious here, but in reality many people forget it in day-to-day work, especially in complex contexts. I have seen this as a common pitfall: people make designs that try to get a shorter MTTD and a lower noise ratio at the same time. Leadership sometimes challenges the team to improve both. It’s not unachievable, but it is harder than most people think.

III. Understand the different levels of turnaround time (usually referred to as MTTD, mean time to detect), and understand which monitoring approach (invariants, metrics/logs or synthetic transactions) to invest in, either to move to a higher level of responsiveness or to improve within the same level.

The basic level of turnaround time is knowing about the issue after customers have run into it. Synthetic transactions may not be the best place to invest if we want to shorten the time between the first several customers hitting the issue and us knowing about it. Instead, we should rely more on detecting anomalies and outliers in aggregated logs.
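One simple way to detect outliers in aggregated logs is sketched below. It assumes log records have already been reduced to `(minute, is_error)` pairs, and it flags minutes whose error rate exceeds a baseline by a margin. The record shape, the function names and the parameters `k` and `min_delta` are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def error_rate_per_minute(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Aggregate (minute, is_error) log records into an error rate per minute."""
    total: Counter = Counter()
    errors: Counter = Counter()
    for minute, is_error in records:
        total[minute] += 1
        if is_error:
            errors[minute] += 1
    return {m: errors[m] / total[m] for m in sorted(total)}


def find_outliers(rates: Dict[int, float], baseline_minutes: int,
                  k: float = 3.0, min_delta: float = 0.02) -> List[int]:
    """Return minutes after the baseline window whose error rate is anomalous.

    The limit is baseline mean + max(k * stddev, min_delta); min_delta keeps a
    perfectly flat baseline from flagging every tiny fluctuation.
    """
    minutes = sorted(rates)
    baseline = [rates[m] for m in minutes[:baseline_minutes]]
    limit = mean(baseline) + max(k * pstdev(baseline), min_delta)
    return [m for m in minutes[baseline_minutes:] if rates[m] > limit]
```

Because the signal comes from real customer traffic, this approach can only fire after some customers have already hit the issue.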

It is better if we know about the issue before it affects any customer. That’s a much better turnaround time. To get ahead of the customers, we must use the synthetic transaction approach. The other two approaches (invariants and metrics/logs) cannot help when there is no customer impact yet. However, as pointed out in Part 2, synthetic transactions can become very expensive if we want them to cover more granular cases. So, to balance cost and benefit, it is more practical and realistic to invest only in catching major problems ahead of customers, and to let issues in granular cases stay there until some customers are affected. In other words, catching all live site issues ahead of customers should not be the North Star.
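A synthetic transaction can be thought of as a scripted customer action that we run ourselves and time. The sketch below assumes the action is wrapped in a hypothetical `transaction` callable that raises an exception on failure. What the callable does (log in, place a test order, and so on) is up to the service and is not shown here.

```python
import time
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    latency_s: float
    detail: str


def run_synthetic_transaction(transaction: Callable[[], None],
                              max_latency_s: float) -> ProbeResult:
    """Run one scripted transaction and report whether it worked and how fast."""
    start = time.monotonic()
    try:
        transaction()
    except Exception as exc:  # any failure counts as "not working"
        return ProbeResult(False, time.monotonic() - start, f"failed: {exc!r}")
    elapsed = time.monotonic() - start
    if elapsed > max_latency_s:
        # It worked, but too slowly: treat it as "not healthy".
        return ProbeResult(False, elapsed, "too slow")
    return ProbeResult(True, elapsed, "ok")
```

Each additional scenario needs its own scripted callable and its own schedule, which is where the cost of covering granular cases comes from.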

Some may ask: shouldn’t the ideal turnaround time be negative, detecting the issue even before it exists in production? Of course that’s even better. But that is no longer live site monitoring. Preventing issues from getting into the live site is a QA responsibility, a software testing matter.

IV. How will the answers be delivered?

Many people equate this with sending alerts. But sending alerts is just one way to deliver the answers to “is it working” and “is it healthy”. There are many other delivery mechanisms. Some groups have a big flat-screen TV in their hallway that shows a bunch of real-time numbers and charts. When an issue happens, the numbers turn red or start flashing and the bar or line in the chart shoots up, so people walking by notice it. Such a flat-screen TV is also a mechanism for delivering the answer. Sometimes the answer is delivered without being requested: when some thresholds are breached at night, the on-call person gets called.

The differences between the delivery mechanisms are:

  1. Is the delivery guaranteed? A flat-screen TV is not a guaranteed delivery, since we can’t make sure the right people will walk by and notice the red numbers. Emails and text messages are not guaranteed delivery either, because people may not be checking them all the time. Calling the on-call person’s cell phone is a guaranteed delivery: if the on-call person does not answer, the call is retried multiple times, then falls back to the secondary on-call and the backup on-call, until someone answers the phone.
  2. Is the delivery explicit or implicit? Implicit: no news is good news. Explicit: we still want to keep receiving good news (“yes, everything has been working fine in the last 30 minutes”), for peace of mind about the delivery channel (so we don’t have to wonder whether there is no bad news, or the bad news got lost on its way).
  3. How soon is the answer delivered? It depends on factors including how bad the issue is (severity, impact assessment) and how sure we are about it (confidence level). Usually there is a conflict between confidence level and speed: we can send alerts aggressively, at the cost of a high noise ratio, or we can tune the monitoring system to wait until it is very sure about the issue before sending alerts, at the cost of a longer MTTD (mean time to detect).
  4. Who is the recipient? A phone call is a guaranteed delivery, but only one person receives it at a time. Emails can be sent to a group of people of our choice. Flashing screens in the hallway will be seen, but we don’t know exactly who will see them. We also want the message to get delivered to the right people: those who need to know and/or can do something about it.
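The guaranteed and explicit delivery ideas from the list above can be sketched as follows. `place_call` is a hypothetical callable standing in for whatever paging system is in use (it returns True when the person answers), and the retry count and heartbeat gap are assumed values.

```python
from typing import Callable, Optional, Sequence


def page_until_acknowledged(chain: Sequence[str],
                            place_call: Callable[[str], bool],
                            retries_per_person: int = 3) -> Optional[str]:
    """Call each person in the escalation chain until someone answers.

    Returns the name of the person who answered, or None if nobody did.
    """
    for person in chain:  # e.g. primary, secondary, backup on-call
        for _ in range(retries_per_person):
            if place_call(person):
                return person
    return None


def heartbeat_overdue(last_heartbeat: float, now: float, max_gap: float) -> bool:
    """Explicit delivery check: True if the "all good" message is overdue.

    An overdue heartbeat means silence can no longer be read as good news.
    """
    return now - last_heartbeat > max_gap
```

For example, with a 30-minute heartbeat (`max_gap=1800` seconds), a channel that has been silent for 1900 seconds should itself be treated as a problem.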

Among all these ways, none is the best. There is only the right way for a given situation. “Right” = delivering the message to the right audience, with the right balance between shorter delay and higher confidence level, and with the right level of detail and actionable data.


Summary: this blog series (Part 1, Part 2, Part 3) captures the mental model that I use to look at live site monitoring. This model helps me better see the context of each monitoring-related topic: where it sits in the bigger picture and how the topics relate to each other.

//the end