惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

有赞技术团队
有赞技术团队
MyScale Blog
MyScale Blog
The Hacker News
The Hacker News
Google DeepMind News
Google DeepMind News
The Cloudflare Blog
GbyAI
GbyAI
Vercel News
Vercel News
量子位
Apple Machine Learning Research
Apple Machine Learning Research
Recent Announcements
Recent Announcements
美团技术团队
D
DataBreaches.Net
H
Help Net Security
大猫的无限游戏
大猫的无限游戏
人人都是产品经理
人人都是产品经理
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
B
Blog RSS Feed
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
Y
Y Combinator Blog
S
Secure Thoughts
S
SegmentFault 最新的问题
The Last Watchdog
The Last Watchdog
Jina AI
Jina AI
Security Archives - TechRepublic
Security Archives - TechRepublic
F
Fortinet All Blogs
C
Check Point Blog
小众软件
小众软件
阮一峰的网络日志
阮一峰的网络日志
Schneier on Security
Schneier on Security
MongoDB | Blog
MongoDB | Blog
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
Stack Overflow Blog
Stack Overflow Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
Hacker News: Ask HN
Hacker News: Ask HN
博客园 - 【当耐特】
Simon Willison's Weblog
Simon Willison's Weblog
Scott Helme
Scott Helme
S
Security @ Cisco Blogs
SecWiki News
SecWiki News
Hugging Face - Blog
Hugging Face - Blog
博客园 - 叶小钗
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Google Online Security Blog
Google Online Security Blog
S
Securelist
L
LINUX DO - 最新话题
Forbes - Security
Forbes - Security
D
Darknet – Hacking Tools, Hacker News & Cyber Security
I
InfoQ
Engineering at Meta
Engineering at Meta

博客园 - 高天蒲

前端 mysql常见问题 互联网数据的挖掘和分析 针对wordpress的二次开发 datejs lib awk python 复制文件 PHP笔记 grep 笔记 svn笔记 网页内容爬取:如何提取正文内容 快速开发的要素 Javascript 类、命名空间、代码组织 武士与黑客 ImageMagick Errors: Convert PDF to Images web.py学习随笔 HTML5 and CSS3 做前端的一些小工具 提高勾搭的成功几率(瞎写的,打扰了)
Optimal Logging
高天蒲 · 2013-09-22 · via 博客园 - 高天蒲

How long does it take to find the root cause of a failure in your system? Five minutes? Five days? If you answered close to five minutes, it’s very likely that your production system and tests have great logging. All too often, seemingly unessential features like logging, exception handling, and (dare I say it) testing are an implementation afterthought. Like exception handling and testing, you really need to have a strategy for logging in both your systems and your tests. Never underestimate the power of logging. With optimal logging, you can even eliminate the necessity for debuggers. Below are some guidelines that have been useful to me over the years.

Channeling Goldilocks

Never log too much. Massive, disk-quota burning logs are a clear indicator that little thought was put in to logging. If you log too much, you’ll need to devise complex approaches to minimize disk access, maintain log history, archive large quantities of data, and query these large sets of data. More importantly, you’ll make it very difficult to find valuable information in all the chatter.

The only thing worse than logging too much is logging too little. There are normally two main goals of logging: help with bug investigation and event confirmation. If your log can’t explain the cause of a bug or whether a certain transaction took place, you are logging too little.

Good things to log: 

  • Important startup configuration
  • Errors
  • Warnings
  • Changes to persistent data
  • Requests and responses between major system components
  • Significant state changes
  • User interactions
  • Calls with a known risk of failure
  • Waits on conditions that could take measurable time to satisfy
  • Periodic progress during long-running tasks
  • Significant branch points of logic and conditions that led to the branch
  • Summaries of processing steps or events from high level functions - Avoid logging every step of a complex process in low-level functions.

Bad things to log: 

  • Function entry - Don’t log a function entry unless it is significant or logged at the debug level.
  • Data within a loop - Avoid logging from many iterations of a loop. It is OK to log from iterations of small loops or to log periodically from large loops.
  • Content of large messages or files - Truncate or summarize the data in some way that will be useful to debugging.
  • Benign errors - Errors that are not really errors can confuse the log reader. This sometimes happens when exception handling is part of successful execution flow.
  • Repetitive errors - Do not repetitively log the same or similar error. This can quickly fill a log and hide the actual cause. Frequency of error types is best handled by monitoring. Logs only need to capture detail for some of those errors.

There is More Than One Level

Don't log everything at the same log level. Most logging libraries offer several log levels, and you can enable certain levels at system startup. This provides a convenient control for log verbosity.

The classic levels are: 

  • Debug - verbose and only useful while developing and/or debugging.
  • Info - the most popular level.
  • Warning - strange or unexpected states that are acceptable.
  • Error - something went wrong, but the process can recover.
  • Critical - the process cannot recover, and it will shutdown or restart.

Practically speaking, only two log configurations are needed: 

  • Production - Every level is enabled except debug. If something goes wrong in production, the logs should reveal the cause.
  • Development & Debug - While developing new code or trying to reproduce a production issue, enable all levels.

Test Logs Are Important Too

Log quality is equally important in test and production code. When a test fails, the log should clearly show whether the failure was a problem with the test or production system. If it doesn't, then test logging is broken.

Test logs should always contain: 

  • Test execution environment
  • Initial state
  • Setup steps
  • Test case steps
  • Interactions with the system
  • Expected results
  • Actual results
  • Teardown steps

Conditional Verbosity With Temporary Log Queues

When errors occur, the log should contain a lot of detail. Unfortunately, detail that led to an error is often unavailable once the error is encountered. Also, if you’ve followed advice about not logging too much, your log records prior to the error record may not provide adequate detail. A good way to solve this problem is to create temporary, in-memory log queues. Throughout processing of a transaction, append verbose details about each step to the queue. If the transaction completes successfully, discard the queue and log a summary. If an error is encountered, log the content of the entire queue and the error. This technique is especially useful for test logging of system interactions.

Failures and Flakiness Are Opportunities

When production problems occur, you’ll obviously be focused on finding and correcting the problem, but you should also think about the logs. If you have a hard time determining the cause of an error, it's a great opportunity to improve your logging. Before fixing the problem, fix your logging so that the logs clearly show the cause. If this problem ever happens again, it’ll be much easier to identify.

If you cannot reproduce the problem, or you have a flaky test, enhance the logs so that the problem can be tracked down when it happens again.

Using failures to improve logging should be used throughout the development process. While writing new code, try to refrain from using debuggers and only use the logs. Do the logs describe what is going on? If not, the logging is insufficient.

Might As Well Log Performance Data

Logged timing data can help debug performance issues. For example, it can be very difficult to determine the cause of a timeout in a large system, unless you can trace the time spent on every significant processing step. This can be easily accomplished by logging the start and finish times of calls that can take measurable time: 

  • Significant system calls
  • Network requests
  • CPU intensive operations
  • Connected device interactions
  • Transactions

Following the Trail Through Many Threads and Processes

You should create unique identifiers for transactions that involve processing across many threads and/or processes. The initiator of the transaction should create the ID, and it should be passed to every component that performs work for the transaction. This ID should be logged by each component when logging information about the transaction. This makes it much easier to trace a specific transaction when many transactions are being processed concurrently.

Monitoring and Logging Complement Each Other

A production service should have both logging and monitoring. Monitoring provides a real-time statistical summary of the system state. It can alert you if a percentage of certain request types are failing, it is experiencing unusual traffic patterns, performance is degrading, or other anomalies occur. In some cases, this information alone will clue you to the cause of a problem. However, in most cases, a monitoring alert is simply a trigger for you to start an investigation. Monitoring shows the symptoms of problems. Logs provide details and state on individual transactions, so you can fully understand the cause of problems.