惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

博客园 - 【当耐特】
Help Net Security
Help Net Security
P
Proofpoint News Feed
J
Java Code Geeks
爱范儿
爱范儿
Last Week in AI
Last Week in AI
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
F
Full Disclosure
Google DeepMind News
Google DeepMind News
H
Help Net Security
G
Google Developers Blog
Jina AI
Jina AI
Vercel News
Vercel News
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
L
Lohrmann on Cybersecurity
S
Schneier on Security
Microsoft Azure Blog
Microsoft Azure Blog
IT之家
IT之家
Security Archives - TechRepublic
Security Archives - TechRepublic
阮一峰的网络日志
阮一峰的网络日志
N
News and Events Feed by Topic
GbyAI
GbyAI
B
Blog
O
OpenAI News
博客园_首页
Cisco Talos Blog
Cisco Talos Blog
K
KPMG report finds enterprise disconnect between AI and its ROI | CIO
Hacker News: Ask HN
Hacker News: Ask HN
TaoSecurity Blog
TaoSecurity Blog
腾讯CDC
MongoDB | Blog
MongoDB | Blog
M
MIT News - Artificial intelligence
C
Cybersecurity and Infrastructure Security Agency CISA
Cyberwarzone
Cyberwarzone
Webroot Blog
Webroot Blog
Simon Willison's Weblog
Simon Willison's Weblog
Y
Y Combinator Blog
C
Cisco Blogs
A
Arctic Wolf
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
T
The Exploit Database - CXSecurity.com
Security Latest
Security Latest
AI
AI
W
WeLiveSecurity
aimingoo的专栏
aimingoo的专栏
The Register - Security
The Register - Security
Project Zero
Project Zero
H
Hackread – Cybersecurity News, Data Breaches, AI and More
N
Netflix TechBlog - Medium
Blog — PlanetScale
Blog — PlanetScale

CMTOPS.DEV

CMTOPS.DEV CMTOPS.DEV CMTOPS.DEV CMTOPS.DEV CMTOPS.DEV
CMTOPS.DEV
Timofey Chuchkanov · 2025-09-21 · via CMTOPS.DEV

Analyzing NGINX Logs — SQL Observability pt. 2

~5 min read

Updated on

This post is a short note on my battle-tested dashboard for analyzing NGINX access logs that I use at work.

Contents

  • Intro
  • A Weird Weekend
  • Panels Overview
    • Response Codes Over Time
    • Total Number of Requests
    • Top 10 Countries
    • Error Rate
    • Unparsed Log Lines
    • HTTP Requests by Status Code
    • HTTP Requests by Method
    • Top N URLs
    • Access Logs
  • Implementation Notes
  • Summary
  • Endnotes

Intro

Some time ago, I wrote a long post on building an observability stack at my workplace using ClickHouse and Grafana. [1]

There, I made a promise to write on how I analyze NGINX access logs at “some point in the future.” I didn’t want to write about something I don’t use, or that is half-baked, so I let it unfold organically. Now, the time has come.

The NGINX access dashboard is fairly complete at its current state, and undergoes changes quite rarely. They’re cosmetic, mostly.

This project proved to be useful many times throughout a span of half a year. Which means, it’s not a mere experiment anymore.

To demonstrate how this thing can be useful, let me tell you a little story happened to me.

A Weird Weekend

One Saturday morning, I opened my eyes and habitually checked my phone for notifications. A Zabbix alert — one of the sites is down.

Trying to verify. The website isn’t loading.

What the heck happened?!

Next moment, I found myself with my upper body hanging down the bed, swiftly typing on my laptop placed on the floor. Quickly examined the server — resources usage OK. Restarting PHP, no use.

Logging into Grafana. A sudden traffic spike, tons of server-side errors. An ad campaign? No, doesn’t seem like it.

Changing the time range to grab the last 12 hours.

Huh! Here it is, the errors first appeared at night, but the website didn’t go down completely until now.

Grafana last 12 hours error spikes

Is that a DDoS? But we have an anti-DDoS protection 🤔.

Playing around with dashboard filters, analyzing panels.

Identified the root cause. A bunch of bots sending requests of the format ^/\?[A-Za-z0-9]{4}=[A-Za-z0-9]{4}$.

URL Example

Connecting to ClickHouse using CLI, dumping a list of IP addresses, piping ‘em down AWK formatting as NGINX deny statements. Reloading reverse proxy config…

The site is back in operation. Phew 😤.

Trying to contact colleagues who can reach out to our anti-DDoS protection service. About an hour later, the botnet’s traffic is blocked on the network level.

Monday morning. Dumping Saturday’s DDoS request stats by countries in Markdown format using ClickHouse’s FORMAT statement. Forming a security incident report.

The overall attack lasted about 24 hours, starting at night. Offending requests came from over 20 countries. Most of them were blocked by our anti-DDoS provider.

To be honest, I had a ton of fun despite a rush of adrenaline trying to figure out what’s going on at the time. Moments like this bring unmatched levels of satisfaction to my life.

Love my job ❤️.

Panels Overview

Now, here’s a short summary of what you’ll find on this dashboard.

Response Codes Over Time

Stacked bars visualizing hits colored by HTTP status code. Time binning is dynamic. Response Code Over Time panel

Total Number of Requests

Overall hits. Total Number of Requests panel

Top 10 Countries

A pie chart with the top 10 countries. Top 10 Countries panel

Error Rate

Stacked bars chart with user- and server-side error rate calculated using a errors / hits * 100 formula. Dynamic time binning. Error Rate panel

Unparsed Log Lines

How many log lines the log collector couldn’t process, and ingested as-is. They are determined by a null HTTP status. Unparsed Log Lines panel

HTTP Requests by Status Code

A bar chart depicting a number of requests grouped by an HTTP status code. HTTP Requests by Status Code panel

HTTP Requests by Method

Also a bar chart, but showing a number of HTTP requests grouped by an HTTP method. HTTP Requests by Method panel

Top N URLs

Top N URLs based on the drop-down filter at the top of the dashboard. The range is 5..20 with a step of 5. Top N URLs panel

Access Logs

Last 1000 access logs sorted newest to oldest. It’s in the Logs row at the bottom of the dashboard. Access Logs panel

Implementation Notes

The dashboard contains WHERE filters inside of panels’ queries. They are based on variables found in the dashboard settings.

WHERE
    $__timeFilter(time)
    ...
    AND if('${response_status:raw}' = 'ALL_STATUSES', true, status IN array(${response_status:raw}))
    AND if('${countries:raw}' = 'ALL_COUNTRIES', true, country IN array(${countries}))
    AND if('${url_pattern}' = '', true, match(url, '${url_pattern}'))
    ...

All of those ifs are there for a reason — that’s a performance optimization. Remove them, and queries will take longer.

${} are macros from the Grafana ClickHouse plugin. You can learn more about them on the plugin’s page.

In the ClickHouse table several columns are materialized in order to shift processing from log collector to the database itself.

CREATE TABLE nginx.access
(
    ...
    `_id`         FixedString(26)  DEFAULT      generateULID(),
    `country`     String           MATERIALIZED coalesce(nullIf(lookupIPv4(client, 'country_name'), ''), 'NONE'),
    `url`         String           MATERIALIZED regexpExtract(request, '^([A-Z]+)\\s(.*)\\s+', 2),
    `method`      String           MATERIALIZED regexpExtract(request, '^([A-Z]+)\\s(.*)\\s+', 1)
    ...
)
ENGINE = MergeTree
ORDER BY (timestamp, host, site_name)
TTL timestamp + INTERVAL 15 DAY

👆 The country materialized column is calculated by extracting geoip data as I wrote in this article.

Summary

Now you have a basic idea of how such a dashboard can help you in your daily operations, and can implement it using the code provided in this article.

Endnotes


Thank you for reading, and have a good rest of your day! (^ ~ ^ )

If you have any questions/suggestions, or found an error, contact me!