惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

美团技术团队
TaoSecurity Blog
TaoSecurity Blog
Recorded Future
Recorded Future
WordPress大学
WordPress大学
Stack Overflow Blog
Stack Overflow Blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
A
Arctic Wolf
Last Week in AI
Last Week in AI
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
C
CERT Recently Published Vulnerability Notes
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Simon Willison's Weblog
Simon Willison's Weblog
博客园 - 聂微东
有赞技术团队
有赞技术团队
GbyAI
GbyAI
博客园 - Franky
D
Docker
H
Hackread – Cybersecurity News, Data Breaches, AI and More
P
Privacy & Cybersecurity Law Blog
AWS News Blog
AWS News Blog
T
The Blog of Author Tim Ferriss
G
GRAHAM CLULEY
宝玉的分享
宝玉的分享
Vercel News
Vercel News
T
Tailwind CSS Blog
P
Palo Alto Networks Blog
Latest news
Latest news
阮一峰的网络日志
阮一峰的网络日志
S
Securelist
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
C
Check Point Blog
Google DeepMind News
Google DeepMind News
The Cloudflare Blog
A
About on SuperTechFans
T
Threat Research - Cisco Blogs
Jina AI
Jina AI
The Hacker News
The Hacker News
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
S
Security Affairs
V
Visual Studio Blog
Security Archives - TechRepublic
Security Archives - TechRepublic
J
Java Code Geeks
Cyberwarzone
Cyberwarzone
K
Kaspersky official blog
IT之家
IT之家
Project Zero
Project Zero
博客园 - 【当耐特】
Scott Helme
Scott Helme
罗磊的独立博客
人人都是产品经理
人人都是产品经理

博客园 - Lemo_wd

flutter —— iOS 的构建与分发 flutter —— 理解 iOS 的依赖管理 wireguard 入门 k3s 基础 —— 将 traefik 替换为 ingress-nginx Pr 入门 为什么 TCP 是3次握手4次挥手? 给 ssh 私钥文件配置统一口令 滚动部署零碎笔记1 Clickhouse 物化视图与投影 Clickhouse 主键索引 Clickhouse 核心概念 linux Tun/Tap 虚拟网卡 Clickhouse 的 kafka Engine 集成 Clickhouse 表引擎 —— MergeTree 局域网中的设备间的流量转发 d3.js 构建股权架构图并绘制双向节点树 final cut pro 入门 d3.js 构建股权架构图并绘制股权百分比 flutter 的滚动控制 —— 滚动类组件的内部与整体滚动
ClickHouse 案例学习 —— 使用 ClickHouse 探索 GitHub 统计信息
Lemo_wd · 2025-06-09 · via 博客园 - Lemo_wd

来源 使用 ClickHouse 探索 GitHub 统计信息

统计数量与大小

SELECT formatReadableQuantity(sum(rows)) AS rows,
       formatReadableSize(sum(data_compressed_bytes)) AS compressedBytes,
       formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressedBytes
FROM system.parts
WHERE database = 'github' AND table = 'events';

自动分组

-- ClickHouse 会自动识别 SELECT 子句中的非聚合列(在这个例子中是 event_type),然后对这些列进行分组
SELECT event_type, count() AS count, formatReadableQuantity(count)
FROM github.events
GROUP BY ALL

字符串分割函数

-- 按字符串分割
select splitByString('::', 'a::b::c')

-- 按空白字符分割
select splitByWhitespace('hello world test') 

-- 按正则分割
select splitByRegexp('[,;]', 'a,b;c') 

日期转换函数

-- 转成天
SELECT toDate(created_at) AS day, count() AS dailyCount
FROM github.events 
WHERE event_type = 'WatchEvent' AND repo_name = 'deepseek-ai/DeepSeek-R1'
GROUP BY ALL
ORDER BY day;
-- 转成周
SELECT toDayOfWeek(created_at) AS year, event_type, count() AS stars
FROM github.events
GROUP BY ALL
ORDER BY year ASC;

请参考 处理日期和时间的函数
例如:toStartOfYear

窗口函数

-- 窗口函数在聚合完成后执行,可以引用聚合的结果

-- 每个类型的star数量的排名
SELECT toDayOfWeek(created_at) AS dayOfWeek, event_type, 
         count() AS stars, 
         row_number() OVER(PARTITION BY event_type ORDER BY stars DESC) AS rank
  FROM github.events
  WHERE event_type IN (
    'ForkEvent', 'IssuesEvent', 'WatchEvent', 'PushEvent', 'PullRequestEvent'
  )  
  GROUP BY ALL
  ORDER BY dayOfWeek ASC
-- 每个类型中当前star数量与最大star数量的相对大小
WITH eventTypeStars AS (
  SELECT toDayOfWeek(created_at) AS dayOfWeek, event_type, 
         count() AS stars, 
         max(stars) OVER(PARTITION BY event_type) AS maxDay
  FROM github.events
  WHERE event_type IN (
    'ForkEvent', 'IssuesEvent', 'WatchEvent', 'PushEvent', 'PullRequestEvent'
  )  
  GROUP BY ALL
  ORDER BY dayOfWeek ASC
)

SELECT *, stars/maxDay AS ratio
FROM eventTypeStars;

视图

物化视图

-- SummingMergeTree 表
CREATE TABLE repo_stars (
  repo_name String,
  count UInt32
)
ENGINE = SummingMergeTree 
ORDER BY (repo_name)

-- 视图
CREATE MATERIALIZED VIEW github.repo_stars_mv
TO github.repo_stars AS 
SELECT repo_name, count() as count 
FROM github.events
WHERE event_type = 'WatchEvent' 
GROUP BY repo_name

-- 查询(由于后台 merge,需要  sum(count) 
SELECT repo_name
	FROM github.repo_stars
	GROUP BY repo_name
	ORDER BY sum(count) DESC
	LIMIT 1000

可刷新的物化视图

-- MergeTree 表
CREATE TABLE github.top_repos
(
	`repo_name` String,
	`stars` UInt64
)
ENGINE = MergeTree
ORDER BY stars;

-- 视图
CREATE MATERIALIZED VIEW github.top_repos_mv
REFRESH EVERY 30 MINUTES TO github.top_repos
(
	`repo_name` String,
	`stars` Int64
)
AS SELECT
	repo_name,
	sum(count) AS stars
FROM github.repo_events_per_day
WHERE event_type = 'WatchEvent'
GROUP BY ALL;

-- 查询
SELECT repo_name
	FROM github.top_repos
	ORDER BY stars DESC
	LIMIT 1000

投影

在 ClickHouse 中,如果筛选条件与排序键中较早的列相匹配,查询速度会更快。通过给原始表添加投影,可以很方便地添加不同的排序方式,从而加速其查询。

-- 原始表
CREATE TABLE actors_per_repo(	
  `actor_login` LowCardinality(String),
  `repo_name` String,
  `pushes` UInt32,
  `issues` UInt32,
  `stars` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (repo_name, actor_login);

-- 设置
ALTER TABLE github.actors_per_repo  
MODIFY SETTING deduplicate_merge_projection_mode='rebuild';

-- 添加投影(根据 actor_login 排序)
ALTER TABLE github.actors_per_repo 
ADD PROJECTION actor_login_order (
  SELECT * 
  ORDER BY actor_login
);
ALTER TABLE github.actors_per_repo 
MATERIALIZE PROJECTION actor_login_order;

-- 再次查询
SELECT actor_login, splitByChar('/', repo_name)[2] as project, sum(pushes) as pushes, sum(issues) as issues, sum(stars) as stars
FROM github.actors_per_repo
WHERE actor_login = 'alexey-milovidov'
GROUP BY actor_login, project
ORDER BY pushes DESC
LIMIT 5;