惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Forbes - Security
Forbes - Security
GbyAI
GbyAI
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
S
SegmentFault 最新的问题
Y
Y Combinator Blog
Recorded Future
Recorded Future
博客园 - Franky
I
InfoQ
T
The Blog of Author Tim Ferriss
Recent Announcements
Recent Announcements
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
博客园_首页
阮一峰的网络日志
阮一峰的网络日志
T
Tailwind CSS Blog
Cyberwarzone
Cyberwarzone
The Register - Security
The Register - Security
H
Hackread – Cybersecurity News, Data Breaches, AI and More
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
雷峰网
雷峰网
P
Palo Alto Networks Blog
G
GRAHAM CLULEY
Cloudbric
Cloudbric
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
MongoDB | Blog
MongoDB | Blog
F
Full Disclosure
Google DeepMind News
Google DeepMind News
Recent Commits to openclaw:main
Recent Commits to openclaw:main
C
Check Point Blog
爱范儿
爱范儿
The GitHub Blog
The GitHub Blog
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
W
WeLiveSecurity
T
Threat Research - Cisco Blogs
U
Unit 42
N
Netflix TechBlog - Medium
The Cloudflare Blog
Spread Privacy
Spread Privacy
Microsoft Azure Blog
Microsoft Azure Blog
美团技术团队
T
Troy Hunt's Blog
Engineering at Meta
Engineering at Meta
H
Heimdal Security Blog
TaoSecurity Blog
TaoSecurity Blog
C
Cybersecurity and Infrastructure Security Agency CISA
T
Tenable Blog
B
Blog
S
Securelist
H
Hacker News: Front Page
Google Online Security Blog
Google Online Security Blog
G
Google Developers Blog

CrunchyData Blog

British Columbia, Time Zones, and Postgres | Crunchy Data Blog Postgres Serials Should be BIGINT (and How to Migrate) | Crunchy Data Blog Postgres 18 New Default for Data Checksums and How to Deal with Upgrades | Crunchy Data Blog PostGIS Performance: Simplification | Crunchy Data Blog Postgres Scan Types in EXPLAIN Plans | Crunchy Data Blog
PostGIS Performance: Data Sampling | Crunchy Data Blog
Paul.Ramsey@ · 2025-11-21 · via CrunchyData Blog

One of the temptations database users face, when presented with a huge table of interesting data, is to run queries that interrogate every record. Got a billion measurements? What’s the average of that?!

One way to find out is to just calculate the average.

SELECT avg(value) FROM mytable;

For a billion records, that could take a while!

Fortunately, the “Law of Large Numbers” is here to bail us out, stating that the average of a sample approaches the average of the population, as the sample size grows. And amazingly, the sample does not even have to be particularly large to be quite close.

Here’s a table of 10M values, randomly generated from a normal distribution. We know the average is zero. What will a sample of 10K values tell us it is?

CREATE TABLE normal AS
  SELECT random_normal(0,1) AS values
    FROM generate_series(1,10000000);

We can take a sample using a sort, or using the random() function, but both of those techniques first scan the whole table, which is exactly what we want to avoid.

Instead, we can use the PostgreSQL TABLESAMPLE feature, to get a quick sample of the pages in the table and an estimate of the average.

SELECT avg(values)
  FROM normal TABLESAMPLE SYSTEM (1);

I get an answer – 0.0031, very close to the population average – and it takes just 43 milliseconds.

Can this work with spatial? For the right data, it can. Imagine you had a table that had one point in it for every person in Canada (36 million of them) and you wanted to find out how many people lived in Toronto (or this red circle around Toronto).

SELECT count(*)
  FROM census_people
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The answer is 5,010,266, and it takes 7.2 seconds to return. What if we take a 10% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (10)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 10%, and the answer comes back as 508,292 (near one tenth of our actual measurement) in 2.2 seconds. What about a 1% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 1%, and the answer comes back as 50,379 (near one hundredth of our actual measurement) in 0.2 seconds. Still a good estimate!

Is this black magic? No, the TABLESAMPLE SYSTEM mode gets its speed by reading pages randomly. In our last example, it randomly chose 1% of the pages. Here’s what that looks like in Toronto.

See in particular how blotchy the data are in the suburban areas outside the circle. The data in the table are not randomly distributed to the pages, they came from the census data in order, and ended up loaded into the database in order. So for any given database page, the actual rows in the page will tend to be near to one another.

This works for this example because the amount of data is high, and the area we are summarizing is a large proportion of the total data – a seventh of the Canadian population lives in that circle.

If we were summarizing a smaller area, the results would not have been so good.

The TABLESAMPLE SYSTEM is a powerful tool, but you have to be sure that any given page has a random selection of the data you are sampling for. Our random normal example worked perfectly, because the data were perfectly random. A sample of time series data would not work well for sample time windows (the data were probably stored in order of arrival) but might work for sampling some other value.