惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

D
Darknet – Hacking Tools, Hacker News & Cyber Security
V
Vulnerabilities – Threatpost
Cloudbric
Cloudbric
G
GRAHAM CLULEY
S
Securelist
Schneier on Security
Schneier on Security
Help Net Security
Help Net Security
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
Project Zero
Project Zero
Spread Privacy
Spread Privacy
P
Privacy International News Feed
C
Cyber Attacks, Cyber Crime and Cyber Security
Cisco Talos Blog
Cisco Talos Blog
T
Tailwind CSS Blog
博客园_首页
有赞技术团队
有赞技术团队
Simon Willison's Weblog
Simon Willison's Weblog
Stack Overflow Blog
Stack Overflow Blog
K
KPMG report finds enterprise disconnect between AI and its ROI | CIO
Latest news
Latest news
T
Tor Project blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
Attack and Defense Labs
Attack and Defense Labs
www.infosecurity-magazine.com
www.infosecurity-magazine.com
O
OpenAI News
J
Java Code Geeks
T
Tenable Blog
K
Kaspersky official blog
AWS News Blog
AWS News Blog
S
Security @ Cisco Blogs
The GitHub Blog
The GitHub Blog
T
Threatpost
月光博客
月光博客
H
Heimdal Security Blog
Security Latest
Security Latest
The Hacker News
The Hacker News
Y
Y Combinator Blog
A
Arctic Wolf
Apple Machine Learning Research
Apple Machine Learning Research
C
Cisco Blogs
美团技术团队
Microsoft Security Blog
Microsoft Security Blog
Hugging Face - Blog
Hugging Face - Blog
T
The Blog of Author Tim Ferriss
C
CERT Recently Published Vulnerability Notes
D
Docker
Google Online Security Blog
Google Online Security Blog
D
DataBreaches.Net
V
Visual Studio Blog
H
Help Net Security

CrunchyData Blog

British Columbia, Time Zones, and Postgres | Crunchy Data Blog Postgres Serials Should be BIGINT (and How to Migrate) | Crunchy Data Blog Postgres 18 New Default for Data Checksums and How to Deal with Upgrades | Crunchy Data Blog PostGIS Performance: Simplification | Crunchy Data Blog Postgres Scan Types in EXPLAIN Plans | Crunchy Data Blog
PostGIS Performance: Data Sampling | Crunchy Data Blog
Paul.Ramsey@ · 2025-11-21 · via CrunchyData Blog

One of the temptations database users face, when presented with a huge table of interesting data, is to run queries that interrogate every record. Got a billion measurements? What’s the average of that?!

One way to find out is to just calculate the average.

SELECT avg(value) FROM mytable;

For a billion records, that could take a while!

Fortunately, the “Law of Large Numbers” is here to bail us out, stating that the average of a sample approaches the average of the population, as the sample size grows. And amazingly, the sample does not even have to be particularly large to be quite close.

Here’s a table of 10M values, randomly generated from a normal distribution. We know the average is zero. What will a sample of 10K values tell us it is?

CREATE TABLE normal AS
  SELECT random_normal(0,1) AS values
    FROM generate_series(1,10000000);

We can take a sample using a sort, or using the random() function, but both of those techniques first scan the whole table, which is exactly what we want to avoid.

Instead, we can use the PostgreSQL TABLESAMPLE feature, to get a quick sample of the pages in the table and an estimate of the average.

SELECT avg(values)
  FROM normal TABLESAMPLE SYSTEM (1);

I get an answer – 0.0031, very close to the population average – and it takes just 43 milliseconds.

Can this work with spatial? For the right data, it can. Imagine you had a table that had one point in it for every person in Canada (36 million of them) and you wanted to find out how many people lived in Toronto (or this red circle around Toronto).

SELECT count(*)
  FROM census_people
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The answer is 5,010,266, and it takes 7.2 seconds to return. What if we take a 10% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (10)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 10%, and the answer comes back as 508,292 (near one tenth of our actual measurement) in 2.2 seconds. What about a 1% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 1%, and the answer comes back as 50,379 (near one hundredth of our actual measurement) in 0.2 seconds. Still a good estimate!

Is this black magic? No, the TABLESAMPLE SYSTEM mode gets its speed by reading pages randomly. In our last example, it randomly chose 1% of the pages. Here’s what that looks like in Toronto.

See in particular how blotchy the data are in the suburban areas outside the circle. The data in the table are not randomly distributed to the pages, they came from the census data in order, and ended up loaded into the database in order. So for any given database page, the actual rows in the page will tend to be near to one another.

This works for this example because the amount of data is high, and the area we are summarizing is a large proportion of the total data – a seventh of the Canadian population lives in that circle.

If we were summarizing a smaller area, the results would not have been so good.

The TABLESAMPLE SYSTEM is a powerful tool, but you have to be sure that any given page has a random selection of the data you are sampling for. Our random normal example worked perfectly, because the data were perfectly random. A sample of time series data would not work well for sample time windows (the data were probably stored in order of arrival) but might work for sampling some other value.