PostGIS Performance: Data Sampling | Crunchy Data Blog

Vulnerabilities – Threatpost

Cloudbric

GRAHAM CLULEY

Securelist

Schneier on Security

Help Net Security

Exploit-DB.com RSS Feed

Project Zero

Spread Privacy

Privacy International News Feed

Cyber Attacks, Cyber Crime and Cyber Security

Simon Willison's Weblog

Stack Overflow Blog

KPMG report finds enterprise disconnect between AI and its ROI | CIO

CrunchyData Blog

British Columbia, Time Zones, and Postgres | Crunchy Data Blog Postgres Serials Should be BIGINT (and How to Migrate) | Crunchy Data Blog Postgres 18 New Default for Data Checksums and How to Deal with Upgrades | Crunchy Data Blog PostGIS Performance: Simplification | Crunchy Data Blog Postgres Scan Types in EXPLAIN Plans | Crunchy Data Blog

PostGIS Performance: Data Sampling | Crunchy Data Blog

Paul.Ramsey@ · 2025-11-21 · via CrunchyData Blog

One of the temptations database users face, when presented with a huge table of interesting data, is to run queries that interrogate every record. Got a billion measurements? What’s the average of that?!

One way to find out is to just calculate the average.

SELECT avg(value) FROM mytable;

For a billion records, that could take a while!

Fortunately, the “Law of Large Numbers” is here to bail us out, stating that the average of a sample approaches the average of the population, as the sample size grows. And amazingly, the sample does not even have to be particularly large to be quite close.

Here’s a table of 10M values, randomly generated from a normal distribution. We know the average is zero. What will a sample of 10K values tell us it is?

CREATE TABLE normal AS
  SELECT random_normal(0,1) AS values
    FROM generate_series(1,10000000);

We can take a sample using a sort, or using the random() function, but both of those techniques first scan the whole table, which is exactly what we want to avoid.

Instead, we can use the PostgreSQL TABLESAMPLE feature, to get a quick sample of the pages in the table and an estimate of the average.

SELECT avg(values)
  FROM normal TABLESAMPLE SYSTEM (1);

I get an answer – 0.0031, very close to the population average – and it takes just 43 milliseconds.

Can this work with spatial? For the right data, it can. Imagine you had a table that had one point in it for every person in Canada (36 million of them) and you wanted to find out how many people lived in Toronto (or this red circle around Toronto).

SELECT count(*)
  FROM census_people
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The answer is 5,010,266, and it takes 7.2 seconds to return. What if we take a 10% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (10)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 10%, and the answer comes back as 508,292 (near one tenth of our actual measurement) in 2.2 seconds. What about a 1% sample?

SELECT count(*)
  FROM census_people TABLESAMPLE SYSTEM (1)
  JOIN yyz
    ON ST_Intersects(yyz.geom, census_people.geom);

The sample is 1%, and the answer comes back as 50,379 (near one hundredth of our actual measurement) in 0.2 seconds. Still a good estimate!

Is this black magic? No, the TABLESAMPLE SYSTEM mode gets its speed by reading pages randomly. In our last example, it randomly chose 1% of the pages. Here’s what that looks like in Toronto.

See in particular how blotchy the data are in the suburban areas outside the circle. The data in the table are not randomly distributed to the pages, they came from the census data in order, and ended up loaded into the database in order. So for any given database page, the actual rows in the page will tend to be near to one another.

This works for this example because the amount of data is high, and the area we are summarizing is a large proportion of the total data – a seventh of the Canadian population lives in that circle.

If we were summarizing a smaller area, the results would not have been so good.

The TABLESAMPLE SYSTEM is a powerful tool, but you have to be sure that any given page has a random selection of the data you are sampling for. Our random normal example worked perfectly, because the data were perfectly random. A sample of time series data would not work well for sample time windows (the data were probably stored in order of arrival) but might work for sampling some other value.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

CrunchyData Blog