TimescaleDB Compression: Hypercore and Columnar Storage with up to 98% Ratio in PostgreSQL

Hacker News: Front Page

Trump administration reclassifies cannabis as less dangerous Release raylib v6.0 · raysan5/raylib GitHub - russellromney/honker: SQLite extension + bindings for Postgres NOTIFY/LISTEN semantics with durable queues, streams, pub/sub, and scheduler Writing a C Compiler, in Zig crawshaw - 2026-04-22 MacBook Neo and How the iPad Should Be It's time to reclaim the word "Palantir" for J.R.R. Tolkien Arch Linux now has a bit-for-bit reproducible Docker image Fundamental Theorem of Calculus | David Álvarez Rosa | Personal Website Bring Your Agent to Teams Ars Technica newsroom AI policy France confirms data breach at government agency that manages citizens’ IDs New study compares growing corn for energy to solar production. It's no contest NAEP Long-Term Trend Assessment Results: Reading and Mathematics Convergent Evolution: How Different Language Models Learn Similar Number Representations We Found a Stable Firefox Identifier Linking All Your Private Tor Identities GitHub - besimple-oss/broccoli: Broccoli turns Linear tickets into shipped PRs — powered by Claude and Codex, running on your own Google Cloud. Youth Suicides Declined After Creation of National Hotline Top MAGA influencer revealed to be AI — created by a guy in India who made a mint off lonely men online Ping-pong robot beats top-level human players Announcing DuckDB 1.5.2 The handmade beauty of Machine Age data visualizations Treetops glowing during storms captured on film for first time Columnar Storage is Normalization TPU 8t and TPU 8i technical deep dive Our eighth generation TPUs: two chips for the agentic era Introducing Google Cloud Fraud Defense, the next evolution of reCAPTCHA | Google Cloud Blog Kernel code removals driven by LLM-created security reports tante.cc Nobody Got Fired for Uber's $8 Million Ledger Mistake? Introducing workspace agents in ChatGPT Sure, xor’ing a register with itself is the idiom for zeroing it out, but why not sub? What Async Promised and What it Delivered — Causality GitHub - justrach/kuri: Browser automation and web crawling for AI agents. Zig-native, token-efficient CDP snapshots, HAR recording, and a standalone fetcher. Drunk Post: Things I’ve Learned as a Senior Engineer Claude Code to be removed from Anthropic's Pro plan? Another Day Has Come 'Something sinister could be happening': FBI looks into dead or missing nuclear and space defense scientists tied to NASA, Blue Origin, and SpaceX | Fortune GitHub - calcom/cal.diy: Scheduling infrastructure for absolutely everyone. Meta to start capturing employee mouse movements, keystrokes for AI training The Vercel Breach: OAuth Supply Chain Attack Exposes the Hidden Risk in Platform Environment Variables Member of Technical Staff, Product Engineering (full-time) at Trellis AI | Y Combinator CATL's new LFP battery can charge from 10 to 98% in less than 7 minutes Jobs at Bloom | Y Combinator The printing press for biological data (Sterling Hooten) Brussels launched an age checking app. Hackers took 2 minutes to break it Inside GitHub's Fake Star Economy The Illuminated Man by Christopher Priest and Nina Allan review – an unconventional portrait of JG Ballard IEA: Solar overtakes all energy sources in a major global first Stripe’s payments APIs: The first 10 years GitHub - esutcu/planb-lpm GitHub - browser-use/browser-harness: Self-healing browser harness that enables LLMs to complete any task. Claude Token Counter, now with model comparisons GitHub - shivampkumar/trellis-mac Six levels of dark mode The Bromine Chokepoint: How Strife in the Middle East Could Halt Production of the World’s Memory Chips Turtle WoW classic server announces shutdown after Blizzard wins injunction Scoring 500 Show HN pages for AI design patterns Vercel April 2026 security incident | Vercel Knowledge Base Dubai police arrest airline worker after accessing private WhatsApp group Prompt → Diagram — Gemma 4 E2B in desktop Chrome (WebGPU) Binary GCD - Algorithmica madhadron - The seven programming ur-languages Keep Pushing: We Get 10 More Days to Reform Section 702 The world in which IPv6 was a good design Zero-Copy GPU Inference from WebAssembly on Apple Silicon The RAM shortage could last years Any Color You Like: NIST Scientists Create ‘Any Wavelength’ Lasers in Tiny Circuits for Light Optimizing Ruby Path Methods A college instructor turns to typewriters to curb AI-written work and teach life lessons UpCodes | Careers The electromechanical angle computer inside the B-52 bomber's star tracker Why Japan has such good railways - Works in Progress Magazine State of Kdenlive - 2026 GitHub - smol-machines/smolvm: Tool to build & run portable, lightweight, self-contained virtual machines. Head of Engineering at Kyber | Y Combinator GitHub - paniclock/paniclock: Instantly disable Touch ID and lock your Mac with one click or keyboard shortcut. Detecting DOSBox from within the Box I Measured Claude 4.7's New Tokenizer. Here's What It Costs You. Introducing Claude Design by Anthropic Labs Middle schooler finds coin from Troy in Berlin It Is Time to Ban the Sale of Precise Geolocation Isaac Asimov: The Last Question Teddy Roosevelt and Abraham Lincoln in the same photo Healthchecks.io Now Uses Self-hosted Object Storage Bluesky has been dealing with a DDoS attack for nearly a full day. Harness Engineer at Substrate | Y Combinator GitHub - dacracot/Klondike3-Simulator SPICE simulation → oscilloscope → verification with Claude Code — Lucas Gerads Email could have been X.400 times better Newly unsealed records reveal Amazon’s price-fixing tactics, California attorney general claims GitHub - GainSec/AutoProber: Hardware hacker’s flying probe automation stack for agent-driven target discovery, microscope mapping, safety-monitored CNC motion, probe review, and controlled pin probing. A Better R Programming Experience Thanks to Tree-sitter Clojure - Documentary GPT‑Rosalind for life sciences research How a Tiny Yellow Handheld Changed How Duke University Teaches Game Design - Playdate News Android CLI and skills: Build Android apps 3x faster using any agent Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 Codex for almost everything GitHub - GRVYDEV/marky: A lightweight easy to use markdown viewer

AWS, Kubernetes & Cloud Security Experts – IT Consulting | Roszi · 2026-06-16 · via Hacker News: Front Page

TimescaleDB can achieve compression of up to 98% for typical time-series data. Compressing time-series data requires a fundamentally different approach than the general-purpose algorithms used in OLTP databases. In TimescaleDB this is handled by the hypercore engine — a hybrid row-columnar engine that uses specialized algorithms: delta encoding, delta-of-delta, Gorilla XOR and run-length encoding. This article explains how it works and how to configure compression so that you actually achieve that ratio.

TimescaleDB compression - how it differs from PostgreSQL TOAST

PostgreSQL has a built-in mechanism called TOAST (The Oversized-Attribute Storage Technique), but TimescaleDB compression solves a fundamentally different problem. TOAST deals with individual large values (long strings, jsonb, bytea), whereas TimescaleDB compression optimizes cross-row patterns in time-series data. The two mechanisms are complementary, not competing — TimescaleDB even uses TOAST internally as a fallback for certain data types. PostgreSQL uses a fixed “page size”, typically 8 kB, and does not allow tuples to span multiple pages. For that reason, when field values are very large, the data must be compressed and/or split across multiple physical rows.

Feature	TOAST (vanilla PostgreSQL)	TimescaleDB hypercore
Design goal	Individual values > 2 KB	Cross-row patterns in time-series
Trigger	Row exceeds `TOAST_TUPLE_THRESHOLD` (~2 KB)	Per-chunk policy (e.g. older than 7 days)
Supported types	Variable-length only (`text`, `jsonb`, `bytea`, `numeric`)	All data types
Algorithms	`pglz` (default), `lz4` (since PG14, opt-in)	Combination: delta encoding, delta-of-delta, simple-8b, run-length encoding, XOR-based, dictionary compression
Compression granularity	Per value (1 value = 1 byte stream)	Per batch (~1000 rows together)
Exploiting data structure	No - treats values as opaque bytes	Yes - exploits numeric structure, monotonicity, repetition
Typical ratio for sensor floats	~1.0× (no compression)	10-20×
Typical ratio for timestamps	~1.0× (no compression - fixed-length type)	50-100× (delta-of-delta for regular intervals)
Typical ratio for text	2-3× (general-purpose LZ)	5-10× (dictionary + RLE if repetitive)

The table shows the scale of the difference. For a typical IoT workload with floats and timestamps — i.e. the columns TOAST does not compress at all — TimescaleDB reaches a ratio of 10-100×, because it is built for this type of data.

The Hypercore engine and columnar compression

In TimescaleDB, compression is handled by an engine called hypercore — a hybrid row-columnar engine in which new data lands in Postgres row-based chunks (fast INSERTs and UPDATEs), while older chunks are automatically converted to a columnar, compressed format. Analytical queries that read this compressed data read fewer bytes and run faster. This conversion enables compression of up to 98%, which significantly lowers storage costs in projects with long data retention. Unlike traditional row-based storage, where data is stored sequentially by row, columnar storage organizes and compresses data by column. As a result, queries can fetch only the necessary fields in batches instead of scanning entire rows.

What happens to the rows

Converting a chunk groups rows into batches of up to 1000 and each batch becomes a single row in the compressed table, in which the columns are arrays.

Each compressed batch:

Encapsulates columnar data in compressed arrays of up to 1000 values per column, stored as a single entry in the compressed table.
Uses a column-major format inside the batch, which enables efficient scans by colocating values of the same column and lets you select individual columns without reading the entire batch.
Applies advanced column-level compression techniques — run-length encoding, delta encoding, Gorilla compression — reducing storage and improving I/O.

Source: https://www.tigerdata.com/docs/learn/deep-dive/whitepaper#data-model

An example of compression using delta encoding:

time	machine_id	sensor_type	value
12:00:00	MACHINE_001	temp	72.5
12:00:00	MACHINE_001	speed	2.0
12:00:05	MACHINE_001	temp	72.7
12:00:05	MACHINE_001	speed	2.1
12:00:10	MACHINE_001	temp	72.4
12:00:10	MACHINE_001	speed	2.4

With delta encoding you only need to store how much each value changed relative to the previous data point, which means smaller values to store. After the first row, you can represent the following rows using less information, for example:

time	machine_id	sensor_type	value
12:00:00	MACHINE_001	temp	72.5
0 seconds	MACHINE_001	speed	2.0
5 seconds	MACHINE_001	temp	+0.2
0 seconds	MACHINE_001	speed	+0.1
5 seconds	MACHINE_001	temp	-0.3
0 seconds	MACHINE_001	speed	+0.3

In time-series data it is often the case that certain values repeat for some period. For example, if you have a temperature sensor that reads 72.5 degrees for 10 minutes, then suddenly rises to 73.0 degrees and stays there for another 10 minutes, you can use delta-of-delta encoding. If the interval is constant (e.g. always 5 seconds), the delta-of-delta is 0 and can be stored in a very small number of bits.

time	machine_id	sensor_type	value
12:00:00	MACHINE_001	temp	72.5
+5 seconds	MACHINE_001	temp	+0.2
0 seconds	MACHINE_001	temp	-0.3
0 seconds	MACHINE_001	temp	+0.3
0 seconds	MACHINE_001	temp	-0.1

Delta encoding works great for numeric values that change by small amounts, but time-series data also often contains columns in which the same value repeats across many consecutive rows — for example machine_id, sensor_type or device status. In such cases run-length encoding (RLE) is used, which instead of storing the same value repeatedly stores it once together with the number of repetitions.

Data before compression:

time	machine_id	sensor_type	value
12:00:00	MACHINE_001	temp	72.5
12:00:05	MACHINE_001	temp	72.7
12:00:10	MACHINE_001	temp	72.4
12:00:15	MACHINE_001	temp	72.6
12:00:20	MACHINE_001	temp	72.5

After applying RLE to the machine_id and sensor_type columns:

machine_id	sensor_type
MACHINE_001 × 5	temp × 5

Instead of five copies of the string MACHINE_001 (~55 bytes) we store a single value plus a counter (~15 bytes). With millions of rows sharing the same machine_id value, the savings are enormous.

In the end it will look like this:

column	technique	representation after compression
`time`	delta-of-delta	`12:00:00`, `+5s`, `0`, `0`, `0`
`machine_id`	run-length encoding	`MACHINE_001 × 5`
`sensor_type`	run-length encoding	`temp × 5`
`value`	delta encoding	`72.5`, `+0.2`, `-0.3`, `+0.2`, `-0.1`

There are many other methods used in TimescaleDB, which you can read about in the official documentation.

Compression is not “one size fits all” — TimescaleDB picks the algorithm per column type, which is key to understanding why the ratio varies so much between schemas:

Integers, timestamps, booleans and integer-like types — a combination of delta encoding, delta-of-delta, simple-8b and run-length encoding. Delta-of-delta produces small numbers (for regular intervals — all zeros), and simple-8b then physically packs those small numbers into a few bits per value. A similar approach (delta-of-delta for timestamps) is used by Facebook’s Gorilla algorithm.
Columns without many repetitions (e.g. floats from temperature and vibration measurements) — XOR-based compression (based on Gorilla) with a touch of dictionary compression. XORing neighboring floats yields a result with a long run of leading and trailing zeros when the values are similar — then you only need to store the middle “significant” bits instead of the full 64.
JSONB — two layers: first dictionary (when values repeat), and if there are no repetitions, a fallback to PostgreSQL TOAST (pglz by default, lz4 if configured).
Everything else (strings, more unusual types) — dictionary compression. The dictionary indexes also go through simple-8b + RLE, so the compression is two-stage.

That is why sensor_type in the form 'TEMPERATURE'/'SPEED'/'PRESSURE' compresses brilliantly (a 3-element dictionary plus RLE on the indexes), monotonically increasing time drops to almost zero bytes per value, while a high-entropy column such as a per-row UUID will be much worse — a dictionary helps little, because every value is unique, so the dictionary is just as large as the original data. TimescaleDB detects this case and simply does not use a dictionary then.

`segmentby` and `orderby` — the most important parameters

These are the two parameters you have to choose deliberately, because they determine how rows are grouped into batches before compression.

segmentby — the column whose values are shared across an entire batch (e.g. machine_id or sensor_id). The value is stored once per batch, not as an array. In addition, the planner uses segmentby metadata to skip entire batches that do not match the WHERE clause.
orderby — the sort order inside the batch (usually time DESC). Sorting by time gives delta encoding and delta-of-delta their maximum advantage — neighboring values are close to each other, so the differences are small and pack into a few bits.

ALTER TABLE iot_sensor_data SET (
  timescaledb.orderby = 'time DESC',
  timescaledb.segmentby = 'machine_id'
);

Queries with a WHERE machine_id = '...' AND time BETWEEN ... filter on a table configured this way can be an order of magnitude faster than without segmentby, because the planner skips other machines’ batches based on metadata — without touching the data itself.

TimescaleDB packs rows into batches of ~1000 and compresses each batch separately. If segmentby has too high a cardinality (e.g. segmentby = sensor_id with thousands of sensors in IoT, where each sensor has only a few rows per chunk), then each “segment” in the chunk has too few rows, the batches are underfilled and compression is ineffective — the delta/XOR encoders need a series of similar values to compress anything.

The official rule from the documentation: each segment should contain at least 100 rows in a chunk, and optimally 100–10,000 unique segmentby values per chunk.

What does compression do to query performance?

A common question: does compression slow down queries?

Short answer: for typical time-series queries — it speeds them up.

Queries that speed up (the majority of the workload):

Range scans over time with aggregation (SUM, AVG, MAX per time bucket)
Queries with a filter on the segmentby column
Sequential scans over large ranges

Columnar compression reduces I/O by 10-20×. A query reading 1 GB uncompressed vs 100 MB compressed = fewer disk reads, less memory, less CPU for deserialization.

Queries that slow down (rare in time-series):

Point lookup of a single row (WHERE time = '...' AND id = X)
UPDATE/DELETE on compressed chunks (decompress→modify→recompress cycle)
Queries without a filter on segmentby when that column has high cardinality

How to implement it

-- Columnstore configuration for IoT sensor monitoring
ALTER TABLE iot_sensor_data SET (
  timescaledb.compress,
  timescaledb.segmentby = 'machine_id',
  timescaledb.orderby = 'time DESC'
);

-- Policy for automatically converting chunks older than 7 days
SELECT add_columnstore_policy('iot_sensor_data', after => INTERVAL '7 days');

-- Verification
SELECT * FROM chunks_detailed_size('iot_sensor_data');

-- What is compressed
SELECT chunk_name, is_compressed, range_start,
       pg_size_pretty(total_bytes) AS size
FROM timescaledb_information.chunks c
       JOIN chunks_detailed_size('iot_sensor_data') cds USING (chunk_schema, chunk_name)
WHERE hypertable_name = 'iot_sensor_data'
  AND is_compressed = true
ORDER BY range_start;

Example from a real database

In my mqtt_data table I have ~180 unique id values with 4,000–113,000 rows each, depending on the chunk. Configuration:

ALTER TABLE mqtt_data SET (
    timescaledb.enable_columnstore = true,
    timescaledb.segmentby = 'id',
    timescaledb.orderby   = 'time DESC'
);

The effect — the same query on a rowstore vs columnstore chunk

A production-type query, “point read by id and a narrow time range”:

SELECT *
FROM mqtt_data
WHERE time >= '...'::timestamptz
  AND time <  '...'::timestamptz + interval '5 minutes'
  AND id = 'Site1.Machine1.SPEED'
ORDER BY time DESC
LIMIT 10;

Metric	Rowstore (chunk 47, 2.3 GB)	Columnstore (chunk 46, 7.2 MB)
Execution time	10.2 ms	0.36 ms
Planning time	19.0 ms	1.9 ms
Total	29.2 ms	2.3 ms
Speed-up	—	~12.7× total / 28× execution
Data compression ratio	—	42.8× (308 MB → 7.2 MB)

The compressed chunk is ~42× smaller on disk (same data; per-chunk B-tree indexes disappear in columnstore, so the real saving is even larger) and at the same time 28× faster in execution. This is not an error — it is the result of three things working together.

Query plan — let’s show it with `EXPLAIN ANALYZE`

Columnstore chunk (after compression)

Limit  (actual time=0.058..0.259 rows=10 loops=1)
  ->  Custom Scan (ChunkAppend) on mqtt_data
        Order: mqtt_data.time DESC
        ->  Index Scan using _hyper_1_47_chunk_mqtt_data_time_idx
              on _hyper_1_47_chunk  (rowstore, latest)
        ->  Custom Scan (DecompressChunk) on _hyper_1_46_chunk  (never executed)
              Vectorized Filter: ((time >= '...') AND (time < '...'))
              ->  Index Scan using compress_hyper_28_823_chunk_id__ts_meta_min_1__ts_meta_max__idx
                    Index Cond: ((id = 'Site1.Machine1.ERROR')
                             AND (_ts_meta_min_1 < '...')
                             AND (_ts_meta_max_1 >= '...'))
Planning Time: 1.912 ms
Execution Time: 0.363 ms

The index TimescaleDB built for the columnstore is (id, _ts_meta_min_1, _ts_meta_max_1). It was created automatically — it was not defined by hand. Simply because id is segmentby and time is orderby.

Rowstore chunk (before compression)

Limit  (actual time=3.562..10.076 rows=10 loops=1)
  ->  Custom Scan (ChunkAppend) on mqtt_data
        Order: mqtt_data.time DESC
        ->  Index Scan using _hyper_1_47_chunk_mqtt_data_id_time_idx
              on _hyper_1_47_chunk
              Index Cond: ((id = 'Site1.Machine1.ERROR')
                        AND (time >= '...')
                        AND (time < '...'))
Planning Time: 19.014 ms
Execution Time: 10.217 ms

A classic Index Scan over mqtt_data_id_time_idx (~750 MB B-tree for this chunk). It works, but slower, because:

The index does not fit in cache
The planner has to read larger statistics
Postgres iterates row by row

Why does columnstore win on speed too?

1. Sparse minmax index on the meta columns

TimescaleDB itself builds an index on (segmentby_col, _ts_meta_min_1, _ts_meta_max_1) — where min/max are the extreme orderby values per 1000-row batch. This lets it eliminate entire batches without reading them, checking only the meta.

2. Segmentby as a native filter

Rows with the same id are physically grouped together. The index hits the right segment immediately — there is no need for a separate B-tree on (id, time). Segmentby handles this “for free”, as a side effect of the data layout.

3. Vectorized execution

Operations on time ranges run in batches (1000 rows at a time) instead of row by row, as in a classic Index Scan.

Important caveats

The numbers apply to a specific use case — “point read by id and a narrow time range”. For queries aggregating full months the difference will be different. For queries scanning without an id filter, columnstore can be slower than a well-indexed rowstore.
42× is my dataset. MQTT sensor data has exceptionally high redundancy — values change smoothly (Gorilla works great), topics/units repeat within an id (dictionary encoding at its maximum). A realistic expectation for typical time-series: 8–20×.
Fresh chunks stay in rowstore — the policy only converts chunks older than the after => interval. Queries on current data (e.g. the last 5 minutes) are not touched by hypercore.

How to check whether `segmentby` will work for your data?

-- Distribution inside a specific chunk
WITH per_id AS (
    SELECT id, count(*) AS n
    FROM _timescaledb_internal._hyper_X_Y_chunk
    GROUP BY id
)
SELECT
    count(*) FILTER (WHERE n < 100)  AS ids_under_100_rows,
    count(*) FILTER (WHERE n < 1000) AS ids_under_1000_rows,
    count(*) AS total_ids
FROM per_id;

If ids_under_100_rows = 0 and total_ids falls within 100–10,000 → a good segmentby.

If most id values have <100 rows → change your strategy.

Summary

If you are planning a PostgreSQL TimescaleDB deployment - especially for IoT applications, production monitoring or time-series in financial systems - and you want to make sure the compression ratio will be 15×, not 2×, at RoszigIT we design and deploy the Grafana + TimescaleDB + AWS stack for industry. Get in touch if you need an opinion or direct support with the architecture.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hacker News: Front Page

TimescaleDB compression - how it differs from PostgreSQL TOAST

The Hypercore engine and columnar compression

What happens to the rows

segmentby and orderby — the most important parameters

What does compression do to query performance?

How to implement it

Example from a real database

The effect — the same query on a rowstore vs columnstore chunk

Query plan — let’s show it with EXPLAIN ANALYZE

Columnstore chunk (after compression)

Rowstore chunk (before compression)

Why does columnstore win on speed too?

1. Sparse minmax index on the meta columns

2. Segmentby as a native filter

3. Vectorized execution

Important caveats

How to check whether segmentby will work for your data?

Summary

`segmentby` and `orderby` — the most important parameters

Query plan — let’s show it with `EXPLAIN ANALYZE`

How to check whether `segmentby` will work for your data?