Postgres DISTINCT ON: The Fastest Way to Get the Latest Row Per Group

The Practical Developer

The Libuv Thread Pool Trap: Why Node.js Async APIs Stall Under Load Postgres Covering Indexes with INCLUDE: Eliminate Heap Fetches on Read-Heavy Workloads Postgres Transaction Isolation: The Anomalies Your App Actually Faces in Production Linux TCP Tuning for Node.js Microservices: The Kernel Settings That Stop Silent Connection Drops Under Load Postgres HOT Updates and Fillfactor: Why Not All Writes Are Created Equal Database Connection Pool Leaks: Finding the Promise That Never Returns Its Seat Linux OOM Killer in Production: Why Your Node.js Containers Die Without a Stack Trace Postgres Materialized Views: Refresh Strategies That Do Not Lock Your Dashboards API Dependency Health Checks: Why /health Is Not Enough Authorization with Zanzibar Tuples: How Google Manages Permissions and How To Build the Same Check in Node.js Postgres Advisory Locks: The 20-Character Primitive That Replaces Redis for Coordination Dead Letter Queues: The Message Queue Pattern That Saves You at 2 a.m. File Descriptor Exhaustion: The Kernel Limit That Silently Drops Node.js Connections Graceful Degradation: The Pattern That Turns Total Outages into Partial Success PostgreSQL Full-Text Search: Dropping Elasticsearch for 90% of Use Cases S3 Presigned Multipart Uploads: Stop Your API Server from Being a File Upload Bottleneck MessagePack vs JSON: The Binary Serialization Switch That Cut Our Internal RPC Overhead by 40% DNS Caching in Node.js: The Silent Cause of Production Latency Spikes Reliable Cron Jobs: The Pattern That Stops Double Runs, Missed Executions, And The 2 AM Page GraphQL Query Complexity: Stop the OOM Query Before It Reaches Your Resolver Node.js Event Loop Lag: The Hidden Metric Behind Random Latency Spikes API Request Validation with Zod: The Schema That Catches Bad Input Before It Corrupts Your Database Load Shedding in Node.js: How to Reject Traffic Before You Drown Request Hedging: Cut Tail Latency In Half Without Overprovisioning Git Bisect: The Automated Binary Search That Finds Breaking Commits in Minutes Node.js Garbage Collection Tuning: Stop Letting V8 Pause Your Event Loop Node.js Server Timeouts: The Settings That Stop Slow Clients from Holding Sockets Hostage Postgres BRIN Indexes: The Time-Series Secret That Shrinks Indexes by 99% Event Sourcing with PostgreSQL: The Pragmatic 80% Solution Node.js Cluster Mode: Scaling the Event Loop Across CPU Cores Postgres Partial Indexes: Stopping Soft Deletes from Ruining Your Query Performance Request Coalescing with the Singleflight Pattern: Stop Drowning Your Database on Every Cache Miss The Bulkhead Pattern: Why One Slow Endpoint Should Not Drown Your Whole Service Node.js AsyncLocalStorage: End-to-End Request Context Without the Propagation Hell Postgres Deadlocks: Logging the Victim, Reproducing the Race, and Fixing the Lock Order Your Node.js HTTP Client Is the Bottleneck: Connection Pool Tuning That Works Optimistic Locking in Postgres: Stop Losing Data to Race Conditions Postgres Read Replicas: Stop Serving Stale Data to Your Users Cursor Pagination: Why Offset Queries Explode at Scale and How to Fix Them Node.js Worker Threads: 60 Lines That Stop a CSV Upload from Timing Out Every Other Request Reliable Webhook Delivery: Architecture for Outbound HTTP You Can Trust Request Timeouts and Deadline Propagation: Stop the Chain of Slowness Advanced Security Practices in Node.js Graceful Shutdown in Node.js: The 40 Lines That Stop 502s During Deploys Finding Node.js Memory Leaks with Heap Snapshots Idempotency Keys in 30 Lines: Stop Your Webhook From Charging Customers Twice Backpressure In Node.js: The Fix For Slow-Motion Queue Meltdowns Retries Done Right: Jitter, Budgets, and the Stampede You Did Not See Coming The Cache Stampede: Why Your "Just Add Redis" Layer Crashes Postgres at 3 a.m. Postgres SKIP LOCKED: An 80-Line Job Queue You Can Run Without Redis Stop Doing Work Nobody Wants: AbortController in Node.js, Done Right The N+1 Query Problem: We Found 23 In One Codebase And Killed Every One I Tried 5 AI Coding Tools for a Month. Here Is What I Actually Use CI/CD From Zero to Production in 30 Minutes With GitHub Actions Node.js vs Bun vs Deno: Which Runtime Should You Pick in 2025? Kubernetes Resource Requests And Limits: The Numbers That Decide If Your Cluster Is Stable The Three Pillars of Observability Are A Myth: What Actually Matters In Production pnpm Vs npm Vs yarn Vs Bun For Monorepos: Which One Earns The Migration In 2024 JSONB Indexing In Postgres: GIN Vs Expression Indexes, And When Each Is The Right Choice A Code Review Checklist That Ends The Same Three Arguments Every Sprint gRPC Vs REST In 2024: When The Switch Pays For Itself React Suspense For Data Fetching: The Pattern That Replaces Half Your Loading State Code The Five-Stage Rollout: How To Ship A Risky Change Without Holding Your Breath GitHub Actions In A Monorepo: Caching, Path Filters, And Secret Boundaries That Actually Work The Blameless Postmortem That Actually Improves Things: A Template And Six Hard-Won Rules Recursive CTEs In Postgres: How To Query A Tree Without N Round Trips Node.js Streams: When They Actually Help, And When They Just Add Complexity Playwright Vs Cypress In 2024: The Honest Comparison Of Which One Earns The Test Time React Server Components: The Mental Model That Makes The "use client" Boundary Obvious Pod Disruption Budgets: The K8s Object That Keeps Your Service Up During Cluster Maintenance Postgres LISTEN/NOTIFY: The Pub/Sub You Already Have And Are Not Using Chaos Engineering Starter Kit: The Five Drills That Don't Need Netflix-Scale Spec-Driven API Development With OpenAPI: How To Stop Drifting From Your Docs Saga Pattern vs Two-Phase Commit: Distributed Transactions Without The Lies Kubernetes Autoscaling Beyond CPU: The Custom-Metric HPA Pattern That Actually Works Postgres Partitioning For Time-Series: The Boring Setup That Saves Your Database Distributed Locks With Redis: An Honest Look At Redlock And When You Don't Need It HTTP/2 vs HTTP/3: What Actually Changes For Your App, And What Doesn't Image Optimization For The Web In 2023: srcset, AVIF, And The Lighthouse Score You Actually Want Kafka vs RabbitMQ: A Decision Tree That Doesn't Hate You UUID vs Bigint Primary Keys In Postgres: The Index Math That Decides For You Flame Graphs: How To Find The Slow Function In 30 Seconds Without Profiling Theatre Postgres Streaming Vs. Logical Replication: Which One Solves Your Actual Problem ESLint Rules That Earn Their Keep: The Twelve I Enable On Every Project Pre-Commit Hooks That Pay For Themselves: Husky, lint-staged, And The Five Rules That Stick Zero-Downtime Database Migrations: The Six-Step Pattern That Rules Them All Circuit Breakers In Node.js: 50 Lines That Stop A Failing Dependency From Taking Down Your Service Postgres VACUUM Is Not Magic: How Your Hot Table Bloats To 80GB And How To Fix It Kubernetes Liveness And Readiness Probes: The Difference That Causes Half Your Outages Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis The Outbox Pattern: How To Stop Losing Events When Postgres And Kafka Disagree Load Testing With k6: The Three Scenarios That Find Real Bugs (Not Synthetic Numbers) Postgres Row-Level Security For Multi-Tenant Apps: The Pattern That Stops You From Leaking Data Rebase vs. Merge: The Team Policy That Ends The Argument Forever OpenTelemetry in Node.js: Distributed Tracing That Actually Helps During an Incident Feature Flags That Pay Rent: The 4 Flag Types And When To Delete Each ETag, Last-Modified, and the Caching Headers Most APIs Get Wrong Connection Pooling Without the Cargo Cult: pgbouncer in 100 Lines of Config JSONB Is Not a Schema: When To Reach For It in Postgres, And When To Stop Bash Strict Mode: The Three Lines That Stop Your Deploy Script From Lying To You

The Practica · 2026-05-27 · via The Practical Developer

The device status dashboard was the first screen every operator opened in the morning. It showed the latest heartbeat, temperature, and error count for 12,000 IoT devices in a single table. At 9:00 AM it loaded in 820 milliseconds. By 2:00 PM, with more historical data in the events table, it was pushing 1.4 seconds. The query plan showed a WindowAgg over 4.2 million rows, followed by a Filter to keep only row_number = 1. The index was being used. The planner was not stupid. But the query was doing work that did not need to happen.

The team had written what every blog post recommends: a window function with PARTITION BY device_id ORDER BY created_at DESC. It is portable SQL. It is easy to understand. And for this specific pattern (one row per group, sorted by time), it is the second-slowest correct solution you can write in Postgres. The slowest is the correlated subquery.

Postgres has a specialized primitive for exactly this problem. It is called DISTINCT ON. It is not standard SQL. It will not port to MySQL or SQL Server. But if you run Postgres in production and you need the latest row per group, it is the tool you should reach for first. This post shows why it wins, how to index for it, and the three gotchas that will bite you if you skip the details.

The query everyone writes first

The events table looks like this:

CREATE TABLE events (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  device_id uuid NOT NULL,
  temperature numeric(5,2),
  status text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

The requirement: show the most recent event for every device. The window-function approach:

SELECT id, device_id, temperature, status, created_at
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY created_at DESC) AS rn
  FROM events
) sub
WHERE rn = 1;

This query scans the table, sorts every partition by created_at DESC, assigns a row number, then filters out every row except the first per device. If you have 12,000 devices and 4.2 million events, the database sorts 4.2 million rows to return 12,000. The sort is the killer. Even with an index on (device_id, created_at), the planner may choose a Seq Scan plus Sort because random index lookups for 4.2 million rows can be slower than reading the table sequentially and sorting in memory.

The correlated subquery is worse:

SELECT e1.*
FROM events e1
WHERE e1.created_at = (
  SELECT MAX(e2.created_at)
  FROM events e2
  WHERE e2.device_id = e1.device_id
);

This runs the subquery once per row, or once per device if the planner is generous. In practice, with four million rows, it is a disaster. Do not use this pattern.

DISTINCT ON: one scan, one row per group

Postgres extends the SQL standard with DISTINCT ON (expression). It keeps the first row for each distinct value of the expression, discarding the rest. The ORDER BY clause controls which row is “first.”

SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
ORDER BY device_id, created_at DESC;

That is the entire query. No subquery. No window function. No filter predicate.

The semantics are simple: the database scans rows in ORDER BY order. Every time it sees a new device_id, it emits that row and skips every subsequent row with the same device_id. Because the ordering is device_id first and created_at DESC second, the first row per device is the most recent one.

The key performance insight is that this query can be satisfied with an index scan that never sorts. If you have an index on (device_id, created_at DESC), the index is already ordered exactly the way the query needs. Postgres walks the index from the start, emits one row per distinct device_id, and stops after 12,000 rows. It does not touch the other 4.1 million rows. It does not allocate memory for a sort. The difference between 820 ms and 8 ms is often this single rewrite.

The index that makes it work

Without the right index, DISTINCT ON is not automatically fast. The planner must still sort the data to group by device_id in the order the query specifies.

CREATE INDEX idx_events_device_created
ON events (device_id, created_at DESC);

Note the DESC. If your ordering is created_at DESC but your index is (device_id, created_at ASC), Postgres can still use the index by scanning backward, but the plan is slightly more complex and the optimizer may hesitate. Match the index direction to the query direction.

Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) on both versions. The window-function plan looks like this:

WindowAgg  (cost=... rows=4200000)
  ->  Index Scan using idx_events_device_created on events
        (cost=... rows=4200000)

Even with the index, the database reads every row. The WindowAgg node buffers the partition, sorts it, assigns numbers, and filters.

The DISTINCT ON plan with the same index looks like this:

Unique  (cost=... rows=12000)
  ->  Index Scan using idx_events_device_created on events
        (cost=... rows=12000)

The Unique node is a streaming deduplicator. It reads the index in order, keeps a running memory of the last-seen device_id, and emits rows only when the value changes. After 12,000 distinct values, the scan stops (or continues only if the query has a LIMIT, which DISTINCT ON composes with naturally).

Check the Buffers: line in EXPLAIN (ANALYZE, BUFFERS). The window function version will show tens of thousands of shared buffer hits. The DISTINCT ON version will show hundreds. That is the difference between a query that saturates your buffer cache and one that barely disturbs it.

Partial indexes for soft deletes and stale data

Most production tables have rows you do not care about: soft-deleted records, events older than a retention window, or devices that have been decommissioned. If your dashboard only shows active devices, indexing every row is wasted space and slower maintenance.

A partial index targets exactly the rows the query needs:

CREATE INDEX idx_events_active_device_created
ON events (device_id, created_at DESC)
WHERE deleted_at IS NULL;

And the query should include the same predicate:

SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
WHERE deleted_at IS NULL
ORDER BY device_id, created_at DESC;

The partial index is smaller, faster to scan, and faster to maintain during inserts and updates. If your dashboard query is the only thing that needs this specific ordering, a partial index is often the right call.

For events with a retention window, add a time-based predicate:

CREATE INDEX idx_events_recent_device_created
ON events (device_id, created_at DESC)
WHERE created_at > now() - interval '30 days';

Match the query predicate exactly. If the query says WHERE created_at > now() - interval '30 days', the planner will use the partial index. If the query omits the predicate, the planner will fall back to the broader index or a sequential scan.

The three gotchas

DISTINCT ON is powerful, but it has sharp edges. Most production bugs with this syntax come from one of three mistakes.

Gotcha 1: ORDER BY must start with the DISTINCT ON expression.

This is not optional. The ORDER BY clause must lead with the same expression (or expressions, in the same order) as DISTINCT ON, followed by the sorting you want within each group.

-- CORRECT
SELECT DISTINCT ON (device_id) *
FROM events
ORDER BY device_id, created_at DESC;

-- WRONG: will error
SELECT DISTINCT ON (device_id) *
FROM events
ORDER BY created_at DESC;

The error is clear: SELECT DISTINCT ON expressions must match initial ORDER BY expressions. But if you are generating SQL dynamically from an ORM or a query builder, it is easy to produce invalid output. Always assert that your generator prepends the distinct expression to the order clause.

Gotcha 2: NULLs group together, and NULL ordering matters.

If device_id can be NULL (it should not be, but schemas drift), all NULL values form a single group. DISTINCT ON returns one row for the entire NULL group, not one NULL row per some other key. The row it returns depends on the second-level sort order.

If you need deterministic behavior with NULLs, be explicit:

SELECT DISTINCT ON (device_id)
  id, device_id, temperature, status, created_at
FROM events
ORDER BY device_id NULLS LAST, created_at DESC;

Gotcha 3: DISTINCT ON gives you one row per group, not N rows per group.

This is the most common confusion. If your dashboard needs the latest three events per device, DISTINCT ON cannot help you. It is strictly a “top 1 per group” tool. For “top N per group,” use a LATERAL join:

SELECT e.*
FROM devices d
CROSS JOIN LATERAL (
  SELECT *
  FROM events
  WHERE events.device_id = d.id
  ORDER BY created_at DESC
  LIMIT 3
) e;

Or use ROW_NUMBER() if you need ranking with ties. Do not try to hack DISTINCT ON into an N-per-group query with arrays or string aggregation. It will be slower and less readable than the right tool.

When DISTINCT ON loses

There are cases where a window function is the better choice, even in Postgres.

Ties need explicit handling. If two events for the same device have exactly the same created_at, DISTINCT ON picks one arbitrarily (or based on the next column in the ORDER BY, if you add one). A window function with RANK() or DENSE_RANK() makes tie behavior explicit.

You need the rank value. If the dashboard shows “this is the #1 event and here is its rank among all events,” you need the rank. DISTINCT ON does not compute it.

Portability is a hard requirement. If you maintain the same schema on Postgres and another database, DISTINCT ON will not port. But in practice, most teams do not switch databases. They optimize for the database they run.

Complex grouping expressions. DISTINCT ON works best when the grouping key is a single column or a small set of columns that index well. If your “group” is a computed expression like DATE_TRUNC('hour', created_at), an index may not help, and the planner may choose a sort anyway. In that case, a window function may be no slower and more standard.

The migration: replacing window functions in production

If you already have the window-function query in production, migrating to DISTINCT ON is usually safe. The result set is identical when the query is semantically correct (one row per group, deterministic ordering).

Steps:

Create the covering index in a migration. Use CONCURRENTLY to avoid locking:

CREATE INDEX CONCURRENTLY idx_events_device_created
ON events (device_id, created_at DESC);

Run EXPLAIN (ANALYZE, BUFFERS) on both the old and new queries in a read replica or a staging environment with production-like data size. Verify the buffer hit count drops.
Replace the query. If you use an ORM (Prisma, Drizzle, TypeORM), you may need to drop down to raw SQL. DISTINCT ON is not universally supported in query builders. Prisma does not support it natively as of mid-2026. Drizzle has a distinctOn operator. Raw SQL is always an option:
```
const latestEvents = await prisma.$queryRaw`
  SELECT DISTINCT ON (device_id) *
  FROM events
  ORDER BY device_id, created_at DESC
`;
```
Monitor query duration in your observability tool. The p95 should drop immediately. If it does not, check whether the planner chose a different path. Run ANALYZE events if statistics are stale.

The takeaway

DISTINCT ON is not a secret. It is right there in the Postgres documentation. But most developers learn SQL from tutorials that target the lowest common denominator across databases, so they reach for ROW_NUMBER() for every “top per group” problem. That is overkill. It sorts rows you do not need and allocates memory for partitions you will throw away.

If you run Postgres and your requirement is “the latest row per group,” DISTINCT ON with the right index is the fastest, simplest, and most resource-efficient solution. Create the composite index. Write the shorter query. Watch your dashboard drop from 800 ms to 8 ms. Then spend the time you saved on a harder problem.

A note from Yojji

Query optimization work that looks at execution plans instead of just query shape, that understands the difference between sorting a partition and scanning an index in order, and that tailors indexes to the exact query pattern instead of adding generic covering indexes everywhere, is the kind of database craft that keeps production systems fast as they grow.

Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their senior engineering teams build data-intensive backend systems and performance-tune production Postgres deployments as part of their full-cycle delivery practice, from product discovery through cloud deployment.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Practical Developer

The query everyone writes first

DISTINCT ON: one scan, one row per group

The index that makes it work

Partial indexes for soft deletes and stale data

The three gotchas

When DISTINCT ON loses

The migration: replacing window functions in production

The takeaway

A note from Yojji