PostgreSQL COPY for Bulk Data Operations: Load Millions of Rows in Seconds

The Practical Developer

The Practica · 2026-06-07 · via The Practical Developer

Your ETL job inserts 500,000 rows into PostgreSQL. It takes 47 seconds. The nightly batch window is 30 minutes and growing. Your Django/Rails/Node migration that seeds a million reference records takes so long the CI runner times out. You reach for batching, multi-row INSERT, wrapped transactions, and the runtime drops from 47 seconds to 38. Better, but still painful.

The problem is not the database. The problem is how you talk to the database. Every single INSERT, even inside a transaction, forces PostgreSQL to parse the SQL, plan the statement, check constraints, fire triggers, and write WAL for each row. A million INSERT statements means a million plan cycles, a million trigger evaluations, and a million network round-trips (unless you batch, which just reduces the round-trips, not the planning overhead).

PostgreSQL has a tool built specifically for this job. It is called COPY, it has been part of PostgreSQL for decades, and it avoids the per-statement SQL parsing and planning work entirely. This post shows you how COPY works, how to use it from Node.js, where the sharp edges are, and the benchmark numbers that prove why you should reach for COPY first and INSERT second.

What COPY actually does

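COPY is a single SQL command that PostgreSQL parses once and then feeds with a stream of raw rows. The rows themselves never pass through the parser, the planner, or the general-purpose executor: COPY's own input routine converts each line into a tuple and hands it to storage. The data lands in the table via the same heap-page path as INSERT (constraints, triggers, and index updates still apply), but without the per-statement SQL overhead.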

INSERT path:     SQL string -> parser -> analyzer -> planner -> executor -> storage
COPY path:       raw bytes -> COPY handler -> storage

That shortcut matters. A single COPY command can load 10 million rows faster than 10,000 batched INSERTs, because each batched INSERT is still parsed, analyzed, and planned as its own statement. COPY parses its one statement once for the entire operation.

There are two flavors:

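COPY table_name FROM '/path/to/file' reads a file on the server filesystem. The operating-system user that runs PostgreSQL needs read access to the file, and the database role must be a superuser or a member of pg_read_server_files. You never use this in production unless you control the server filesystem and already have the file there.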
COPY table_name FROM STDIN reads from the client connection. This is the one you use in application code. Your Node.js process streams CSV or binary data over the wire, and PostgreSQL ingests it as fast as the disk can write.

There is also COPY TO for exporting data, which follows the same performance characteristics in reverse.
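To make the two client-side directions concrete, here is a minimal sketch that builds the COPY ... FROM STDIN and COPY ... TO STDOUT statements with quoted identifiers. The helper names (quoteIdent, copyFromStdin, copyToStdout) are made up for this example and are not part of any driver.

```typescript
// Hypothetical helpers: build COPY statements with safely quoted identifiers.

/** Quote a SQL identifier: wrap it in double quotes and double any embedded quote. */
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

function columnList(columns: string[]): string {
  return columns.map(quoteIdent).join(", ");
}

/** COPY ... FROM STDIN: the client streams rows to the server. */
function copyFromStdin(table: string, columns: string[]): string {
  return `COPY ${quoteIdent(table)} (${columnList(columns)}) FROM STDIN (FORMAT csv)`;
}

/** COPY ... TO STDOUT: the server streams rows back to the client. */
function copyToStdout(table: string, columns: string[]): string {
  return `COPY ${quoteIdent(table)} (${columnList(columns)}) TO STDOUT (FORMAT csv, HEADER true)`;
}

console.log(copyFromStdin("users", ["id", "email", "name"]));
// COPY "users" ("id", "email", "name") FROM STDIN (FORMAT csv)
console.log(copyToStdout("users", ["id", "email"]));
// COPY "users" ("id", "email") TO STDOUT (FORMAT csv, HEADER true)
```

Keep in mind that quoted identifiers are case-sensitive in PostgreSQL, so this sketch assumes the names are passed exactly as they are stored in the catalog.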

The Node.js implementation

The pg driver, the standard PostgreSQL client for Node.js, does not stream COPY data on its own. The companion pg-copy-streams package adds that: it wraps a COPY statement in a stream that client.query() can run. Here is the minimal working version.

import { Pool } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function bulkInsertUsers(users: User[]): Promise<void> {
  const client = await pool.connect();

  try {
    // Build a CSV string in memory (see streaming section for the real approach)
    const csvRows = users.map((u) =>
      `${escapeCsv(u.id)},${escapeCsv(u.email)},${escapeCsv(u.name)}`
    );
    const csv = csvRows.join('\n') + '\n';

    // Passing a copyFrom() stream to client.query() returns that Writable stream
    const stream = client.query(
      copyFrom(`COPY users (id, email, name) FROM STDIN (FORMAT csv, DELIMITER ',')`)
    );
    await new Promise<void>((resolve, reject) => {
      // Attach listeners before writing so no event can be missed
      stream.on('finish', () => resolve());
      stream.on('error', reject);
      stream.write(csv);
      stream.end();
    });
  } finally {
    client.release();
  }
}

This is the pattern. A note on the API: client.query() called with a copyFrom() stream from pg-copy-streams returns that stream, which is a Node.js Writable. You pipe or write your CSV data into it, call .end(), and wait for the finish event, which fires once PostgreSQL has completed the COPY.

Why CSV and not binary?

PostgreSQL COPY supports three formats: text, CSV, and binary. CSV is easier to debug, works with any language, and is fast enough for most workloads. Binary is about 15-20% faster because it skips the text-to-native-type conversion on the server side, but the format is finicky to produce and poorly covered by client libraries. Unless you are moving terabytes, use CSV.

Escaping gotchas

CSV sounds simple until a user’s email contains a comma, a quote, or a newline. PostgreSQL’s COPY CSV mode follows standard CSV rules: double-quote any field that contains the delimiter, a quote, or a newline. Double-quotes inside the field are escaped by doubling them ("").

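function escapeCsv(value: string | number | null): string {
  // In CSV mode the default NULL marker is an unquoted empty field (\N belongs to text format)
  if (value === null) return '';
  const str = String(value);
  // Quote empty strings so they are not read back as NULL
  if (str === '') return '""';
  if (str.includes(',') || str.includes('"') || str.includes('\n') || str.includes('\r')) {
    return `"${str.replace(/"/g, '""')}"`;
  }
  return str;
}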

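Skip that escaping and a single "Acme, Inc." or a nickname wrapped in double quotes will corrupt your load. (An apostrophe, as in O'Brien, is harmless in CSV; commas, double quotes, and line breaks are the dangerous characters.) I have seen production ETL jobs silently shift or drop values because the CSV was malformed and PostgreSQL parsed a quote character as a field boundary. Always escape.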
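To see what the quoting rules produce in practice, here is a small self-contained sketch. The helpers toCsvField and toCsvLine are illustrative names and handle only non-null strings and numbers; they are a sketch of the rules described above, not a replacement for a tested CSV library.

```typescript
// Sketch: quote-aware CSV field and line builders (illustrative names).
function toCsvField(value: string | number): string {
  const str = String(value);
  // Quote when the field contains the delimiter, a double quote, or a line break
  if (/[",\r\n]/.test(str)) {
    return `"${str.replace(/"/g, '""')}"`;
  }
  return str;
}

function toCsvLine(fields: Array<string | number>): string {
  return fields.map(toCsvField).join(",") + "\n";
}

console.log(toCsvLine([1, "a@example.com", "Acme, Inc."]).trimEnd());
// 1,a@example.com,"Acme, Inc."
console.log(toCsvLine([2, "b@example.com", 'Bob "Bobby" Lee']).trimEnd());
// 2,b@example.com,"Bob ""Bobby"" Lee"
console.log(toCsvLine([3, "c@example.com", "plain"]).trimEnd());
// 3,c@example.com,plain
```

The embedded comma and the embedded double quotes each force the field into quotes; a field with none of the special characters passes through unchanged.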

Streaming, not buffering

The code above builds the entire CSV string in memory. For 500,000 rows that might be 100MB. For 50 million rows it is 10GB and you will OOM. The correct approach is to stream the data from your source into COPY, so that only a small, bounded buffer is held in memory at any time.

import { Pool } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';
import { Transform } from 'node:stream';
import { pipeline } from 'node:stream/promises';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function streamUsersFromFile(inputPath: string): Promise<void> {
  const client = await pool.connect();

  try {
    // copyFrom() comes from pg-copy-streams; client.query() returns the Writable stream
    const copyStream = client.query(
      copyFrom(`COPY users (id, email, name) FROM STDIN (FORMAT csv)`)
    );

    // A transform that converts a row object to a CSV line
    const toCsv = new Transform({
      objectMode: true,
      transform(row: User, _encoding, callback) {
        const line = `${escapeCsv(row.id)},${escapeCsv(row.email)},${escapeCsv(row.name)}\n`;
        callback(null, line);
      },
    });

    // Readable source (could be a file, a database cursor, an API paginator)
    const source = createUserReadStream(inputPath); // implement this

    await pipeline(source, toCsv, copyStream);

    console.log('COPY completed');
  } finally {
    client.release();
  }
}

The pipeline function from node:stream/promises handles backpressure. If PostgreSQL's COPY cannot consume the data fast enough, the pipeline tells the source to slow down. No unbounded memory growth.
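If your source is an array, a cursor, or a paginated API rather than a file stream, an async generator is one way to get the same bounded-memory behavior. The sketch below is an assumption-laden illustration: UserRow, csvField, and csvChunks are hypothetical names, and the chunk size of 1,000 rows is arbitrary.

```typescript
// Sketch: turn an iterable of rows into CSV text chunks of bounded size.
// UserRow, csvField and csvChunks are illustrative names, not part of pg or Node.js.
interface UserRow {
  id: number;
  email: string;
  name: string;
}

function csvField(value: string | number): string {
  const str = String(value);
  return /[",\r\n]/.test(str) ? `"${str.replace(/"/g, '""')}"` : str;
}

/** Yields one string per `rowsPerChunk` rows, so only one chunk is buffered at a time. */
async function* csvChunks(
  rows: Iterable<UserRow> | AsyncIterable<UserRow>,
  rowsPerChunk = 1000
): AsyncGenerator<string> {
  let buffer = "";
  let count = 0;
  for await (const row of rows) {
    buffer += `${csvField(row.id)},${csvField(row.email)},${csvField(row.name)}\n`;
    if (++count === rowsPerChunk) {
      yield buffer;
      buffer = "";
      count = 0;
    }
  }
  if (count > 0) yield buffer; // flush the final partial chunk
}
```

Because a generator only advances when its consumer asks for the next value, wrapping it with Readable.from(csvChunks(source)) and passing that to pipeline should keep the producer from running ahead of the COPY stream.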

Parallelism note

COPY itself is single-threaded inside the PostgreSQL backend. One COPY command runs in one backend process. If you have a multi-core server and a large data set, you can run multiple COPY commands in separate connections, each loading a different partition of the data. Partition your source by a key (date range, ID range, file shard) and fan out across connections. Total throughput can scale close to linearly until you run out of CPU cores on the database server or saturate disk and WAL throughput.

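async function parallelCopy(
  shards: User[][],
  concurrency: number
): Promise<void> {
  let next = 0; // index of the next shard to hand out
  // Each worker keeps pulling shards until none are left, so no shard is skipped
  const worker = async (): Promise<void> => {
    while (next < shards.length) {
      await bulkInsertUsers(shards[next++]);
    }
  };
  const workers = [];
  for (let i = 0; i < concurrency && i < shards.length; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
}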

Do not exceed max_connections or the CPU core count. Past that, context switching eats the gains.
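Producing the shards is the other half of the fan-out. As one possible approach, the sketch below splits an in-memory array into contiguous shards of near-equal size; splitIntoShards is a hypothetical helper, and a real loader would more likely shard by file or key range at the source.

```typescript
// Sketch: split rows into at most `shardCount` contiguous shards of near-equal size.
// splitIntoShards is an illustrative helper, not part of any library.
function splitIntoShards<T>(rows: readonly T[], shardCount: number): T[][] {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError("shardCount must be a positive integer");
  }
  const shards: T[][] = [];
  const base = Math.floor(rows.length / shardCount);
  const extra = rows.length % shardCount; // the first `extra` shards get one more row
  let start = 0;
  for (let i = 0; i < shardCount && start < rows.length; i++) {
    const size = base + (i < extra ? 1 : 0);
    shards.push(rows.slice(start, start + size));
    start += size;
  }
  return shards;
}

const shards = splitIntoShards([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4);
console.log(shards.map((s) => s.length)); // [ 3, 3, 2, 2 ]
```

Contiguous ranges keep each shard's rows in their original order, which tends to be friendlier to index locality than round-robin assignment.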

COPY TO for exporting

The reverse direction is just as useful. Exporting 10 million rows to CSV for a data scientist or a backup is a single command with no server-side files:

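import { to as copyTo } from 'pg-copy-streams';

async function exportUsers(): Promise<User[]> {
  const client = await pool.connect();
  try {
    // copyTo() comes from pg-copy-streams; client.query() returns a Readable stream
    const stream = client.query(
      copyTo(`COPY users TO STDOUT (FORMAT csv, HEADER true)`)
    );
    // Collect raw chunks first: a chunk boundary can fall in the middle of a row
    const chunks: Buffer[] = [];
    stream.on('data', (chunk: Buffer) => chunks.push(chunk));
    await new Promise((resolve, reject) => {
      stream.on('end', resolve);
      stream.on('error', reject);
    });
    // parseCsvChunk is your own CSV parser; it must skip the header line
    return parseCsvChunk(Buffer.concat(chunks).toString('utf8'));
  } finally {
    client.release();
  }
}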

For huge exports, pipe the COPY TO stream directly to a file or to an HTTP response instead of buffering in memory.
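If you do need to process exported rows one at a time while streaming, remember that chunk boundaries do not line up with row boundaries, and that a quoted CSV field may itself contain a newline. The sketch below shows one way to reassemble complete records; CsvRecordSplitter is a hypothetical class, and it assumes the chunks are already decoded text with "\n" line endings.

```typescript
// Sketch: reassemble complete CSV records from arbitrarily split text chunks.
// CsvRecordSplitter is an illustrative class, not part of pg or Node.js.
class CsvRecordSplitter {
  private pending = ""; // text of the record currently being assembled
  private inQuotes = false; // true while inside a double-quoted field

  /** Feed one chunk; returns every record completed by it (without the line break). */
  push(chunk: string): string[] {
    const records: string[] = [];
    for (const ch of chunk) {
      if (ch === '"') {
        // An escaped quote ("") toggles twice, so the state ends up unchanged
        this.inQuotes = !this.inQuotes;
        this.pending += ch;
      } else if (ch === "\n" && !this.inQuotes) {
        records.push(this.pending);
        this.pending = "";
      } else {
        this.pending += ch;
      }
    }
    return records;
  }

  /** Call once the stream ends; returns a trailing record without a final newline, if any. */
  flush(): string | null {
    const rest = this.pending;
    this.pending = "";
    return rest === "" ? null : rest;
  }
}

const splitter = new CsvRecordSplitter();
console.log(splitter.push('1,a@x.io,"Acme, ').length); // 0 (record still open)
console.log(splitter.push('Inc."\n2,b@x.io,"two\nlines"\n3,c@').length); // 2
console.log(splitter.flush()); // 3,c@
```

The splitter only finds record boundaries; each record still has to go through a CSV field parser. When reading from a byte stream, decode it to text first (for example with the stream's setEncoding('utf8')) so that multi-byte characters are not cut in half.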

Benchmarks: COPY vs INSERT

I tested five methods on a PostgreSQL 16 instance (8 cores, 32GB RAM, NVMe disk) loading 5 million rows into a simple table with an integer primary key, a text email, and a timestamptz column.

Method	Time	Relative to COPY
Single-row INSERT, auto-commit each	472s	66x slower
Batched INSERT (1000 rows), single transaction	41s	5.8x slower
Batched INSERT (1000 rows), unlogged table	12s	1.7x slower
COPY, CSV, streaming, logged table	7.1s	1x (baseline)
COPY, CSV, streaming, unlogged table	4.3s	0.6x

The numbers speak for themselves. Batched INSERT is already 11x faster than single-row INSERT, but COPY is still about 6x faster than batched INSERT on a logged table. On an unlogged table the gap narrows to about 2.8x (12s versus 4.3s), because the WAL writes that dominate the logged runs are gone, but COPY still wins.
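Another way to read the table is as throughput. The short sketch below converts the measured times into rows per second; it uses only the timings and the 5 million row count reported above, and the variable names are illustrative.

```typescript
// Sketch: convert the benchmark timings above into rows per second.
const TOTAL_ROWS = 5_000_000;

const timingsSeconds: Record<string, number> = {
  "Single-row INSERT, auto-commit each": 472,
  "Batched INSERT (1000 rows), single transaction": 41,
  "Batched INSERT (1000 rows), unlogged table": 12,
  "COPY, CSV, streaming, logged table": 7.1,
  "COPY, CSV, streaming, unlogged table": 4.3,
};

function rowsPerSecond(totalRows: number, seconds: number): number {
  return Math.round(totalRows / seconds);
}

for (const [method, seconds] of Object.entries(timingsSeconds)) {
  console.log(`${method}: ${rowsPerSecond(TOTAL_ROWS, seconds)} rows/s`);
}
// Single-row INSERT, auto-commit each: 10593 rows/s
// Batched INSERT (1000 rows), single transaction: 121951 rows/s
// Batched INSERT (1000 rows), unlogged table: 416667 rows/s
// COPY, CSV, streaming, logged table: 704225 rows/s
// COPY, CSV, streaming, unlogged table: 1162791 rows/s
```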

Why not always use unlogged tables?

Unlogged tables skip WAL entirely. They are fast. They are also truncated after a crash or unclean shutdown, and they are not replicated to standby servers. If the power goes out during your 4.3-second COPY, or at any point afterward, those 5 million rows are gone. Use unlogged tables only when:

You are loading a temporary staging table that you will validate and move into the real table afterward.
You have idempotent data that can be re-fetched (cache rebuilds).
You accept the risk and have a recovery plan.

For production data loads, use logged tables and accept the 1.7x speed penalty. The safety is worth it.

The sharp edges

COPY looks simple in a demo, but there are four mistakes that will bite you in production.

1. No error recovery

COPY is all-or-nothing within a single command. If row 2,345,001 has a malformed UTF-8 byte, the entire COPY fails and PostgreSQL rolls back the whole batch. Zero rows inserted.

The fix: validate your data before COPY. Run a transform step that checks types, encoding, and constraint violations. If you cannot validate ahead of time, break the load into smaller chunks (100,000 rows per COPY) so a failure loses only 100K rows instead of 10 million.

async function chunkedCopy(rows: User[], chunkSize = 100000): Promise<void> {
  for (let i = 0; i < rows.length; i += chunkSize) {
    const chunk = rows.slice(i, i + chunkSize);
    await bulkInsertUsers(chunk);
  }
}

Each chunk is its own COPY transaction. One failure does not wipe out previous chunks. Add a retry around each chunk for transient errors.
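A retry wrapper around the chunk loop might look like the following sketch. It is written against an injected loadChunk function (standing in for a COPY-based loader such as bulkInsertUsers) so it can be tested without a database; the function name, the option names, and the backoff values are all assumptions made for this example.

```typescript
// Sketch: load rows chunk by chunk, retrying each chunk a bounded number of times.
// loadChunk stands in for a COPY-based loader; all names here are illustrative.
type ChunkLoader<T> = (chunk: T[]) => Promise<void>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function chunkedCopyWithRetry<T>(
  rows: T[],
  loadChunk: ChunkLoader<T>,
  options: { chunkSize?: number; maxAttempts?: number; baseDelayMs?: number } = {}
): Promise<number> {
  const { chunkSize = 100_000, maxAttempts = 3, baseDelayMs = 200 } = options;
  let loaded = 0;
  for (let i = 0; i < rows.length; i += chunkSize) {
    const chunk = rows.slice(i, i + chunkSize);
    for (let attempt = 1; ; attempt++) {
      try {
        await loadChunk(chunk);
        break;
      } catch (err) {
        // Give up after maxAttempts; chunks loaded earlier stay committed
        if (attempt >= maxAttempts) throw err;
        await sleep(baseDelayMs * 2 ** (attempt - 1)); // exponential backoff
      }
    }
    loaded += chunk.length;
  }
  return loaded;
}
```

Retrying is reasonably safe here because a failed COPY leaves no rows behind, so the same chunk can be sent again. A production version would probably inspect the error and retry only transient failures such as dropped connections, since a malformed row will fail identically on every attempt.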

2. Constraint deferral

COPY checks every row against NOT NULL, CHECK, UNIQUE, and FOREIGN KEY constraints inline. If you have a foreign key to a table that has not been loaded yet, COPY fails.

Options:

Load parent tables first, then children.
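Use SET CONSTRAINTS ALL DEFERRED inside the same transaction, before COPY, if your foreign keys are declared DEFERRABLE (not the default). NOT NULL and CHECK constraints cannot be deferred.
Drop foreign keys before the load and recreate them afterward. This is faster anyway because the foreign key is validated in one pass when it is recreated instead of once per row.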
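For the first option, the load order is a topological sort of the foreign-key graph. Here is a minimal sketch; the dependsOn map and the table names are made up for illustration, and in practice you would derive the map from your schema.

```typescript
// Sketch: order tables so every parent is loaded before its children.
// `dependsOn` maps a table to the tables its foreign keys reference (names are made up).
function loadOrder(dependsOn: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (table: string, path: string[]): void => {
    const s = state.get(table);
    if (s === "done") return;
    if (s === "visiting") {
      throw new Error(`Foreign key cycle: ${[...path, table].join(" -> ")}`);
    }
    state.set(table, "visiting");
    for (const parent of dependsOn[table] ?? []) {
      if (parent !== table) visit(parent, [...path, table]); // ignore self-references
    }
    state.set(table, "done");
    order.push(table); // every parent is already in `order`
  };

  for (const table of Object.keys(dependsOn)) visit(table, []);
  return order;
}

console.log(loadOrder({
  order_items: ["orders", "products"],
  orders: ["users"],
  products: [],
  users: [],
}));
// [ 'users', 'orders', 'products', 'order_items' ]
```

If the function reports a cycle, ordering alone cannot help, and you are left with the other two options: deferred constraints or dropping and recreating the foreign keys.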

3. Trigger activation

COPY fires BEFORE INSERT and AFTER INSERT row triggers for every row, just like INSERT. If you have a trigger that does expensive work (calling an API, writing to another table), COPY will be as slow as INSERT.

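For bulk loads, either disable triggers temporarily or write the trigger function so that it checks a session-level setting and returns early during bulk operations. (INSTEAD OF triggers are no help here: they exist only on views.) Note that ALTER TABLE ... DISABLE TRIGGER applies to every session, not just yours, so concurrent writes will also skip the trigger until you re-enable it.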

ALTER TABLE users DISABLE TRIGGER user_audit_trigger;
-- run COPY
ALTER TABLE users ENABLE TRIGGER user_audit_trigger;

4. Index maintenance

COPY maintains indexes incrementally, same as INSERT. Each new row updates every index on the table. For a table with 4 indexes, loading 5 million rows means 20 million index operations.

The fastest pattern: drop indexes before the load, run COPY, then recreate indexes. PostgreSQL builds indexes from scratch faster than it can update them incrementally, especially on tables that grow significantly during the load.

-- Before COPY
DROP INDEX idx_users_email;
DROP INDEX idx_users_created_at;

-- Run COPY

-- After COPY
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CREATE INDEX CONCURRENTLY idx_users_created_at ON users(created_at);

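Use CREATE INDEX CONCURRENTLY to avoid blocking writes during the index rebuild (a plain CREATE INDEX blocks writes but not reads). The tradeoff is that it takes longer than a blocking CREATE INDEX and cannot run inside a transaction block, but your application stays online.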

COPY is not a silver bullet. Skip it in these cases:

You are inserting a few dozen rows. The connection overhead and stream setup cost more than a simple INSERT. The breakeven point is around 1,000 rows.
You need per-row feedback. COPY reports only the total row count at the end. If you need to know which specific row failed, or you need to call an API for each row, use INSERT with error handling.
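You are doing an INSERT ... ON CONFLICT (upsert). COPY has no upsert or ON CONFLICT syntax as of PostgreSQL 17. (PostgreSQL 17 added an ON_ERROR ignore option, but it only skips rows that fail data type conversion; it does not handle unique-key conflicts.) If you need upsert logic, either use INSERT ... ON CONFLICT directly and accept the performance hit, or COPY into a staging table and upsert from there with a single INSERT ... SELECT ... ON CONFLICT.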
Your data has complex transformations. If each row requires a lookup in another database or a call to an external service, the bottleneck is not the database write. COPY will not help.
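For the upsert case, one common workaround (not part of the benchmark above) is to COPY into a temporary staging table and then run a single INSERT ... SELECT ... ON CONFLICT against the real table. The sketch below only generates the three SQL statements; stagingUpsertPlan, the _staging suffix, and the example column names are assumptions for illustration.

```typescript
// Sketch: SQL for "COPY into a staging table, then upsert". All names are examples.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

interface UpsertPlan {
  createStaging: string; // run first, inside a transaction
  copyIntoStaging: string; // feed this statement with the CSV stream
  upsertFromStaging: string; // run after COPY has finished
}

function stagingUpsertPlan(table: string, columns: string[], keyColumns: string[]): UpsertPlan {
  const staging = quoteIdent(`${table}_staging`);
  const target = quoteIdent(table);
  const cols = columns.map(quoteIdent).join(", ");
  const keys = keyColumns.map(quoteIdent).join(", ");
  const updates = columns
    .filter((c) => !keyColumns.includes(c))
    .map((c) => `${quoteIdent(c)} = EXCLUDED.${quoteIdent(c)}`)
    .join(", ");
  // With no non-key columns there is nothing to update, so fall back to DO NOTHING
  const action = updates === "" ? "DO NOTHING" : `DO UPDATE SET ${updates}`;
  return {
    createStaging: `CREATE TEMP TABLE ${staging} (LIKE ${target} INCLUDING DEFAULTS) ON COMMIT DROP`,
    copyIntoStaging: `COPY ${staging} (${cols}) FROM STDIN (FORMAT csv)`,
    upsertFromStaging:
      `INSERT INTO ${target} (${cols}) SELECT ${cols} FROM ${staging} ` +
      `ON CONFLICT (${keys}) ${action}`,
  };
}

console.log(stagingUpsertPlan("users", ["id", "email", "name"], ["id"]).upsertFromStaging);
```

All three statements need to run on the same connection inside one transaction, because the staging table is dropped at commit. The staging data must also be free of duplicate keys: ON CONFLICT DO UPDATE raises an error if one statement tries to update the same row twice.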

Putting it together: a production loader

Here is the full pattern for a production-grade bulk loader: chunked, with index management and trigger handling. It assumes the rows have already been validated.

import { Pool } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function loadUserBatch(users: User[]): Promise<void> {
  const client = await pool.connect();

  try {
    // DDL is transactional in PostgreSQL: a ROLLBACK restores the indexes and the trigger
    await client.query('BEGIN');

    // Disable the audit trigger (this is table-wide, not session-local)
    await client.query('ALTER TABLE users DISABLE TRIGGER user_audit_trigger');

    // Drop indexes for faster load
    await client.query('DROP INDEX IF EXISTS idx_users_email');
    await client.query('DROP INDEX IF EXISTS idx_users_created_at');

    // Run COPY in chunks to bound memory; all chunks share this one transaction
    const CHUNK = 100_000;
    for (let i = 0; i < users.length; i += CHUNK) {
      const chunk = users.slice(i, i + CHUNK);
      const csv = chunk
        .map((u) => `${escapeCsv(u.id)},${escapeCsv(u.email)},${escapeCsv(u.name)}`)
        .join('\n') + '\n';

      // copyFrom() comes from pg-copy-streams; client.query() returns the Writable stream
      const stream = client.query(
        copyFrom(`COPY users (id, email, name) FROM STDIN (FORMAT csv)`)
      );
      await new Promise<void>((resolve, reject) => {
        // Attach listeners before writing so no event can be missed
        stream.on('finish', () => resolve());
        stream.on('error', (err: Error) => reject(err));
        stream.write(csv);
        stream.end();
      });
    }

    // Rebuild indexes (plain CREATE INDEX: CONCURRENTLY cannot run inside a transaction block)
    await client.query('CREATE INDEX idx_users_email ON users(email)');
    await client.query('CREATE INDEX idx_users_created_at ON users(created_at)');

    // Re-enable triggers
    await client.query('ALTER TABLE users ENABLE TRIGGER user_audit_trigger');

    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}

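That is about 40 lines of TypeScript and it will out-perform any INSERT-based approach by a wide margin. The tradeoffs are deliberate: DROP INDEX takes an exclusive lock on users that is held until COMMIT, so other sessions cannot read or write the table during the load, and the audit trigger does not fire for the loaded rows. In exchange you get the COPY speedup (about 6x over batched INSERT in the benchmark above) plus the savings from building each index once. For a batch job running at 2 AM, that tradeoff is an easy call.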

The practical takeaway

Reach for COPY whenever you move more than 10,000 rows into or out of PostgreSQL. It is the single biggest performance lever the database gives you, and most application code never uses it because the driver APIs are slightly less documented than the simpler query interface.

The mental checklist for a bulk load:

Use COPY ... FROM STDIN (FORMAT csv) from application code.
Stream the data, do not buffer it.
Escape CSV fields properly or use a battle-tested CSV serializer.
Drop indexes before the load, rebuild them after.
Disable non-essential triggers during the load.
Chunk by 100,000 rows for error isolation.
Validate data before COPY to avoid mid-stream failures.

Apply that checklist and your next ETL job will finish before the INSERT-based version finishes its first million rows.

A note from Yojji

Building data pipelines that move millions of rows reliably requires both database depth and careful application design. The same performance-first approach, choosing the right tool for the data volume instead of the most familiar one, is what Yojji applies to backend and data infrastructure projects where slow imports are not an option. Yojji is an international custom software development company founded in 2016 that specializes in the JavaScript ecosystem, cloud platforms, and the architectural patterns that keep production data flowing.
