PostgreSQL WAL Bloat: Why Automatic Management Is Often Insufficient?
Mustafa ERBA · 2026-05-25 · via DEV Community

While managing a production ERP system, I encountered an unexpected situation where the database disk space was filling up rapidly. An alarm at 03:14 AM notified me that disk usage had reached 95%. My initial investigation revealed an abnormal growth in PostgreSQL's WAL (Write-Ahead Log) directory. This was a WAL bloat problem that could severely impact system performance and even lead to data loss.

As I delved into the root cause of this issue, I better understood why PostgreSQL's automatic WAL management mechanisms are often insufficient in many scenarios. In this post, I will discuss in detail what WAL bloat is, why it occurs, and why automatic management tools are not always the solution. My aim is to provide guidance to system administrators who are experiencing or might experience this problem and to offer more proactive approaches.

What is WAL (Write-Ahead Log) and Why Is It Important?

In PostgreSQL, WAL is a fundamental mechanism used to ensure the durability of database changes. Any data modification (INSERT, UPDATE, DELETE) is first recorded in WAL files and only later written to the actual data files on disk by background processes such as the checkpointer. This prevents data loss in situations like system crashes or power outages, because changes that had not yet reached the data files can be replayed from WAL. WAL files are also critical for recovery and replication.

WAL is stored as a sequence of fixed-size segment files (16 MB by default) under the pg_wal directory. As long as the database server is handling writes, new segments are continuously generated, while segments that are no longer needed are removed or recycled after checkpoints. WAL file management is vital for PostgreSQL's stability and performance. If old segments cannot be removed, disk space runs out quickly; this is the WAL bloat problem.
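To make the segment layout concrete, here is a minimal Python sketch that decodes a WAL segment file name and estimates how much WAL lies between two segments. It assumes the default 16 MB segment size; the function names are illustrative and not part of any PostgreSQL tool.

```python
import re

# A WAL segment name is 24 uppercase hex digits:
# 8 for the timeline, 8 for the high part and 8 for the low part of the position.
SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")
SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default wal_segment_size


def parse_segment_name(name):
    """Return (timeline, segment_number) for a WAL segment file name."""
    if not SEGMENT_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg_id = int(name[16:24], 16)
    segments_per_log = 0x100000000 // SEGMENT_SIZE  # 256 with 16 MB segments
    return timeline, log_id * segments_per_log + seg_id


def bytes_between(older, newer):
    """Approximate WAL volume between two segments on the same timeline."""
    old_tl, old_seg = parse_segment_name(older)
    new_tl, new_seg = parse_segment_name(newer)
    if old_tl != new_tl:
        raise ValueError("segments are on different timelines")
    return (new_seg - old_seg) * SEGMENT_SIZE
```

Comparing the oldest and newest segment names in pg_wal this way gives a quick estimate of how much WAL is being held back.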

ℹ️ WAL File Management

PostgreSQL has several mechanisms for managing WAL files. With archive_mode enabled, archive_command is run for each completed WAL segment to copy it to an archive. wal_keep_segments (replaced by wal_keep_size in PostgreSQL 13 and later) sets the minimum amount of past WAL kept in pg_wal so that standby servers can still fetch it. Replication slots go further and retain WAL until the consumer has received it, with no upper bound unless max_slot_wal_keep_size is set. However, these settings can be insufficient in certain scenarios.

Root Causes of WAL Bloat

There can be multiple reasons behind the occurrence of WAL bloat. The first is intense write activity in the database. High transaction volumes cause WAL files to be generated very quickly. If this generation rate exceeds the rate at which WAL files are cleaned up or archived, the pg_wal directory will fill up rapidly.
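The rate argument above is simple arithmetic, sketched here in Python. The function and its parameter names are hypothetical, and the rates would have to come from your own measurements.

```python
from typing import Optional


def hours_until_full(free_bytes: float,
                     generated_per_hour: float,
                     removed_per_hour: float) -> Optional[float]:
    """Estimate hours until the WAL volume is full.

    Returns None when removal keeps up with generation,
    i.e. the directory is not growing.
    """
    net_growth = generated_per_hour - removed_per_hour
    if net_growth <= 0:
        return None
    return free_bytes / net_growth
```

For example, with 100 GiB free, 12 GiB of WAL generated per hour, and only 2 GiB removed per hour, the disk fills in about ten hours.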

The second common cause is anything that prevents WAL files from being removed. PostgreSQL keeps a segment until it has been archived successfully and until no replication slot needs it any longer. If a replica attached through a replication slot falls behind or goes offline, or if the archiving command keeps failing, WAL files accumulate. Typical triggers are network problems or a full or unreachable archive target.

Thirdly, incorrectly configured WAL parameters can also lead to this problem. For example, setting wal_keep_segments (or wal_keep_size) too high retains far more WAL than any replica needs. Conversely, with archive_mode off and a low wal_keep_size, a lagging replica can find that the WAL it needs has already been recycled and lose synchronization. That failure is not bloat in itself, but it is the other side of the same trade-off, and a replication slot added to prevent it makes retention unbounded unless a limit is configured.

Limitations of Automatic Management Tools

PostgreSQL's automatic WAL management features like archive_mode, archive_command, and wal_keep_segments (wal_keep_size) work well in most standard scenarios. However, these mechanisms have limitations: none of them reacts when a consumer of WAL stops keeping up, so they can be insufficient in systems with high and variable workloads.

For example, wal_keep_segments (or wal_keep_size) only guarantees a minimum amount of retained WAL; by itself it does not make pg_wal grow without bound. The unbounded case arises when a replica is attached through a replication slot: if that replica goes offline or cannot receive WAL due to network issues, the primary keeps every segment the slot still needs, and the pg_wal directory can fill up quickly. Once the disk is full, PostgreSQL can no longer write WAL and shuts down, which means downtime. A fixed wal_keep_size is also hard to get right, because it cannot adapt to a dynamically changing workload.

Another issue is the WAL archiving process itself. archive_command copies each completed WAL segment to an archive location. If this command fails (e.g., because the target storage is full or lacks permissions), it returns a nonzero exit status, and PostgreSQL keeps the segment in pg_wal and retries the command until it succeeds. A segment is removed or recycled only after archiving has been reported successful. A persistently failing command therefore turns into a silent disaster: the server keeps running while WAL piles up.
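The contract that matters here is the exit status: PostgreSQL treats a segment as archived only when the command returns zero. The following Python sketch shows one way an archive helper could honor that contract. The function name and the temporary-file convention are assumptions for illustration, not the setup used in this incident.

```python
import os
import shutil
import sys


def archive_segment(src_path: str, archive_dir: str) -> int:
    """Copy one WAL file into archive_dir.

    Returns 0 only when the file is completely stored; any other
    value tells PostgreSQL to keep the segment and retry later.
    """
    dest = os.path.join(archive_dir, os.path.basename(src_path))
    if os.path.exists(dest):
        # Never overwrite an existing archive file.
        return 1
    tmp = dest + ".tmp"
    try:
        shutil.copyfile(src_path, tmp)
        # Flush the copy to stable storage before making it visible.
        with open(tmp, "r+b") as fh:
            os.fsync(fh.fileno())
        os.rename(tmp, dest)
    except OSError as exc:
        print(f"archiving {src_path} failed: {exc}", file=sys.stderr)
        if os.path.exists(tmp):
            os.remove(tmp)
        return 1
    return 0
```

A small wrapper script would pass the segment path (the %p placeholder in archive_command) and the archive directory to this function and exit with its return value. Failures are written to stderr, so they appear in the server log, where they should be alerted on.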

⚠️ Risk of Archiving Failure

In one incident I experienced with a client, the WAL archiving command was failing because the target storage unit ran out of space. This situation went unnoticed initially, as the system was still running. However, over time, the pg_wal directory filled up, and database write operations began to slow down. Eventually, the system reached a breaking point. Such situations reveal the hidden weaknesses of automatic mechanisms.

Real-World Scenario: WAL Bloat in a Production ERP System

While I was working on an ERP system for a manufacturing firm, the database disk was filling up rapidly under intensive data entry and reporting operations. When I investigated the pg_wal directory to find the source of the problem, I found hundreds of gigabytes of WAL files. Most of these files had been neither streamed to the replication server nor archived.

To diagnose the situation, I followed several steps:

  1. Monitor Disk Usage: I checked disk usage with the df -h command, which reports usage per filesystem. Looking at the pg_wal directory itself (for example with du -sh), I saw that it was occupying significantly more space than expected.
  2. Examine WAL Files: I listed the directory with ls -lhS /var/lib/postgresql/14/main/pg_wal/. WAL segments have a fixed size (16 MB by default), so the signal was the number of files rather than the size of any single file: most had been created recently, and the count was increasing rapidly.
  3. Check Replication Status: I queried the pg_stat_replication view to check the status of the replication server. I found that the replication server was significantly behind.
  4. Check Archiving Status: I examined PostgreSQL log files to check if the WAL archiving command was running and if there were any errors. The logs showed repeated errors indicating that the archiving command was failing due to insufficient space in the target storage.
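The replication check in step 3 comes down to comparing WAL positions (LSNs). Below is a small Python sketch of that arithmetic, together with a query that asks the server to do the same calculation. The helper names are illustrative; the query uses the standard pg_stat_replication view and the pg_wal_lsn_diff() function available in PostgreSQL 10 and later.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


# The same calculation done server-side, one row per connected replica.
REPLICATION_LAG_SQL = """
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
"""
```

A replica that is disconnected does not appear in pg_stat_replication at all, so a missing row is as important a signal as a large lag value.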

Based on these analyses, I understood that the problem stemmed from both replication lag and WAL archiving failure. I saw that the automatic management mechanisms were insufficient in this complex scenario.

Solution: Proactive Approaches and Manual Interventions

To resolve the WAL bloat issue and prevent its recurrence, adopting proactive approaches is essential. Instead of relying solely on automatic settings, it's crucial to deeply understand the system's behavior and perform manual interventions when necessary.

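The first step is to continuously monitor the WAL generation rate and the cleanup/archiving rate. The pg_stat_wal view (PostgreSQL 14 and later) reports cumulative WAL generation statistics such as wal_records and wal_bytes; sampling wal_bytes at intervals gives the generation rate. The number and size of the segments currently in pg_wal can be read with the pg_ls_waldir() function. The pg_stat_archiver view tracks the archiving process: the count and time of the last successful and the last failed archive attempts. Regularly collecting and analyzing these metrics allows us to detect potential issues early.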
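As one possible check on the archiving side, the sketch below interprets two timestamps from the archiver statistics view (pg_stat_archiver in current PostgreSQL releases): last_archived_time and last_failed_time. The function name is hypothetical, and fetching the row from the database is left out.

```python
from datetime import datetime
from typing import Optional


def archiver_is_failing(last_archived_time: Optional[datetime],
                        last_failed_time: Optional[datetime]) -> bool:
    """True when the most recent archive attempt was a failure."""
    if last_failed_time is None:
        # No failure has ever been recorded.
        return False
    if last_archived_time is None:
        # Failures exist but nothing has ever been archived.
        return True
    return last_failed_time > last_archived_time
```

Alerting on this condition would have surfaced a full archive target long before pg_wal filled the disk.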

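If we encounter WAL bloat, the tempting first intervention is to manually clean up old WAL files. This must be done with great care: deleting a segment that is still needed for crash recovery, replication, or archiving can leave the database unrecoverable or break replicas and backups. Tools like pg_waldump can inspect the contents of WAL files, but they do not tell you whether a file is still needed; that depends on the latest checkpoint's REDO location (shown by pg_controldata), on the restart_lsn of every replication slot, and on archive status. The safest approach is to fix the underlying cause and let PostgreSQL's own mechanisms remove the files.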

If the problem is caused by the failure of the archiving command, the first priority is to fix the issue with the target storage space. This might involve freeing up disk space, correcting access permissions, or changing the archiving destination. Once the issue is resolved, PostgreSQL can be allowed to clean up WAL files automatically.
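To see how large the archiving backlog is, you can count the status files PostgreSQL keeps in pg_wal/archive_status: a file named after the segment with a .ready suffix marks a segment that is still waiting to be archived. A minimal Python sketch (the function name is illustrative):

```python
import os
from typing import List


def pending_archive_segments(pg_wal_dir: str) -> List[str]:
    """Return the names of WAL files still waiting to be archived."""
    status_dir = os.path.join(pg_wal_dir, "archive_status")
    suffix = ".ready"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(status_dir)
        if name.endswith(suffix)
    )
```

After the target storage has been fixed, this list should shrink steadily as the archiver works through the backlog.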

💡 Manual WAL Cleanup (With Caution!)

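If you need to manually clean the pg_wal directory in an emergency, ensure you are targeting the correct files using the find command before deleting anything. WAL segment names are 24 uppercase hexadecimal characters and contain no '-', so the pattern has to match that form. For example, to list WAL segments last modified more than 7 days ago (GNU find):

find /var/lib/postgresql/14/main/pg_wal/ -maxdepth 1 -type f -regextype posix-extended -regex '.*/[0-9A-F]{24}' -mtime +7 -print

This command only prints the matching files. Once you have verified that none of them is still needed for crash recovery, replication, or archiving, replace -print with -delete. Modification time is not a reliable indicator of whether a segment is still needed, so this method is risky and should be used with caution in production environments.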
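A less fragile way to build a candidate list is to compare segment names rather than modification times. The Python sketch below only selects names and deletes nothing. It assumes you have already determined the oldest segment that must be kept (for example from the "Latest checkpoint's REDO WAL file" line of pg_controldata, and after checking replication slots and archive status). It compares only the position part of the name and ignores the timeline, which is a simplification.

```python
import re
from typing import Iterable, List

SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def segments_older_than(names: Iterable[str], oldest_needed: str) -> List[str]:
    """Return segment names that logically precede oldest_needed.

    Dry run only: this function selects names, it never deletes files.
    """
    if not SEGMENT_NAME.match(oldest_needed):
        raise ValueError(f"not a WAL segment name: {oldest_needed!r}")
    cutoff = oldest_needed[8:]  # position part, timeline ignored
    return sorted(
        name for name in names
        if SEGMENT_NAME.match(name) and name[8:] < cutoff
    )
```

Feeding it the output of os.listdir() on pg_wal skips history files, the archive_status directory, and anything else that is not a plain segment.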

Future Strategies and Performance Optimization

Merely performing immediate interventions is not enough to permanently solve the WAL bloat problem. It's necessary to develop long-term strategies and optimize database performance. These strategies include reviewing database schemas, optimizing queries, and reducing unnecessary write operations.

For instance, intensive repetitive updates or delete operations increase WAL file generation. Making such operations more efficient can reduce WAL production. Additionally, proper configuration of autovacuum settings is also important. autovacuum improves performance by cleaning up dead tuples and updating database statistics, which can indirectly affect WAL usage.

Optimizing PostgreSQL's WAL-related parameters (e.g., wal_level, full_page_writes, wal_buffers) according to your workload and hardware can also enhance performance. However, these settings should always be tested carefully, and their effects understood. Incorrect settings can degrade performance or increase the risk of data loss.

Finally, continuously monitoring WAL generation, archiving status, and replication lag using modern monitoring tools allows you to proactively detect potential issues. This way, you can take preventive measures before critical problems like WAL bloat arise.
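If no monitoring stack is in place yet, even a small script run from a scheduler helps. The Python sketch below measures the size of the pg_wal directory and maps it to an alert level. The thresholds and function names are assumptions to adapt to your own disk size.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Total size of the regular files directly inside path."""
    total = 0
    for entry in os.scandir(path):
        # Subdirectories such as archive_status are skipped.
        if entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total


def wal_alert_level(size_bytes: int, warn_bytes: int, crit_bytes: int) -> str:
    """Map the pg_wal size to 'ok', 'warning' or 'critical'."""
    if size_bytes >= crit_bytes:
        return "critical"
    if size_bytes >= warn_bytes:
        return "warning"
    return "ok"
```

Setting the warning threshold a little above the normal steady-state size of pg_wal leaves time to react before the disk is actually full.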

Conclusion

WAL bloat in PostgreSQL is a problem that seriously threatens system stability and performance. While automatic WAL management mechanisms are sufficient in most cases, they can fall short in situations involving high workloads, network issues, or configuration errors. Therefore, proactively monitoring WAL generation, archiving, and replication status, detecting problems early, and performing manual interventions when necessary are of vital importance. Optimizing database schemas, improving queries, and correctly configuring background processes like autovacuum are also part of long-term solutions. By following these steps, you can ensure the stability and performance of your PostgreSQL databases.