`PostgreSQL`: From Metrics to Query Plan Analysis

Ogenki

Self-hosted LLM stack: a solid foundation for an open-weight platform built to evolve A few months with `Claude Code`: tips and workflows that helped me `Agentic Coding`: concepts and hands-on Platform Engineering use cases `VictoriaLogs`: What if logs management became simple and performant? `VictoriaMetrics` : Effective alerts, from theory to practice 🛠️ Harness the Power of `VictoriaMetrics` and `Grafana` Operators for Metrics Management `Dagger`: The missing piece of the developer experience? `TLS` with Gateway API: Efficient and Secure Management of Public and Private Certificates Going Further with `Crossplane`: Compositions and Functions Beyond Traditional VPNs: Simplifying Cloud Access with `Tailscale` `Gateway API`: Can I replace my Ingress Controller with `Cilium`? Applying GitOps Principles to Infrastructure: An overview of `tf-controller` `CloudNativePG`: An easy way to run PostgreSQL on Kubernetes 100% `GitOps` using Flux My Kubernetes cluster (GKE) with `Crossplane` Manage tools versions with `asdf` Helm workshop: Templating exercises Helm workshop: Build your first chart Helm workshop: Lifecycle operations Helm workshop: Ecosystem Helm workshop: Third party charts Helm workshop Kubernetes workshop: Manage permissions in Kubernetes Kubernetes workshop: Troubleshooting Kubernetes workshop: Resources allocation and autoscaling Kubernetes workshop: Complete application stack Kubernetes workshop: Local environment Run an application on Kubernetes Kubernetes workshop

2025-11-15 · via Ogenki

Having basic observability when running PostgreSQL in production isn't optional. Whether it's tracking the Golden Signals, monitoring slow queries, or analyzing connection patterns, this often requires plumbing multiple tools together, configuring various exporters, and manually extracting queries for analysis.

❓ What if we could achieve complete production-grade monitoring with minimal configuration?

In this article, we'll explore how to enhance CloudNativePG's already robust monitoring with simple, effective, and easy-to-use query performance analysis capabilities—leveraging the power of tools at our disposal: Vector, VictoriaMetrics, and VictoriaLogs.

📊 Included with CloudNativePG

When you deploy a PostgreSQL cluster with CNPG, the operator exposes comprehensive metrics via a dedicated endpoint on each PostgreSQL instance (port 9187). While Grafana dashboards are not included by default, CloudNativePG provides official dashboard templates that can be deployed declaratively.

The metrics exposed include:

Database Operations: Transaction rates, queries per second, tuple statistics
Replication Status: Lag, streaming state, synchronization metrics
Resource Utilization: Connections, cache hit ratio, buffer statistics
System Health: Instance status, failover events, backup states

GitOps and Kubernetes Operators

The examples in this article come from configurations available in the Cloud Native Ref repository.
It leverages several operators, including CloudNativePG for PostgreSQL management, VictoriaMetrics for metrics collection, and VictoriaLogs for log collection.

This project aims to quickly bootstrap a complete platform that follows best practices in automation, monitoring, security, and more.
Comments and contributions are welcome 🙏

Collecting Metrics with VictoriaMetrics

CloudNativePG automatically exposes metrics on each PostgreSQL pod. Collecting them is enabled by setting this Helm value:

1# CloudNativePG Helm chart
2monitoring:
3  podMonitorEnabled: true

This simple configuration creates a PodMonitor (Prometheus Operator resource) that is automatically converted by the VictoriaMetrics operator into a compatible native resource. Metrics from all PostgreSQL pods (primary and replicas) are thus collected and available in VictoriaMetrics.

Prometheus Compatibility

The VictoriaMetrics operator automatically converts Prometheus Operator resources (PodMonitor, ServiceMonitor, etc.) into their VictoriaMetrics equivalents. This transparent conversion allows using CloudNativePG without modification, while benefiting from VictoriaMetrics as the storage backend.

Main Metrics to Monitor

CloudNativePG exposes metrics that align perfectly with the Golden Signals methodology discussed in previous articles:

Latency ⏳

1# Average query duration
2rate(cnpg_backends_total_seconds_sum[5m]) / rate(cnpg_backends_total_seconds_count[5m])

Traffic 📶

1# Transactions per second
2rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])

Errors ❌

1# Connection failures and deadlocks
2rate(pg_stat_database_deadlocks[5m])
3rate(cnpg_pg_postmaster_start_time_seconds[5m])

Saturation 📈

1# Connection pool usage
2cnpg_backends_total / cnpg_pg_settings_max_connections
3
4# Cache hit ratio (should be > 95%)
5sum(rate(pg_stat_database_blks_hit[5m])) /
6  (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m])))

Visualization with Grafana

Using the Grafana Operator explored in previous articles, we can deploy CNPG dashboards declaratively:

 1apiVersion: grafana.integreatly.org/v1beta1
 2kind: GrafanaDashboard
 3metadata:
 4  name: databases-cloudnative-pg
 5spec:
 6  allowCrossNamespaceImport: true
 7  folderRef: "databases"
 8  datasources:
 9    - inputName: "DS_PROMETHEUS"
10      datasourceName: "VictoriaMetrics"
11  instanceSelector:
12    matchLabels:
13      dashboards: "grafana"
14  url: "https://grafana.com/api/dashboards/20417/revisions/4/download"

The dashboard provides detailed views of our PostgreSQL clusters, including replication lag, query performance, and resource utilization.

🔍 Understanding Query Performance

While metrics give us the "what" and "when," they don't always tell us the "why." This is where query performance analysis becomes essential. Knowing that queries are slow is useful; understanding why they're slow often leads us to optimization opportunities.

Traditional PostgreSQL query analysis requires manually running EXPLAIN and EXPLAIN ANALYZE commands. While powerful, this approach has limitations:

Reactive: You only analyze queries you suspect are problematic
Manual: Requires active investigation by DBAs
Point-in-time: Captures current execution plan, not historical trends
Incomplete: Difficult to correlate with production load patterns

Ideally, we need automatic and continuous query plan capture that allows us to:

Automatically identify slow queries
Track execution plan changes over time
Correlate query performance with system metrics
Debug performance issues without manually reproducing them

This is exactly what some managed solutions offer. But can we achieve the same on Kubernetes?

✨ Query Plan History: Implementation with Open Source Tools

The good news is that we can build a sophisticated query performance monitoring system that rivals commercial offerings with minimal configuration.

The Architecture

Our solution leverages two PostgreSQL extensions and a configuration parameter, integrating them with the VictoriaMetrics ecosystem:

PostgreSQL Extensions:

pg_stat_statements: Aggregates query execution statistics
auto_explain: Automatically captures execution plans for slow queries

PostgreSQL Configuration:

compute_query_id: Parameter that generates unique identifiers for query correlation

Observability Stack:

VictoriaMetrics: Stores query metrics from pg_stat_statements
VictoriaLogs: Stores execution plans with query correlation
Vector: Parses PostgreSQL logs and extracts execution plans
Grafana: Visualizes performance data and enables plan history exploration

The key is the link between all these elements: the correlation between metrics and logs using the query identifier. This allows us to:

See that a query is slow (from metrics)
Click to view its execution plan history (from logs)
Identify plan changes that caused performance regressions

I've called this feature "Performance Insights". Any resemblance to an existing solution would be purely coincidental 😆.

Enabling Performance Insights

Thanks to CloudNativePG's "Managed Extensions" feature, enabling PostgreSQL extensons is very easy.

🏗️ Platform Engineering: The Right Level of Abstraction

One of the key principles of platform engineering is providing the right level of abstraction to application developers. They shouldn't need to understand PostgreSQL internals or memorize 15+ PostgreSQL-specific configuration parameters.

This is where Crossplane compositions shine. In the Cloud Native Ref project, we use Crossplane with KCL (Kubernetes Configuration Language) to create a higher-level abstraction called SQLInstance.

Without Composition (Raw CNPG Cluster):

 1apiVersion: postgresql.cnpg.io/v1
 2kind: Cluster
 3metadata:
 4  name: myapp-postgres
 5spec:
 6  instances: 3
 7  postgresql:
 8    shared_preload_libraries:
 9      - pg_stat_statements
10      - auto_explain
11    parameters:
12      pg_stat_statements.max: "10000"
13      pg_stat_statements.track: all
14      pg_stat_statements.track_utility: "on"
15      pg_stat_statements.track_planning: "on"
16      pg_stat_statements.save: "on"
17      auto_explain.log_format: json
18      auto_explain.log_min_duration: "1000"
19      auto_explain.log_analyze: "on"
20      auto_explain.log_buffers: "on"
21      auto_explain.log_timing: "off"  # Optimization to reduce overhead (PostgreSQL default: on)
22      auto_explain.log_triggers: "on"
23      auto_explain.log_verbose: "on"
24      auto_explain.log_nested_statements: "on"
25      auto_explain.sample_rate: "0.2"
26      compute_query_id: on
27      track_activity_query_size: 2048
28      track_io_timing: "on"
29      log_min_duration_statement: 1000
30      # ... and more

With Composition (Platform Engineering Approach):

 1apiVersion: cloud.ogenki.io/v1alpha1
 2kind: App
 3metadata:
 4  name: myapp
 5spec:
 6  sqlInstance:
 7    enabled: true
 8    size: small
 9    storageSize: 20Gi
10    instances: 3
11    performanceInsights:
12      enabled: true
13      explain:
14        sampleRate: 0.2       # 20% sampling (default: safe for production)
15        minDuration: 1000     # Log queries > 1 second (default)
16      logStatement: none      # Optional: none (default) / ddl / mod / all

The Crossplane composition SQLInstance handles all the complexity.

This composition approach provides several benefits:

Developer Experience: Application developers don't need PostgreSQL expertise
Consistency: Performance monitoring is configured uniformly across all databases
Maintainability: Platform team controls monitoring configuration centrally
Scalability: Easy to update monitoring parameters for all instances
Discoverability: Developers can browse available options (performanceInsights: true) rather than memorizing parameter names

Platform Engineering Principle

The best abstractions hide complexity without limiting power. Developers get performance insights with simple parameters, while the platform team is still able to fine-tune the underlying PostgreSQL configuration for advanced use cases.

Understanding the Configuration

Let's break down what each component does:

pg_stat_statements: This extension tracks execution statistics for all SQL statements executed by a server. It records:

Total execution time and number of calls
Rows processed and returned
Buffer hits and reads
Query planning time

auto_explain: Automatically logs execution plans for queries exceeding a duration threshold. Key parameters include:

log_format: json: Structured output for parsing
log_min_duration: 1000: Capture queries taking more than 1 second (default)
log_analyze: on: Include actual row counts (includes ANALYZE data from the actual execution)
sample_rate: 0.2: Sample 20% of slow queries to reduce overhead (composition default; PostgreSQL default is 1.0)

compute_query_id: The correlation key that ties everything together. This generates a unique identifier for each query that appears in both pg_stat_statements metrics and auto_explain logs.

Composition Default Values

The SQLInstance composition sets production-safe defaults (different from PostgreSQL defaults):

sampleRate: 0.2 → 20% sampling (PostgreSQL default: 1.0 = 100%)
minDuration: 1000ms → Queries > 1 second (PostgreSQL default: -1 = disabled)

For debugging, you can override these:

sampleRate: 1.0 → 100% of slow queries
minDuration: 0 → All queries, even the fastest ones

Vector Log Pipeline Configuration

Here's what Vector does concretely - transforming a PostgreSQL auto_explain log into an indexable event:

Raw Log (CloudNativePG)	After Vector Parsing
{ "timestamp": "2025-01-15T14:32:18.456Z", "message": "{\"level\":\"info\", \"record\":{\"query_id\":\"8765432109876543210\", \"database_name\":\"production\", \"message\":\"duration: 245.678 ms plan:\\n{...}\" } }", "kubernetes": { "pod_labels": { "cnpg.io/cluster": "myapp-postgres" } } }	{ "_time": "2025-01-15T14:32:18.456Z", "cluster_name": "myapp-postgres", "namespace": "apps", "database": "production", "query_id": "8765432109876543210", "duration_ms": 245.678, "query_text": "SELECT users.email...", "plan_json": { "Node Type": "Hash Join", "..." } }

Raw Log (CloudNativePG)

After Vector Parsing

{
  "timestamp": "2025-01-15T14:32:18.456Z",
  "message": "{\"level\":\"info\",
    \"record\":{\"query_id\":\"8765432109876543210\",
      \"database_name\":\"production\",
      \"message\":\"duration: 245.678 ms plan:\\n{...}\"
    }
  }",
  "kubernetes": {
    "pod_labels": {
      "cnpg.io/cluster": "myapp-postgres"
    }
  }
}

{
  "_time": "2025-01-15T14:32:18.456Z",
  "cluster_name": "myapp-postgres",
  "namespace": "apps",
  "database": "production",
  "query_id": "8765432109876543210",
  "duration_ms": 245.678,
  "query_text": "SELECT users.email...",
  "plan_json": {
    "Node Type": "Hash Join",
    "..."
  }
}

3-Step Pipeline

The Vector pipeline consists of 3 transforms and 2 sinks (complete configuration):

1. Parse CloudNativePG JSON logs

1if .kubernetes.container_name == "postgres" && exists(.kubernetes.pod_labels."cnpg.io/cluster") {
2  .log = parse_json(.message)
3}

2. Filter for execution plans

1exists(.log.record.message) && contains(.log.record.message, "plan:")

3. Extract metadata and plan

1.query_id = to_string!(.log.record.query_id)  # Correlation key
2.cluster_name = .kubernetes.pod_labels."cnpg.io/cluster"
3.database = .log.record.database_name
4# Parse plan JSON from "duration: X ms plan: {...}"
5.plan_json = parse_json(split(.log.record.message, "plan:")[1])

Parsed events are sent to two sinks:

Successful plans → VictoriaLogs with indexing on cluster_name,namespace,database,query_id
Parse failures → Separate stream for debugging

The Key: Correlation via query_id

The critical element is query_id which appears in both systems:

VictoriaMetrics: pg_stat_statements{queryid="8765432109876543210"} (metrics)
VictoriaLogs: {query_id="8765432109876543210"} (plans)

This correlation allows instant jumping from a performance metric to execution plan history.

🔬 Analyzing Query Performance in Action

Once we've identified a problematic query, we can view its execution plan history in VictoriaLogs. Using the query_id from pg_stat_statements, we can query VictoriaLogs:

1# Find all execution plans for a specific query
2{cluster_name="myapp-postgres", query_id="1234567890"} | limit 50

This shows us:

All captured execution plans for this query over time
Plan variations (e.g., index scans vs. sequential scans)
Actual row counts and execution times
Buffer usage and I/O statistics

Understanding EXPLAIN Output

When auto_explain captures a plan, it provides detailed information:

 1{
 2  "Query Text": "SELECT * FROM users WHERE email = ?",
 3  "Query Identifier": 1234567890,
 4  "Duration": 1567.234,
 5  "Plan": {
 6    "Node Type": "Seq Scan",
 7    "Relation Name": "users",
 8    "Actual Rows": 1,
 9    "Actual Loops": 1,
10    "Actual Total Time": 1567.123,
11    "Shared Hit Blocks": 0,
12    "Shared Read Blocks": 54321
13  }
14}

Key insights from this plan:

Sequential Scan: Scanning the entire table instead of using an index
High block reads: 54,321 blocks read from disk (poor cache usage)
Single row returned: Despite scanning the entire table

This immediately suggests the need for an index on the email column.

🎨 Visualizing Execution Plans with pev2

Understanding complex execution plans from logs can be challenging. This is where pev2 (PostgreSQL Explain Visualizer 2) becomes very useful. It's a web tool that transforms JSON execution plans into interactive, visual diagrams.

To ensure sensitive query data never leaves the network, pev2 is self-hosted in the cluster via the App composition. This once again demonstrates the platform abstraction level: deploying a static web tool uses the same declarative API as a complete application with a database.

 1apiVersion: cloud.ogenki.io/v1alpha1
 2kind: App
 3metadata:
 4  name: xplane-pev2
 5  namespace: tooling
 6spec:
 7  image:
 8    repository: ghcr.io/smana/pev2
 9    tag: "v1.17.0"
10
11  resources:
12    requests:
13      cpu: "10m"
14      memory: "32Mi"
15    limits:
16      cpu: "300m"
17      memory: "128Mi"
18
19  # Accessible only via Tailscale VPN at: https://pev2.priv.cloud.ogenki.io
20  route:
21    enabled: true
22    hostname: "pev2" # Results in: pev2.priv.cloud.ogenki.io

Analyzing with Grafana

Grafana integration allows quick identification of problematic queries and navigation to their execution plans.

Performance Analysis Dashboard

Grafana Dashboard - Performance Analysis

This dashboard displays key metrics from pg_stat_statements: top queries by total duration, average latency, number of calls. Each query_id is clickable to explore details.

Correlation Dashboard

Grafana Dashboard - Metrics/Logs Correlation

This dashboard correlates metrics (VictoriaMetrics) with execution plans (VictoriaLogs) for a specific query. It shows performance evolution and plan changes over time.

Workflow: From Grafana to pev2

The video below shows the complete investigation workflow: from identifying a slow query in Grafana to visual plan analysis with pev2.

Workflow steps:

Identify the slow query in the Grafana dashboard (pg_stat_statements metrics)
Click on the query_id to view plan history in VictoriaLogs
Copy the JSON plan and open pev2 (https://pev2.priv.cloud.ogenki.io)
Paste the plan (Ctrl+V) to visualize execution

Leveraging pev2

pev2 transforms JSON plans into interactive diagrams that instantly reveal:

Bottlenecks: Larger nodes = higher execution time (orange/yellow badges = warnings)
Planner estimates: Discrepancies between estimated and actual rows (e.g., "under estimated by 443×" visible on the main Hash Join)
Inefficient sequential scans: Indicators on large tables suggesting missing indexes
Join strategies: Hash Join, Nested Loop, Merge Join with their respective costs
I/O patterns: Ratios of blocks read from disk vs. cache (buffer hits)

The interactive interface allows clicking on each node to view details (cost, timing, row count, buffers). Warning badges immediately signal potential issues (wrong estimates, inefficient scans).

For performance regressions, VictoriaLogs allows comparing plans before/after by filtering by time period (_time:[...]), revealing changes in PostgreSQL planner strategy.

💭 Final Thoughts

We've built in this article a complete PostgreSQL performance analysis system that combines metrics (pg_stat_statements), execution plans (auto_explain), and visualization (pev2). The key to this approach lies in correlation via query_id: from a Grafana dashboard showing a slow query, a few clicks are enough to navigate to its execution plan visualized in pev2, enabling performance analysis and optimization.

This is, once again, a demonstration of the power of available open source tools. CloudNativePG with added extensions, VictoriaMetrics and VictoriaLogs efficiently store metrics and logs, Vector parses and structures data, and Grafana offers unified visualization. This Kubernetes-native approach is portable and gives complete control.

Furthermore the App and SQLInstance Crossplane compositions allow to enable Performance Insights only by setting performanceInsights.enabled: true with a few tuning parameters (sampleRate, minDuration). Developers don't need to understand PostgreSQL internals or Vector—the platform masks the complexity. This same declarative API deploys both a complete database and a static web tool like pev2, demonstrating the consistency of the abstraction level.

The cloud-native-ref project brings all these pieces together and shows how Gateway API, Tailscale, Crossplane/KCL, and the VictoriaMetrics ecosystem assemble to create a complete observability platform.

Performance Consideration

Enabling Performance Insights involves an estimated overhead of 3-4% CPU and ~200-250MB memory with default values:

pg_stat_statements: ~1% CPU, 50-100MB RAM
auto_explain (sample_rate=0.2, log_timing=off): ~1% CPU, 50-100MB RAM
Vector parsing: <1% CPU, ~128MB RAM

The default values (sampleRate: 0.2, minDuration: 1000) are suitable for production. Adjust according to your needs:

High-traffic production: sampleRate: 0.1 (10%) + minDuration: 3000 (>3s) — reduce overhead to ~2-3% CPU
Debugging/Development: sampleRate: 1.0 (100%) + minDuration: 0 + log_timing: on — ~5-7% CPU overhead for maximum capture
Standard production: default values (sampleRate: 0.2, minDuration: 1000) — excellent balance at 3-4% CPU

The sampleRate, log_timing, and logStatement parameters allow fine-tuning performance impact.

🔖 References

CloudNativePG Documentation

PostgreSQL Extensions and Features

pg_stat_statements - Query statistics extension
auto_explain - Automatic execution plan logging
compute_query_id - Query identifier generation
EXPLAIN Documentation - Understanding query plans

VictoriaMetrics Ecosystem

Query Plan Visualization

pev2 (PostgreSQL Explain Visualizer 2) - Interactive execution plan visualization
pev2 Docker Image - Self-hosted deployment

Configuration and Implementation

Vector VRL Configuration - PostgreSQL log parsing pipeline
CloudNativePG Composition - SQLInstance abstraction with KCL
PostgreSQL Monitoring Architecture - Complete architecture documentation

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Ogenki

📊 Included with CloudNativePG

Collecting Metrics with VictoriaMetrics

Main Metrics to Monitor

Visualization with Grafana

🔍 Understanding Query Performance

✨ Query Plan History: Implementation with Open Source Tools

The Architecture

Enabling Performance Insights

🏗️ Platform Engineering: The Right Level of Abstraction

Understanding the Configuration

Vector Log Pipeline Configuration

3-Step Pipeline

The Key: Correlation via query_id

🔬 Analyzing Query Performance in Action

Understanding EXPLAIN Output

🎨 Visualizing Execution Plans with pev2

Analyzing with Grafana

Workflow: From Grafana to pev2

Leveraging pev2

💭 Final Thoughts

🔖 References