


























Having basic observability when running PostgreSQL in production isn't optional. Whether it's tracking the Golden Signals, monitoring slow queries, or analyzing connection patterns, this often requires plumbing multiple tools together, configuring various exporters, and manually extracting queries for analysis.
❓ What if we could achieve complete production-grade monitoring with minimal configuration?
In this article, we'll explore how to enhance CloudNativePG's already robust monitoring with simple, effective, and easy-to-use query performance analysis capabilities—leveraging the power of tools at our disposal: Vector, VictoriaMetrics, and VictoriaLogs.
When you deploy a PostgreSQL cluster with CNPG, the operator exposes comprehensive metrics via a dedicated endpoint on each PostgreSQL instance (port 9187). While Grafana dashboards are not included by default, CloudNativePG provides official dashboard templates that can be deployed declaratively.
The metrics exposed include:
GitOps and Kubernetes Operators
![]() | The examples in this article come from configurations available in the Cloud Native Ref repository. This project aims to quickly bootstrap a complete platform that follows best practices in automation, monitoring, security, and more. |
CloudNativePG automatically exposes metrics on each PostgreSQL pod. Collecting them is enabled by setting this Helm value:
1# CloudNativePG Helm chart
2monitoring:
3 podMonitorEnabled: true
This simple configuration creates a PodMonitor (Prometheus Operator resource) that is automatically converted by the VictoriaMetrics operator into a compatible native resource. Metrics from all PostgreSQL pods (primary and replicas) are thus collected and available in VictoriaMetrics.
Prometheus Compatibility
The VictoriaMetrics operator automatically converts Prometheus Operator resources (PodMonitor, ServiceMonitor, etc.) into their VictoriaMetrics equivalents. This transparent conversion allows using CloudNativePG without modification, while benefiting from VictoriaMetrics as the storage backend.
CloudNativePG exposes metrics that align perfectly with the Golden Signals methodology discussed in previous articles:
Latency ⏳
1# Average query duration
2rate(cnpg_backends_total_seconds_sum[5m]) / rate(cnpg_backends_total_seconds_count[5m])
Traffic 📶
1# Transactions per second
2rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])
Errors ❌
1# Connection failures and deadlocks
2rate(pg_stat_database_deadlocks[5m])
3rate(cnpg_pg_postmaster_start_time_seconds[5m])
Saturation 📈
1# Connection pool usage
2cnpg_backends_total / cnpg_pg_settings_max_connections
3
4# Cache hit ratio (should be > 95%)
5sum(rate(pg_stat_database_blks_hit[5m])) /
6 (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m])))
Using the Grafana Operator explored in previous articles, we can deploy CNPG dashboards declaratively:
1apiVersion: grafana.integreatly.org/v1beta1
2kind: GrafanaDashboard
3metadata:
4 name: databases-cloudnative-pg
5spec:
6 allowCrossNamespaceImport: true
7 folderRef: "databases"
8 datasources:
9 - inputName: "DS_PROMETHEUS"
10 datasourceName: "VictoriaMetrics"
11 instanceSelector:
12 matchLabels:
13 dashboards: "grafana"
14 url: "https://grafana.com/api/dashboards/20417/revisions/4/download"
The dashboard provides detailed views of our PostgreSQL clusters, including replication lag, query performance, and resource utilization.

While metrics give us the "what" and "when," they don't always tell us the "why." This is where query performance analysis becomes essential. Knowing that queries are slow is useful; understanding why they're slow often leads us to optimization opportunities.
Traditional PostgreSQL query analysis requires manually running EXPLAIN and EXPLAIN ANALYZE commands. While powerful, this approach has limitations:
Ideally, we need automatic and continuous query plan capture that allows us to:
This is exactly what some managed solutions offer. But can we achieve the same on Kubernetes?
The good news is that we can build a sophisticated query performance monitoring system that rivals commercial offerings with minimal configuration.

Our solution leverages two PostgreSQL extensions and a configuration parameter, integrating them with the VictoriaMetrics ecosystem:
PostgreSQL Extensions:
PostgreSQL Configuration:
Observability Stack:
The key is the link between all these elements: the correlation between metrics and logs using the query identifier. This allows us to:
I've called this feature "Performance Insights". Any resemblance to an existing solution would be purely coincidental 😆.
Thanks to CloudNativePG's "Managed Extensions" feature, enabling PostgreSQL extensons is very easy.
One of the key principles of platform engineering is providing the right level of abstraction to application developers. They shouldn't need to understand PostgreSQL internals or memorize 15+ PostgreSQL-specific configuration parameters.
This is where Crossplane compositions shine. In the Cloud Native Ref project, we use Crossplane with KCL (Kubernetes Configuration Language) to create a higher-level abstraction called SQLInstance.
Without Composition (Raw CNPG Cluster):
1apiVersion: postgresql.cnpg.io/v1
2kind: Cluster
3metadata:
4 name: myapp-postgres
5spec:
6 instances: 3
7 postgresql:
8 shared_preload_libraries:
9 - pg_stat_statements
10 - auto_explain
11 parameters:
12 pg_stat_statements.max: "10000"
13 pg_stat_statements.track: all
14 pg_stat_statements.track_utility: "on"
15 pg_stat_statements.track_planning: "on"
16 pg_stat_statements.save: "on"
17 auto_explain.log_format: json
18 auto_explain.log_min_duration: "1000"
19 auto_explain.log_analyze: "on"
20 auto_explain.log_buffers: "on"
21 auto_explain.log_timing: "off" # Optimization to reduce overhead (PostgreSQL default: on)
22 auto_explain.log_triggers: "on"
23 auto_explain.log_verbose: "on"
24 auto_explain.log_nested_statements: "on"
25 auto_explain.sample_rate: "0.2"
26 compute_query_id: on
27 track_activity_query_size: 2048
28 track_io_timing: "on"
29 log_min_duration_statement: 1000
30 # ... and more
With Composition (Platform Engineering Approach):
1apiVersion: cloud.ogenki.io/v1alpha1
2kind: App
3metadata:
4 name: myapp
5spec:
6 sqlInstance:
7 enabled: true
8 size: small
9 storageSize: 20Gi
10 instances: 3
11 performanceInsights:
12 enabled: true
13 explain:
14 sampleRate: 0.2 # 20% sampling (default: safe for production)
15 minDuration: 1000 # Log queries > 1 second (default)
16 logStatement: none # Optional: none (default) / ddl / mod / all
The Crossplane composition SQLInstance handles all the complexity.
This composition approach provides several benefits:
performanceInsights: true) rather than memorizing parameter namesPlatform Engineering Principle
The best abstractions hide complexity without limiting power. Developers get performance insights with simple parameters, while the platform team is still able to fine-tune the underlying PostgreSQL configuration for advanced use cases.
Let's break down what each component does:
pg_stat_statements: This extension tracks execution statistics for all SQL statements executed by a server. It records:
auto_explain: Automatically logs execution plans for queries exceeding a duration threshold. Key parameters include:
log_format: json: Structured output for parsinglog_min_duration: 1000: Capture queries taking more than 1 second (default)log_analyze: on: Include actual row counts (includes ANALYZE data from the actual execution)sample_rate: 0.2: Sample 20% of slow queries to reduce overhead (composition default; PostgreSQL default is 1.0)compute_query_id: The correlation key that ties everything together. This generates a unique identifier for each query that appears in both pg_stat_statements metrics and auto_explain logs.
Composition Default Values
The SQLInstance composition sets production-safe defaults (different from PostgreSQL defaults):
sampleRate: 0.2 → 20% sampling (PostgreSQL default: 1.0 = 100%)minDuration: 1000ms → Queries > 1 second (PostgreSQL default: -1 = disabled)For debugging, you can override these:
sampleRate: 1.0 → 100% of slow queriesminDuration: 0 → All queries, even the fastest onesHere's what Vector does concretely - transforming a PostgreSQL auto_explain log into an indexable event:
| Raw Log (CloudNativePG) | After Vector Parsing |
|---|---|
{ | { |
The Vector pipeline consists of 3 transforms and 2 sinks (complete configuration):
1. Parse CloudNativePG JSON logs
1if .kubernetes.container_name == "postgres" && exists(.kubernetes.pod_labels."cnpg.io/cluster") {
2 .log = parse_json(.message)
3}
2. Filter for execution plans
1exists(.log.record.message) && contains(.log.record.message, "plan:")
3. Extract metadata and plan
1.query_id = to_string!(.log.record.query_id) # Correlation key
2.cluster_name = .kubernetes.pod_labels."cnpg.io/cluster"
3.database = .log.record.database_name
4# Parse plan JSON from "duration: X ms plan: {...}"
5.plan_json = parse_json(split(.log.record.message, "plan:")[1])
Parsed events are sent to two sinks:
cluster_name,namespace,database,query_idThe critical element is query_id which appears in both systems:
pg_stat_statements{queryid="8765432109876543210"} (metrics){query_id="8765432109876543210"} (plans)This correlation allows instant jumping from a performance metric to execution plan history.
Once we've identified a problematic query, we can view its execution plan history in VictoriaLogs. Using the query_id from pg_stat_statements, we can query VictoriaLogs:
1# Find all execution plans for a specific query
2{cluster_name="myapp-postgres", query_id="1234567890"} | limit 50
This shows us:
When auto_explain captures a plan, it provides detailed information:
1{
2 "Query Text": "SELECT * FROM users WHERE email = ?",
3 "Query Identifier": 1234567890,
4 "Duration": 1567.234,
5 "Plan": {
6 "Node Type": "Seq Scan",
7 "Relation Name": "users",
8 "Actual Rows": 1,
9 "Actual Loops": 1,
10 "Actual Total Time": 1567.123,
11 "Shared Hit Blocks": 0,
12 "Shared Read Blocks": 54321
13 }
14}
Key insights from this plan:
This immediately suggests the need for an index on the email column.
Understanding complex execution plans from logs can be challenging. This is where pev2 (PostgreSQL Explain Visualizer 2) becomes very useful. It's a web tool that transforms JSON execution plans into interactive, visual diagrams.
To ensure sensitive query data never leaves the network, pev2 is self-hosted in the cluster via the App composition. This once again demonstrates the platform abstraction level: deploying a static web tool uses the same declarative API as a complete application with a database.
1apiVersion: cloud.ogenki.io/v1alpha1
2kind: App
3metadata:
4 name: xplane-pev2
5 namespace: tooling
6spec:
7 image:
8 repository: ghcr.io/smana/pev2
9 tag: "v1.17.0"
10
11 resources:
12 requests:
13 cpu: "10m"
14 memory: "32Mi"
15 limits:
16 cpu: "300m"
17 memory: "128Mi"
18
19 # Accessible only via Tailscale VPN at: https://pev2.priv.cloud.ogenki.io
20 route:
21 enabled: true
22 hostname: "pev2" # Results in: pev2.priv.cloud.ogenki.io
Grafana integration allows quick identification of problematic queries and navigation to their execution plans.
Performance Analysis Dashboard

This dashboard displays key metrics from pg_stat_statements: top queries by total duration, average latency, number of calls. Each query_id is clickable to explore details.
Correlation Dashboard

This dashboard correlates metrics (VictoriaMetrics) with execution plans (VictoriaLogs) for a specific query. It shows performance evolution and plan changes over time.
The video below shows the complete investigation workflow: from identifying a slow query in Grafana to visual plan analysis with pev2.
Workflow steps:
pg_stat_statements metrics)https://pev2.priv.cloud.ogenki.io)
pev2 transforms JSON plans into interactive diagrams that instantly reveal:
The interactive interface allows clicking on each node to view details (cost, timing, row count, buffers). Warning badges immediately signal potential issues (wrong estimates, inefficient scans).
For performance regressions, VictoriaLogs allows comparing plans before/after by filtering by time period (_time:[...]), revealing changes in PostgreSQL planner strategy.
We've built in this article a complete PostgreSQL performance analysis system that combines metrics (pg_stat_statements), execution plans (auto_explain), and visualization (pev2). The key to this approach lies in correlation via query_id: from a Grafana dashboard showing a slow query, a few clicks are enough to navigate to its execution plan visualized in pev2, enabling performance analysis and optimization.
This is, once again, a demonstration of the power of available open source tools. CloudNativePG with added extensions, VictoriaMetrics and VictoriaLogs efficiently store metrics and logs, Vector parses and structures data, and Grafana offers unified visualization. This Kubernetes-native approach is portable and gives complete control.
Furthermore the App and SQLInstance Crossplane compositions allow to enable Performance Insights only by setting performanceInsights.enabled: true with a few tuning parameters (sampleRate, minDuration). Developers don't need to understand PostgreSQL internals or Vector—the platform masks the complexity. This same declarative API deploys both a complete database and a static web tool like pev2, demonstrating the consistency of the abstraction level.
The cloud-native-ref project brings all these pieces together and shows how Gateway API, Tailscale, Crossplane/KCL, and the VictoriaMetrics ecosystem assemble to create a complete observability platform.
Performance Consideration
Enabling Performance Insights involves an estimated overhead of 3-4% CPU and ~200-250MB memory with default values:
pg_stat_statements: ~1% CPU, 50-100MB RAMauto_explain (sample_rate=0.2, log_timing=off): ~1% CPU, 50-100MB RAMVector parsing: <1% CPU, ~128MB RAMThe default values (sampleRate: 0.2, minDuration: 1000) are suitable for production. Adjust according to your needs:
sampleRate: 0.1 (10%) + minDuration: 3000 (>3s) — reduce overhead to ~2-3% CPUsampleRate: 1.0 (100%) + minDuration: 0 + log_timing: on — ~5-7% CPU overhead for maximum capturesampleRate: 0.2, minDuration: 1000) — excellent balance at 3-4% CPUThe sampleRate, log_timing, and logStatement parameters allow fine-tuning performance impact.
CloudNativePG Documentation
PostgreSQL Extensions and Features
VictoriaMetrics Ecosystem
Query Plan Visualization
Configuration and Implementation
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。