


























How are our applications performing? 👁️
Once our application is deployed, it is essential to have indicators that help identify potential issues and track performance changes. Among these sources of information, metrics and logs play an essential role by providing valuable insights into the application's operation. Additionally, it is often useful to implement detailed tracing to accurately track all actions performed within the application.
In this series of blog posts, we will explore the various areas of application monitoring. The goal is to thoroughly analyze the state of our applications, in order to improve their availability and performance, while ensuring an optimal user experience.
In a previous blog post, we've seen how to collect and visualize metrics. These metrics allow us to analyze our applications' behavior and performance. It's also crucial to configure alerts to be notified of misbehaviours on our platform.
Here we assume you already have:
Setting up relevant alerts is an essential part of any observability strategy. However, defining appropriate thresholds and avoiding alert fatigue requires a thoughtful and methodical approach.
We'll see in this article that it's very easy to set thresholds beyond which we would be notified. However, making these alerts relevant isn't always straightforward.

A properly configured alert allows us to identify and resolve problems within our system proactively, before the situation becomes worse. Effective alerts should:
Therefore, it's important to focus on a controlled number of metrics to monitor. There are approaches that allow us to implement effective monitoring of our systems. Here we'll focus on two widely used alert models: Core Web Vitals and Golden Signals.
Core Web Vitals are metrics developed by Google to evaluate the user experience on web applications. They highlight metrics related to end-user satisfaction and help ensure our application offers good performance for real users. These metrics focus on three main aspects:

Largest Contentful Paint (LCP), Page Load Time: LCP measures the time needed for the largest visible content element on a web page (for example, an image, video, or large text block) to be fully rendered in the web browser. A good LCP is below 2.5 seconds.
Interaction to Next Paint (INP), Responsiveness: INP evaluates a web page's responsiveness by measuring the latency of user interactions such as clicks, taps, and keyboard inputs. It reflects the time needed for a page to visually respond to an interaction, that is, the delay before the browser displays the next render after a user action. A good INP is 200 milliseconds or less.
Cumulative Layout Shift (CLS), Visual Stability: CLS evaluates visual stability by quantifying unexpected layout shifts on a page, when elements move during loading or interaction. A good CLS score is less than or equal to 0.1.
A website's performance is considered satisfactory if it reaches the thresholds described above at the 75th percentile, thus favoring a good user experience and, consequently, better retention and search engine optimization (SEO).
Be Careful with Core Web Vitals Alerts
Adding specific alerts for these metrics requires careful consideration. Unlike classic metrics, such as availability or error rates, which directly reflect system stability, Web Vitals depend on many external factors, such as users' network conditions or their devices, making thresholds more complex to monitor effectively.
To avoid unnecessary alert overload, these alerts should only target significant degradations. For example, a sudden increase in CLS (visual stability) or a continuous deterioration of LCP (load time) over several days might indicate important problems requiring intervention.
Finally, these alerts require appropriate tools, such as RUM (Real User Monitoring) for real data or Synthetic Monitoring for simulated tests, which require a specific solution not covered in this article.
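As an illustration only: if a RUM tool already exported LCP as a Prometheus-style histogram, a rule targeting sustained degradation rather than short spikes could look like the sketch below. The metric name `web_vitals_lcp_seconds_bucket`, the one-hour rate window, and the one-day duration are assumptions to adapt, not something a specific tool is known to provide. The rule is written in the Prometheus-compatible rule format covered later in this article.

```yaml
# Sketch under stated assumptions: the metric name is hypothetical.
groups:
  - name: core-web-vitals
    rules:
      - alert: LargestContentfulPaintDegraded
        # 75th percentile of LCP above the 2.5s "good" threshold.
        expr: |
          histogram_quantile(
            0.75,
            sum(rate(web_vitals_lcp_seconds_bucket[1h])) by (le)
          ) > 2.5
        # Only fire when the degradation lasts, to avoid noise
        # caused by individual users' networks or devices.
        for: 1d
        labels:
          severity: info
```

A long `for` duration and a low severity keep this kind of alert informative rather than disruptive.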

The Golden Signals are a set of four key metrics, widely used in the field of system and application monitoring, particularly with tools like Prometheus. These signals allow effective monitoring of application health and performance. They are particularly appropriate in the context of a distributed architecture:
Latency ⏳: The time taken to serve a request, measured for both successful and failed requests. Latency is crucial because an increase in response time can indicate performance problems.
Traffic 📶: It can be measured in terms of requests per second, data throughput, or other metrics that express system load.
Errors ❌: This is the failure rate of requests or transactions. This can include application errors, infrastructure errors, or any situation where a request didn't complete correctly (for example, HTTP 5xx responses or rejected requests).
Saturation 📈: This is a measure of system resource usage, such as CPU, memory, or network bandwidth. Saturation indicates how close the system is to its limits. A saturated system can lead to slowdowns or failures.
These Golden Signals are essential because they allow us to focus monitoring on critical aspects that can quickly affect user experience or overall system performance. With Prometheus, these signals are often monitored via specific metrics to trigger alerts when certain thresholds are exceeded.
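To make this more concrete, here is a minimal sketch of one alert per Golden Signal, written as a Prometheus-compatible rule group. It assumes the `nginx_http_requests_total` counter used later in this article, a hypothetical latency histogram named `http_request_duration_seconds`, and the memory metrics exposed by node_exporter. All thresholds are placeholders to tune for your own system.

```yaml
# Sketch only: metric names and thresholds are assumptions.
groups:
  - name: golden-signals
    rules:
      # Latency: 95th percentile above 500ms.
      - alert: HighRequestLatency
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
      # Traffic: request rate below half of the rate one day earlier.
      - alert: TrafficDrop
        expr: |
          sum(rate(nginx_http_requests_total[5m]))
            < 0.5 * sum(rate(nginx_http_requests_total[5m] offset 1d))
        for: 15m
        labels:
          severity: warning
      # Errors: more than 5% of requests end with a 5xx status.
      - alert: HighErrorRatio
        expr: |
          sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(nginx_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      # Saturation: more than 90% of node memory in use.
      - alert: NodeMemorySaturation
        expr: |
          1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
```

Alerting on an error ratio rather than a raw error count keeps the threshold meaningful when traffic varies.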
Other Methods and Metrics
I've mentioned two methodologies here that I find to be a good starting point for optimizing our alerting system. That said, others exist, each with its own specifics, such as USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration).
Similarly, beyond the Core Web Vitals presented above, other web metrics like FCP (First Contentful Paint) or TTFB (Time To First Byte) can prove useful depending on your specific needs.
The main thing is to keep in mind that a good alerting strategy relies on a targeted set of relevant metrics 🎯
You got it: Defining alerts requires thought! Now let's get practical and see how to define thresholds from our metrics.
Metrics collected with Prometheus can be queried using a specific language called PromQL (Prometheus Query Language). This language allows extracting monitoring data, performing calculations, aggregating results, applying filters, and also configuring alerts.
(ℹ️ Refer to the previous article to understand what we mean by metric.)
PromQL is a powerful language. Here are some simple examples applied to metrics exposed by an Nginx web server:
Total number of processed requests (nginx_http_requests_total) - returns the total count since server start:
nginx_http_requests_total
Request rate over a 5-minute window - calculates requests per second:
rate(nginx_http_requests_total[5m])
Error rate - calculates 5xx errors per second over the last 5 minutes:
rate(nginx_http_requests_total{status=~"5.."}[5m])
Request rate by pod - calculates requests/sec for each pod in namespace "myns":
sum(rate(nginx_http_requests_total{namespace="myns"}[5m])) by (pod)
💡 In the examples above, we made use of two Golden Signals: traffic 📶 and errors ❌.
MetricsQL is the language used with VictoriaMetrics. It aims to be compatible with PromQL with slight differences that make it easier to write complex queries.
It also brings new functions. Here are some examples:
histogram(q): This function calculates a histogram for each group of points having the same timestamp, which is useful for visualizing a large number of time series via a heatmap.
To create a histogram of HTTP requests:
histogram(rate(vm_http_requests_total[5m]))
quantiles("phiLabel", phi1, ..., phiN, q): Used to extract multiple quantiles (or percentiles) from a given metric.
To calculate the 50th, 90th, and 99th percentiles of HTTP request rate:
quantiles("percentile", 0.5, 0.9, 0.99, rate(vm_http_requests_total[5m]))
To test your queries, you can use the demo provided by VictoriaMetrics: https://play.victoriametrics.com

VictoriaMetrics offers two essential components for alert management:
VMAlert is the component that continuously evaluates defined alert rules. It supports two types of rules:
Recording Rules 📊 Recording rules allow pre-calculating complex PromQL expressions and storing them as new metrics to optimize performance.
Alerting Rules 🚨 Alerting rules define conditions that trigger alerts when certain thresholds are exceeded.
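As a rough sketch of how the two rule types can work together, the group below pre-computes a per-namespace error ratio with recording rules, then alerts on the recorded series. The group name, the recorded metric names, and the 5% threshold are illustrative assumptions. The source metric is the `nginx_http_requests_total` counter used earlier.

```yaml
# Sketch only: names and thresholds are illustrative.
groups:
  - name: nginx-recording
    interval: 1m
    rules:
      # Recording rule: store the per-namespace request rate as a new metric.
      - record: namespace:nginx_http_requests:rate5m
        expr: |
          sum(rate(nginx_http_requests_total[5m])) by (namespace)
      # Recording rule: store the per-namespace ratio of 5xx responses.
      - record: namespace:nginx_http_requests_errors:ratio_rate5m
        expr: |
          sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) by (namespace)
            / sum(rate(nginx_http_requests_total[5m])) by (namespace)
      # Alerting rule: reuse the pre-computed ratio instead of the full expression.
      - alert: NginxHighErrorRatio
        expr: |
          namespace:nginx_http_requests_errors:ratio_rate5m > 0.05
        for: 5m
        labels:
          severity: warning
```

The alert expression stays short, and dashboards can reuse the same recorded series.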
In this blog post, we'll focus on alerting rules which are essential for proactive problem detection.
Concrete Examples
The rest of this article comes from a set of configurations you can find in the Cloud Native Ref repository. This project aims to quickly start a complete platform that applies best practices in terms of automation, monitoring, security, etc.
VMRule
We've seen previously that VictoriaMetrics provides a Kubernetes operator that allows managing different components declaratively. Among the available custom resources, VMRule allows defining alerts and recording rules.
If you've already used the Prometheus operator, you'll find a very similar syntax, as the VictoriaMetrics operator is compatible with Prometheus custom resources. (This makes migration easy 😉).
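To illustrate that compatibility, here is a minimal sketch of a rule written as a Prometheus operator `PrometheusRule`. The `spec.groups` section uses the same structure as a VMRule, and the VictoriaMetrics operator can convert such objects into VMRule resources when its Prometheus converter is enabled. The resource name, namespace, and threshold are hypothetical.

```yaml
# Sketch only: name, namespace, and threshold are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nginx-errors
  namespace: observability
spec:
  groups:
    - name: nginx
      rules:
        - alert: NginxServerErrors
          # More than one 5xx response per second for a given pod.
          expr: |
            sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) by (pod) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "Pod {{ $labels.pod }} is returning more than one 5xx response per second."
```

Switching `apiVersion` and `kind` to the VMRule equivalents is typically all that changes when writing the rule natively.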
Let's take a concrete example with a VMRule that monitors the health state of Flux resources:
flux/observability/vmrule.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  labels:
    prometheus-instance: main
  name: flux-system
  namespace: flux-system
spec:
  groups:
    - name: flux-system
      rules:
        - alert: FluxReconciliationFailure
          annotations:
            message: Flux resource has been unhealthy for more than 10m
            description: "{{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} reconciliation has been failing for more than ten minutes."
            runbook_url: "https://fluxcd.io/flux/cheatsheets/troubleshooting/"
            dashboard: "https://grafana.priv.${domain_name}/dashboards"
          expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind) + on(exported_namespace, name, kind) (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
          for: 10m
          labels:
            severity: warning
It's recommended to follow some best practices to provide maximum context for quickly identifying the root cause.
Naming and Organization 📝
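Use explicit alert names that describe the problem, such as FluxReconciliationFailure
Group rules by component (for example flux-system, flux-controllers)
Thresholds and Durations ⏱️
Set a waiting period such as for: 10m to avoid false positives
Labels and Routing 🏷️
Add labels used for routing, for example a team label to route to the right team, or have different routing policies depending on the environment.
labels:
  severity: critical   # or warning, info
  team: sre            # or dev, ops
  environment: prod    # or staging, dev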
The Importance of Annotations 📚
Annotations allow adding various information about the alert context.
expr: |
  max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind)
  + on(exported_namespace, name, kind)
  (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
The gotk_reconcile_condition metric exposes the health state of Flux resources
status="False",type="Ready" identifies resources that aren't in the "Ready" state
The second term (status="Deleted") detects resources that have been deleted
+ on(...) (...) * 2 == 1 combines these conditions to trigger an alert when a resource is not ready and has not been deleted: the sum equals 1 only when the first term is 1 and the second is 0
max and by allow grouping alerts by namespace, name, and resource type
We can send these alerts through different channels or tools, such as Grafana OnCall, Opsgenie, PagerDuty, or simply email.
In our example, we're sending notifications to a Slack channel. We'll first create a Slack application and retrieve the generated token before configuring VictoriaMetrics.
Application Creation 🔧
Permission Configuration 🔑 In "OAuth & Permissions", add the following scopes:
chat:write (Required)
chat:write.public (For posting in public channels)
channels:read (For listing channels)
groups:read (For private groups)
Retrieve the bot token (it starts with xoxb-)
The rest of the configuration is done using Helm values to configure AlertManager.
observability/base/victoria-metrics-k8s-stack/vm-common-helm-values-configmap.yaml
alertmanager:
  enabled: true
  spec:
    externalURL: "https://vmalertmanager-${cluster_name}.priv.${domain_name}"
    secrets:
      - "victoria-metrics-k8s-stack-alertmanager-slack-app"
  config:
    global:
      slack_api_url: "https://slack.com/api/chat.postMessage"
      http_config:
        authorization:
          credentials_file: /etc/vm/secrets/victoria-metrics-k8s-stack-alertmanager-slack-app/token
The External Secrets Operator retrieves the Slack token from AWS Secrets Manager and stores it in a Kubernetes secret named victoria-metrics-k8s-stack-alertmanager-slack-app. This secret is then referenced in the Helm values to configure AlertManager's authentication (config.global.http_config.authorization.credentials_file).
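For readers who have not used the External Secrets Operator, the sketch below shows roughly what such a resource could look like. The namespace, the store name, and the AWS Secrets Manager entry name are hypothetical, and the `apiVersion` depends on the operator release you run. The only parts tied to the configuration above are the target secret name and the `token` key, which becomes the file name read by `credentials_file`.

```yaml
# Sketch only: namespace, store name, and remote key are assumptions.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: victoria-metrics-k8s-stack-alertmanager-slack-app
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: clustersecretstore
  target:
    # Name of the Kubernetes secret referenced in alertmanager.spec.secrets.
    name: victoria-metrics-k8s-stack-alertmanager-slack-app
    creationPolicy: Owner
  data:
    # Mounted as /etc/vm/secrets/<secret name>/token.
    - secretKey: token
      remoteRef:
        key: observability/alertmanager/slack-app
        # Assumes the remote secret is a JSON object with a "token" field.
        property: token
```

Keeping the token out of the Helm values means it never has to be committed to the Git repository.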
# Under alertmanager.config
route:
  group_by:
    - cluster
    - alertname
    - severity
    - namespace
  group_interval: 5m
  group_wait: 30s
  repeat_interval: 3h
  receiver: "slack-monitoring"
  routes:
    - matchers:
        - alertname =~ "InfoInhibitor|Watchdog|KubeCPUOvercommit"
      receiver: "blackhole"
receivers:
  - name: "blackhole"
  - name: "slack-monitoring"
Alert Grouping: Alert grouping is important to reduce noise and improve notification readability. Without grouping, each alert would be sent individually, which could quickly become unmanageable. The chosen grouping criteria allow logical organization:
group_by defines the labels to group alerts by
group_wait: 30s delay before initial notification to allow grouping
group_interval: 5m interval between notifications for the same group
repeat_interval: Alerts are only repeated every 3h to avoid spam
Receivers: Receivers are AlertManager components that define how and where to send alert notifications. They can be configured for different communication channels like Slack, Email, PagerDuty, etc. In our configuration:
slack-monitoring: Main receiver that sends alerts to a specific Slack channel with custom formatting
blackhole: Special receiver that "absorbs" alerts without transmitting them anywhere, useful for filtering non-relevant or purely technical alerts
Routing Example
Alert routing can be customized based on your team structure and needs. Here's a practical example:
Let's say your organization has an on-call team that needs to be notified immediately about urgent issues. You can route alerts to them when:
- matchers:
    - environment =~ "prod|security"
    - team = "oncall"
  receiver: "pagerduty"
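The route above points to a receiver named pagerduty, which also has to be declared in the receivers list. A minimal sketch, assuming the PagerDuty integration key is stored in a hypothetical Kubernetes secret named alertmanager-pagerduty that is mounted the same way as the Slack token (that is, listed in alertmanager.spec.secrets):

```yaml
# Sketch only: the secret name and key are assumptions.
receivers:
  - name: "pagerduty"
    pagerduty_configs:
      # Read the Events API v2 integration key from a mounted file
      # instead of writing it in the configuration.
      - routing_key_file: /etc/vm/secrets/alertmanager-pagerduty/routing-key
        # Also notify when the alert is resolved.
        send_resolved: true
```

Every matcher in a route must match for it to apply, so only alerts carrying both the expected environment and team labels reach this receiver.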
The configuration block below defines a Slack receiver for AlertManager that uses Monzo templates. Monzo templates are a set of notification templates that allow formatting Slack alerts in an elegant and informative way.
alertmanager:
  config:
    receivers:
      - name: "slack-monitoring"
        slack_configs:
          - channel: "#alerts"
            send_resolved: true
            title: '{{ template "slack.monzo.title" . }}'
            icon_emoji: '{{ template "slack.monzo.icon_emoji" . }}'
            color: '{{ template "slack.monzo.color" . }}'
            text: '{{ template "slack.monzo.text" . }}'
            actions:
              - type: button
                text: "Runbook :green_book:"
                url: "{{ (index .Alerts 0).Annotations.runbook_url }}"
              - type: button
                text: "Query :mag:"
                url: "{{ (index .Alerts 0).GeneratorURL }}"
              - type: button
                text: "Dashboard :grafana:"
                url: "{{ (index .Alerts 0).Annotations.dashboard }}"
              - type: button
                text: "Silence :no_bell:"
                url: '{{ template "__alert_silence_link" . }}'
              - type: button
                text: '{{ template "slack.monzo.link_button_text" . }}'
                url: "{{ .CommonAnnotations.link_url }}"
The notification format shown below demonstrates how alerts can be enriched with interactive elements. Users can quickly access relevant information through action buttons that link to the Grafana dashboard 📊, view the associated runbook 📚, or silence noisy alerts 🔕 when needed.

VictoriaMetrics and its ecosystem provide multiple interfaces for managing and viewing alerts. Here are the main options available:
Alertmanager is the standard component that allows:

VMUI offers a simplified interface for:

Although we use Alertmanager for alert definition and routing, Grafana Alerting offers a complete alternative solution that allows:

Choosing the Right Interface
The choice of interface depends on your specific needs:
Defining relevant alerts is a key element of any observability strategy. The VictoriaMetrics operator, with its Kubernetes custom resources like VMRule, greatly simplifies setting up an effective alerting system. Declarative configuration allows quickly defining complex alert rules while maintaining excellent code readability and maintainability.
However, the technical configuration of alerts, even with powerful tools like VictoriaMetrics, isn't sufficient on its own. An effective alerting strategy must integrate into a broader organizational framework:
Going Further 🚀
Discover how to integrate these alerts with other components of your observability stack in upcoming articles in this series, particularly correlation with logs and distributed tracing.