Alerting Best Practices
Roman Khavronenko / Mathias Palmersheim · 2025-08-22 · via VictoriaMetrics: Simple & Reliable Monitoring for Everyone on VictoriaMetrics

A firing alert is like someone ringing your doorbell - it demands your immediate attention, interrupting whatever else you’re doing. It requires focus and a quick response.

But imagine trying to live in an apartment where the doorbell never stops ringing. You could put in earplugs to block the noise, but that only masks the problem - it doesn’t solve it.

On the other hand, disconnecting the doorbell entirely isn’t a solution either. You still want to know when your food or a package arrives.

A doorbell that’s always silent or always ringing is equally useless. The goal is to find the right balance - distinguishing between what truly matters and what doesn’t.

Every alert should be actionable

If you’re receiving alert notifications and consistently ignoring them, then those alerts shouldn’t have been triggered in the first place. Why go through the trouble of setting a “trap” only to ignore it when it springs?

As engineers, we rarely appreciate the work automated alerting does. It tirelessly checks the conditions we asked it to check, day and night, only for us to get upset when it sends us notifications.

Imagine asking a colleague to monitor a server and let you know if something breaks. You gave clear instructions, and when they follow through - you ignore them. That colleague wouldn’t stay motivated for long.

So if you find yourself drowning in alerts or simply tuning them out, it’s a signal in itself: something needs to change. It’s time to take action.

For the basics of alerting in the VictoriaMetrics ecosystem, please read the outstanding article Prometheus Alerting 101: Rules, Recording Rules, and Alertmanager by Phuong Le. The rest of this article is dedicated to practical tips for improving the alerting experience.

Defining an alerting rule

An alerting rule consists of multiple fields. Let’s start with the most important ones:

[Figure: Defining alert]

Here, we define the alert RemoteWriteConnectionIsSaturated, which should notify us when the metrics collector is unable to push data fast enough.

The alerting rule name should be descriptive, as it’s the first thing an on-call engineer will see. It should convey a basic understanding of the issue at a glance, before the engineer even reads the rest of the alert message.
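
As a minimal sketch, a rule of this shape could look as follows. It is modeled on the vmagent alerts published with VictoriaMetrics, but the group name, the metric names, and the 0.9 threshold are assumptions for illustration, so adjust them to your setup:

```yaml
groups:
  - name: vmagent   # hypothetical group name
    rules:
      - alert: RemoteWriteConnectionIsSaturated
        # Share of time a remote write queue spends sending data.
        # Values close to 1 mean the connection can't keep up.
        expr: |
          (
            rate(vmagent_remotewrite_send_duration_seconds_total[5m])
            /
            vmagent_remotewrite_queues
          ) > 0.9
        # Fire only if the condition holds for 15 minutes in a row.
        for: 15m
```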

Rule expression

A rule’s expr should satisfy the following criteria:

  1. It must describe a problematic system state that genuinely requires action from the on-call engineer. Test the expression against real data to see if it “catches” that problematic state.
  2. Verify that the expression gives the expected results in more than one situation. Try it on longer time intervals and apply it to different environments.
  3. Make sure that the expression returns only the labels you actually need. For example, if you don’t care which specific pod is experiencing connection issues, modify the query to produce one alert per job by wrapping it in max(...) by(job) > 0.9. This approach helps reduce alert noise when multiple pods within the same job are affected.
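
The per-job aggregation from the last point could be sketched like this. The inner expression is an assumed saturation ratio built from vmagent metrics; substitute whatever expression your rule actually uses:

```yaml
- alert: RemoteWriteConnectionIsSaturated
  # max(...) by(job) drops the pod/instance labels, so a job with many
  # affected pods produces a single alert instead of one alert per pod.
  expr: |
    max(
      rate(vmagent_remotewrite_send_duration_seconds_total[5m])
      /
      vmagent_remotewrite_queues
    ) by(job) > 0.9
  for: 15m
```

The trade-off is that the alert no longer says which pod is affected, so it helps to link a dashboard that does.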

Users make many common mistakes when configuring alerting rules, but we want to draw particular attention to the importance of the lookbehind window.

vmalert executes instant queries for rules. Instant queries are limited in how far VictoriaMetrics will look back when retrieving data points. For example, a simple rule like config_reload_error == 1 will only search for data points within a 5-minute window (controlled by -datasource.queryStep). So if the scrape interval of config_reload_error is 5 minutes or longer, this query might miss valid data and produce false negatives, since the expected data point might fall just outside the query’s lookbehind window.

[Figure: Lookbehind window]

In this case, the lookbehind window can be extended globally by setting -datasource.queryStep=15m (to always look back 15 minutes) or by modifying the query to look back more than 5 minutes:

[Figure: Lookbehind window 15m]

Note: even if scrape_interval is <=5min, you should always account for the possibility of data delivery delay. See more details about data delay here.
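
One way to widen the lookbehind window inside the query itself is the last_over_time rollup function. A minimal sketch, where the alert name is hypothetical and the 15m window is only an example value:

```yaml
- alert: ConfigReloadError   # hypothetical alert name
  # last_over_time returns the most recent sample seen in the last 15 minutes,
  # so the rule still finds data when samples arrive less often than every
  # 5 minutes or are delivered with a delay.
  expr: last_over_time(config_reload_error[15m]) == 1
```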

The same issue applies to rollup functions with a lookbehind window that is too short, like rate(http_request_errors_total[1m]). If the scrape_interval of http_request_errors_total is 1 minute, then this expression makes no sense, as it needs to capture at least 2 data points to calculate the rate.

A good rule of thumb is to set the lookbehind window to at least 4x the scrape interval. This helps ensure accuracy and accounts for potential delays or missed scrapes.
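
Applied to the example above, the rule of thumb could look like this sketch. It assumes a 1-minute scrape interval; the alert name and the threshold are hypothetical:

```yaml
- alert: RequestErrorRateIsNonZero   # hypothetical alert name
  # Scrape interval is assumed to be 1m, so the window is 4 x 1m = 4m.
  # The window then normally holds several samples, even if one scrape
  # is missed or delayed.
  expr: rate(http_request_errors_total[4m]) > 0
```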

The for param

The for param defines how long the expr must keep returning data for a time series before the alert actually fires. Its primary purpose is to prevent alert flapping caused by short-lived or transient issues.

For example, it’s normal for a vmagent connection to become temporarily saturated while the remote destination is restarting. But if the saturation persists for more than 15 minutes, it likely indicates a real problem that won’t resolve on its own.

The for parameter is one of the most effective tools for reducing noisy alerts. Some metrics - like CPU usage - are naturally spiky and prone to short bursts of high values. By increasing the for duration, you can filter out these harmless spikes and focus on sustained issues. For example, it helps distinguish between a CPU that occasionally handles heavy workloads and one that remains saturated over an extended period of time.

Note: the longer the for duration, the more time it takes for the alert to fire. Some alerts are too important to wait for 15 or 30 minutes. Choosing the right for value requires a good understanding of the signal you’re monitoring - how it behaves and how quickly you need to react when things go wrong.

The for param is also related to lookbehind window. For example, increase(http_request_errors_total[5m]) counts the number of errors over the last 5 minutes. If there’s even a single increment in that time, the expression will evaluate as true for the full 5-minute window, because the data point remains within the range.

In this case, setting for: 5m doesn’t add much value, since the alert will likely always remain active for at least that long. To make for meaningful in such cases, set it to a value greater than the lookbehind window (e.g., for: 10m when using [5m]) to ensure you’re capturing a sustained condition, not just a single event.
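
A short sketch of this combination, using the metric from the example above (the alert name is an assumption):

```yaml
- alert: RequestErrorsToAPI
  # A single error keeps this expression true for about 5 minutes,
  # because the increment stays inside the [5m] window.
  expr: increase(http_request_errors_total[5m]) > 0
  # Longer than the lookbehind window: the alert fires only when
  # errors keep occurring, not after one isolated event.
  for: 10m
```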

The keep_firing_for param

The opposite of the for param is keep_firing_for. This setting delays alert resolution by keeping the alert active for a specified duration, even if the expr stops returning results.

By default, vmalert waits for the full for interval before firing an alert. However, it only needs one empty evaluation to resolve it. This can lead to alerts resolving too quickly in cases of brief data gaps or missing samples. For example, consider an alerting rule for CPU utilization where usage stays above the threshold long enough for the alert to fire:

[Figure: Volatile CPU signal]

Now imagine that CPU usage drops slightly below the threshold once every 30 minutes - just enough to resolve the alert. A few minutes later, it rises above the threshold again and triggers a new alert.

This results in unnecessary alert noise and constant flapping. By setting a keep_firing_for interval, you can smooth out these fluctuations and avoid repetitive notifications for the same underlying issue.
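
A rough sketch of a CPU rule that uses both params. It assumes node_exporter's node_cpu_seconds_total metric; the alert name, threshold, and durations are example values only:

```yaml
- alert: HighCPUUsage   # hypothetical alert name
  # Fraction of non-idle CPU time per instance over the last 5 minutes.
  expr: |
    1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) > 0.9
  # Wait 15 minutes before firing to skip short spikes.
  for: 15m
  # Keep the alert firing for 10 more minutes after expr stops returning
  # results, so brief dips below the threshold don't resolve it.
  keep_firing_for: 10m
```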

Labels

Labels are metadata attached to each alert generated by a rule. They serve two primary purposes:

  1. Categorization – labels help classify the alert (e.g., by severity, team, or environment), allowing it to be properly routed to the right destination or on-call rotation.
  2. Enrichment – labels can add extra context that isn’t available in the original metric, such as static identifiers or tags useful for downstream processing.

[Figure: Labels]

Categorizing alert notifications is useful for routing. For example, routing by the alert’s severity label can notify the on-call person about warning-type alerts, while critical-type alerts ping the Engineering Manager to signal that something out of the ordinary is happening.

Another example is routing by department. Having labels team: platform and team: engineering can help send application-related alerts to developers, while alerts related to the platform will be sent to platform engineers.

Enriching alerts with additional information is especially useful when the same set of alerting rules is deployed across multiple environments. For example, if an alerting rule is running in the EMEA region, you can attach a label like region="EMEA". This allows the on-call engineer to immediately identify which region is affected, without needing to dig into the metric data.

Note: one of the common mistakes is setting label values to something that changes dynamically, like $value. Since $value changes on every rule evaluation, it changes the alert’s label set, which makes the alert look like a new one and resets the for duration.
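
The label usage described in this section could be sketched as follows. The label values are examples, and the rule itself is only a placeholder to attach them to:

```yaml
- alert: RequestErrorsToAPI
  expr: increase(http_request_errors_total[5m]) > 0
  for: 10m
  labels:
    severity: warning    # categorization: route by severity
    team: engineering    # categorization: route by department
    region: EMEA         # enrichment: static context for this deployment
    # Avoid dynamic values such as `errors: "{{ $value }}"` here:
    # the label set would change on every evaluation and reset `for`.
```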

Annotations

Annotations are a great way to provide more context about the alert or link to helpful resources.

[Figure: Annotations]

In the example above, the summary and description annotations serve as a simplified runbook. The reason for using annotations for this information instead of labels is that annotations are not stored in VictoriaMetrics. They’re only stored as part of the alert, making them an ideal place for detailed messages, dashboard links, and other long strings that would be challenging to store in VictoriaMetrics.

Ideally, alerts should include clear, actionable instructions directly in the notification, so engineers don’t need to look up an external runbook. If you can briefly explain how to respond to an alert, include that guidance in the annotations.

Another good example is the dashboard annotation. It contains a link to a specific panel on a VictoriaMetrics Grafana dashboard. When clicked, it takes the on-call engineer directly to a visual overview of the issue, showing historical context, related metrics, and other signals that can help diagnose and resolve the problem more effectively.

As you can see, we heavily use templating in annotations to enrich each unique alerting notification with personalized information.

It’s OK to use templates like $value or $labels in annotations, as annotations aren’t taken into account during for checks.
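
For illustration, a rule carrying the summary, description, and dashboard annotations discussed in this section might look like the sketch below. The wording of the messages, the metric names, and the dashboard URL are assumptions, not the content of the original figure:

```yaml
- alert: RemoteWriteConnectionIsSaturated
  expr: |
    (
      rate(vmagent_remotewrite_send_duration_seconds_total[5m])
      /
      vmagent_remotewrite_queues
    ) > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    # Short, human-readable statement of what is wrong and where.
    summary: 'Remote write connection from "{{ $labels.job }}" (instance {{ $labels.instance }}) is saturated'
    # Mini-runbook: what the alert means and what to try first.
    description: >-
      The remote write connection of vmagent "{{ $labels.job }}" is saturated
      by more than 90% (current value: {{ $value }}), so vmagent may not keep up.
      Consider increasing the number of remote write queues.
    # Hypothetical Grafana URL, pre-filtered to the affected instance.
    dashboard: 'https://grafana.example.com/d/vmagent?var-instance={{ $labels.instance }}'
```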

Annotations can be additionally enriched by executing arbitrary MetricsQL queries via the query() template function:

annotations:
 message: |
   The configuration of the instances of the Alertmanager cluster `{{ $labels.namespace }}/{{ $labels.service }}` are out of sync.
   {{ range printf "alertmanager_config_hash{namespace=\"%s\",service=\"%s\"}" $labels.namespace $labels.service | query }}
   Configuration hash for pod {{ .Labels.pod }} is "{{ printf "%.f" .Value }}"
   {{ end }}

The message annotation above makes an extra query call to fetch the alertmanager_config_hash metric for the triggered alert and prints it in the annotation text.

Improving user experience

vmalert and Alertmanager automatically add extra information to all alerts, such as links to Alertmanager, links to silence the alert, and a link to the alerting rule that generated the alert. By default, these links usually point to internal service URLs that users can’t access, so in most cases they need to be changed. The -external.url and -external.alert.source command-line flags in vmalert control the external links users see in Alertmanager and in the notifications it sends. To make these links useful, configure them to point to something users can access, such as Grafana.
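
As a rough sketch, these flags could be passed to vmalert in a Docker Compose service like the one below. The service names, hostnames, and rule path are hypothetical, and the -external.alert.source template is adapted from the Grafana example in the flag's description, so verify it against your vmalert and Grafana versions:

```yaml
services:
  vmalert:
    image: victoriametrics/vmalert   # pin a specific version in practice
    command:
      - '--datasource.url=http://victoriametrics:8428'
      - '--notifier.url=http://alertmanager:9093'
      - '--rule=/etc/alerts/*.yml'
      # Publicly reachable base URL; also available as $externalURL in templates.
      - '--external.url=https://grafana.example.com'
      # Template for the alert's source link: opens the rule's expression
      # in Grafana Explore.
      - '--external.alert.source=explore?orgId=1&left={"datasource":"VictoriaMetrics","queries":[{"expr":{{.Expr|jsonEscape|queryEscape}},"refId":"A"}],"range":{"from":"now-1h","to":"now"}}'
```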

Configuring -external.url makes the $externalURL variable available in annotations, which makes it easier to share rules across environments. For example:

- alert: Empty Alert Rules found
  expr: 'max(vmalert_alerting_rules_last_evaluation_series_fetched) by(group, alertname) == 0'
  annotations: 
    summary: empty alerting rules found
    description: "{{ $labels.alertname }} in {{ $labels.group }} does not match any series"
    dashboard: '{{ $externalURL }}/d/LzldHAVnz_vm/victoriametrics-vmalert-vm'

The rule above could be applied to multiple environments without any changes, even if the dashboard URL differs between them.

Alerts history

“If you want to know the future, look at the past.”

During alerting rule evaluation, vmalert persists alert state changes in the form of time series with the names ALERTS and ALERTS_FOR_STATE. Using these metrics, we can see the history of alert state changes. For this purpose, we have built a Grafana dashboard for alert statistics (the original is attributed to Alexander Marshalov):

[Figure: Alerts history]

With the help of the dashboard, we can see which alerts were too noisy and which alerts have never fired. Both cases are suspicious.

When dealing with alert fatigue, use this dashboard to find the noisiest alerting rules and inspect their configurations for possible optimizations. Remember, every alert should be actionable. If there is no action to take on a firing alert, it shouldn’t exist.
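
If you prefer a query to a dashboard, a similar signal can be derived from the ALERTS series directly. The recording rule below is only a sketch: the group and record names are hypothetical, and it assumes vmalert is configured to persist alert state (for example via -remoteWrite.url) and that ALERTS carries the alertname and alertstate labels:

```yaml
groups:
  - name: alerts-history   # hypothetical group name
    interval: 5m
    rules:
      # Number of "firing" samples per alerting rule over the last day.
      # High values point at noisy rules; rules that never show up here
      # have not fired at all.
      - record: alertname:alerts_firing:count_over_time_1d
        expr: sum(count_over_time(ALERTS{alertstate="firing"}[1d])) by(alertname)
```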

See the Grafana dashboard here.

Reducing noise

Usually, the first alerts are defined for relatively slow workloads. For example, the first alert we created for our service looked like this:

- alert: RequestErrorsToAPI
  expr: increase(http_request_errors_total[5m]) > 0

It catches an unwanted state of the application generating errors for client requests.

This alert was very helpful when we ran one or two replicas of the application. But once we scaled to hundreds of replicas across many regions, receiving an alert for each failing replica might be overkill. Instead, we can modify the expression to notify us only about a specific region experiencing issues:

- alert: RequestErrorsToAPI
  expr: sum(increase(http_request_errors_total[5m])) by(region) > 0

With the updated expression, we will receive only one alert per region. So even if many replicas within the region start serving errors, we will receive only one firing alert. It will still be actionable, and it can contain links to a dashboard that shows the situation in more detail. But we won’t be overwhelmed with too many notifications.

An even better approach may be to define an error budget and send alerts only when this budget is burning too fast. This approach assumes that some level of errors is acceptable and notifies engineers only when the promised service level objective is at risk of being breached.
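
A simplified, single-window sketch of a burn-rate alert is shown below. It assumes a 99.9% availability SLO over 30 days (an error budget of 0.001) and a hypothetical http_requests_total counter next to http_request_errors_total. The factor 14.4 corresponds to spending 2% of the 30-day budget within one hour (0.02 x 720h = 14.4):

```yaml
- alert: ErrorBudgetBurnRateTooHigh   # hypothetical alert name
  # Error ratio over the last hour compared with 14.4x the allowed ratio.
  expr: |
    (
      sum(rate(http_request_errors_total[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical
```

Production setups usually combine several windows and burn rates; this sketch only shows the basic idea.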

Sometimes, alerts start firing because of incidents that are out of our control: a power outage, a datacenter failure, etc. These events can start a cascade of alert notifications because monitored services depend on connectivity. Receiving thousands of alerts at once can be overwhelming, so we recommend configuring inhibition rules in Alertmanager. Inhibition mutes a set of alerts based on the presence of another set of alerts.
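
A minimal sketch of such an inhibition rule in the Alertmanager configuration. The alert name DatacenterUnreachable and the datacenter label are hypothetical:

```yaml
inhibit_rules:
  # While DatacenterUnreachable is firing for a datacenter...
  - source_matchers:
      - 'alertname = "DatacenterUnreachable"'
    # ...mute warning and critical alerts...
    target_matchers:
      - 'severity =~ "warning|critical"'
    # ...but only those that carry the same datacenter label value.
    equal:
      - datacenter
```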

Testing alerts

Above, we recommended testing alerting rule expressions before applying them. But just running them in Grafana Explore or vmui may not be representative, as such a query doesn’t account for the for or keep_firing_for params.

As a better approach, we recommend using vmalert-tool for unit-testing rules. Tests verify that an expression is correct and give confidence when changing it. It is also good practice to run such tests in Continuous Integration (CI) whenever rule definitions change.
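
A test file for vmalert-tool could look roughly like the sketch below. It assumes a rules.yaml file containing a group named api with the RequestErrorsToAPI rule and for: 10m; the file name, group name, and instance label are hypothetical, and the exact set of expected labels may differ depending on your rule:

```yaml
rule_files:
  - rules.yaml   # assumed to define RequestErrorsToAPI in group "api"

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Error counter that grows by 1 every minute for 30 minutes.
      - series: 'http_request_errors_total{instance="api-1"}'
        values: '0+1x30'
    alert_rule_test:
      # Errors have been occurring for longer than `for: 10m`,
      # so the alert is expected to be firing at this point.
      - eval_time: 20m
        groupname: api
        alertname: RequestErrorsToAPI
        exp_alerts:
          - exp_labels:
              instance: api-1
```

Such a file is typically run with the unittest command of vmalert-tool, passing the file via --files.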

vmalert also supports a backfilling mechanism called replay. With replay, it is possible to run alerting rules against past production data to see when alerts would or would not have triggered. Results of replay can be verified via the Alerts history dashboard.

Summary

Proper alerting is an art. It is all about foreseeing bad scenarios before they happen, so you can prepare for them.

The VictoriaMetrics ecosystem provides all the tools required for defining, testing and monitoring alerting processes. Please refer to the following resources:

  1. Prometheus Alerting 101: Rules, Recording Rules, and Alertmanager
  2. VictoriaMetrics Monitoring
  3. Never-firing alerts: What are they and how to deal with them
  4. https://docs.victoriametrics.com/victoriametrics/vmalert
  5. https://docs.victoriametrics.com/victoriametrics/vmalert-tool/