Key metrics for monitoring Karpenter
2026-03-11 · via Datadog | The Monitor blog

In Part 1 of this series, we explored how Karpenter’s architecture enables just-in-time provisioning and active node consolidation. Because Karpenter is constantly making infrastructure decisions based on real-time scheduling pressure, its metrics can give you early warning of provisioning slowdowns, cloud API throttling, and misconfigurations that prevent it from scaling the way you expect. In this post, we’ll show you key metrics you can monitor to understand Karpenter’s behavior and performance. As you collect Karpenter metrics, note that each one is marked as STABLE, BETA, ALPHA, or DEPRECATED. BETA and ALPHA metrics are useful, but they’re more likely to change across versions, so you should treat them as a signal to double-check your dashboards after upgrades.

Track Karpenter metrics to monitor performance

Karpenter exposes Prometheus-formatted metrics via an HTTP endpoint at /metrics on the Karpenter controller. The default metrics port is 8080. This can be overridden at install time via the METRICS_PORT environment variable.

You can collect Karpenter metrics by using either of two approaches. If you use the Prometheus Operator with a ServiceMonitor, you can determine the metrics endpoint port by examining the Karpenter service, such as with this command:

kubectl -n karpenter get svc karpenter -o wide

If you use a standard Prometheus scrape config to collect metrics, you can determine the port by inspecting the controller’s METRICS_PORT setting.
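To make the collection step more concrete, here is a minimal Python sketch of how Prometheus text-format output, such as what the /metrics endpoint returns, can be parsed into samples. The parse_metrics helper, the sample payload, and its label values are assumptions made up for illustration, not real Karpenter output. In practice you would let Prometheus or the Datadog Agent scrape the endpoint instead of parsing it yourself.

```python
import re

# One sample line: metric name, optional {labels}, then a value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload for illustration only.
EXAMPLE = """\
# TYPE karpenter_scheduler_queue_depth gauge
karpenter_scheduler_queue_depth{controller="provisioner"} 3
karpenter_nodeclaims_created_total{nodepool="default"} 42
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

The parser ignores timestamps and exemplars, which is enough for a quick look at which Karpenter metrics an endpoint exposes.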

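In this section, we’ll describe the key metrics you should monitor to track Karpenter’s health and performance. Specifically, we’ll look at metrics from these categories: scheduling and pod life cycle, disruption and consolidation, cloud provider, controller internals and cluster state, and cost optimization and interruption.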

Scheduling and pod life cycle metrics

If Karpenter is working well, you should see a predictable pattern during scale-out: The Kubernetes scheduler marks pods unschedulable, Karpenter reacts by creating capacity, nodes join the cluster, and pods transition to running. The metrics in this section help you measure that end-to-end experience and then pinpoint the source of any latency that arises.

Metric to alert on: karpenter_pods_startup_duration_seconds

This metric measures the total time from a pod being created to that pod reaching a running state, which makes it the best end-to-end indicator of how long scale-out takes. If your primary goal is to alert when users are likely to feel latency in the scaling process, this is the metric to use, because it reflects what workloads experience rather than what any single component is doing. If you observe an increase in this metric, look for the cause in upstream processes such as Karpenter’s scheduling simulation, cloud API latency or errors, or node life cycle delays.

Metric to watch: karpenter_scheduler_scheduling_duration_seconds

Karpenter’s scheduler simulation time helps you understand whether a delay is occurring before Karpenter reaches out to the cloud provider. You may see this metric increase when Karpenter has to evaluate more possibilities to satisfy constraints—the rules that limit which nodes a pod can run on and what capacity Karpenter is allowed to provision. Those constraints can come from the workload (for example, node selectors, taints and tolerations, or large resource requests), from your NodePool requirements (such as restricting instance families, zones, or capacity type), or from placement rules that narrow options (like pod anti-affinity or topology spread). The tighter the constraints, the smaller the set of valid options and the more work Karpenter may need to do before it finds a viable match.

Metric to alert on: karpenter_scheduler_queue_depth

Queue depth tracks the number of pods currently waiting to be scheduled by Karpenter. A queue that rises briefly during bursts and then drains is normal. But if the queue grows and stays elevated, Karpenter isn’t keeping up. That can happen because Karpenter is taking longer than usual to evaluate feasible capacity (often due to more complex scheduling requirements), because it’s retrying failed requests, or because it’s blocked downstream—for example, when the cloud provider can’t supply capacity that matches your requirements.

This metric provides an early warning of issues that cause Karpenter to fall behind, and it often signals the problem before the worst impact is visible. You should investigate the cause of the slowdown by looking for correlated latency or errors in Karpenter’s scheduling, and for cloud provider errors that indicate that the requested capacity is unavailable.

To pinpoint the cause, correlate it with karpenter_scheduler_scheduling_duration_seconds to see if scheduling is slow. You can also look for correlations with the karpenter_cloudprovider_duration_seconds and karpenter_cloudprovider_errors_total metrics to see if the calls to the cloud provider API are slow or failing.
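The difference between a burst that drains and a queue that stays elevated can be sketched in a few lines of Python. The function names, the threshold of 10 pods, the window of four samples, and the sample values are all hypothetical. Tune them to your own scrape interval and workload.

```python
def sustained_above(samples, threshold, min_consecutive):
    """True if the latest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def suspects(scheduling_slow, cloud_slow, cloud_errors):
    """Map correlated signals to the areas worth investigating first."""
    found = []
    if scheduling_slow:
        found.append("scheduling simulation (complex constraints)")
    if cloud_slow:
        found.append("cloud provider API latency")
    if cloud_errors:
        found.append("cloud provider errors (capacity, quota, throttling)")
    return found or ["no correlated signal; check Karpenter logs"]


# Hypothetical karpenter_scheduler_queue_depth samples, oldest first.
burst = [0, 12, 30, 8, 0, 0]      # rises briefly, then drains
stuck = [0, 12, 30, 28, 31, 35]   # grows and stays elevated

print(sustained_above(burst, threshold=10, min_consecutive=4))
print(sustained_above(stuck, threshold=10, min_consecutive=4))
print(suspects(scheduling_slow=False, cloud_slow=True, cloud_errors=False))
```

Requiring several consecutive samples above the threshold is one simple way to avoid alerting on normal bursts.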

Disruption and consolidation metrics

Karpenter’s disruption features—consolidation, drift remediation, and other voluntary node replacement—are where cost and efficiency gains often come from. The tricky part is that disruption depends on factors outside Karpenter’s control (evictions, budgets, and workload behavior). The metrics in this section reveal how actively Karpenter is optimizing the cluster by removing underutilized or drifted nodes.

Metric to watch: karpenter_voluntary_disruption_eligible_nodes

This metric indicates whether Karpenter is finding opportunities to save money. A consistently large number of eligible nodes means Karpenter is identifying candidates but is unable to disrupt them. This can happen if pods on the node are protected by PodDisruptionBudgets (PDBs) or if Karpenter can’t create suitable replacement capacity.

A low count in a well-packed cluster is normal. But few disruption-eligible nodes in a cluster with visibly underutilized nodes (high karpenter_nodes_allocatable but low karpenter_nodes_total_pod_requests) may indicate that consolidation is disabled or blocked.
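As a rough illustration of that comparison, the following Python sketch flags a cluster where requested capacity is a small fraction of allocatable capacity while almost no nodes are eligible for disruption. The helper names and the cutoffs (50% requested, fewer than one eligible node) are assumptions chosen for the example, not Karpenter defaults.

```python
def requested_fraction(allocatable, requested):
    """Fraction of allocatable capacity that pods have requested."""
    return requested / allocatable if allocatable else 0.0


def consolidation_looks_blocked(allocatable_cpu, requested_cpu, eligible_nodes,
                                max_fraction=0.5, min_eligible=1):
    """True when nodes look underutilized yet few are eligible for disruption."""
    underutilized = requested_fraction(allocatable_cpu, requested_cpu) < max_fraction
    return underutilized and eligible_nodes < min_eligible


# Hypothetical values summed from karpenter_nodes_allocatable and
# karpenter_nodes_total_pod_requests for the cpu resource.
print(consolidation_looks_blocked(allocatable_cpu=64, requested_cpu=16, eligible_nodes=0))
print(consolidation_looks_blocked(allocatable_cpu=64, requested_cpu=56, eligible_nodes=0))
```

The second call represents a well-packed cluster, where a low eligible count is expected and nothing is flagged.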

Metric to alert on: karpenter_nodeclaims_termination_duration_seconds

This metric measures the time from a deletion request to the final removal of the NodeClaim. Ideally, nodes get deleted quickly to head off unnecessary cloud costs. But if the process of draining workloads from the node stalls (which can happen if a PodDisruptionBudget blocks eviction), termination duration and cloud costs can remain elevated. If the tail latency of termination duration rises, look in your Karpenter logs for contributors such as workloads that can’t be evicted, stuck finalizers, or prolonged draining behavior.
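Tail latency is usually read as a high quantile such as p99. As a sketch of how that number is derived, assuming the metric is exposed as a Prometheus histogram with cumulative le buckets (worth confirming for your Karpenter version), the Python function below estimates a quantile by linear interpolation in the style of PromQL’s histogram_quantile(). The bucket bounds and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The highest bucket is expected to be the +Inf bucket, whose count is
    the total number of observations.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: (upper bound in seconds, count).
termination_buckets = [(30, 50), (60, 90), (120, 98), (300, 100), (math.inf, 100)]

print(histogram_quantile(0.50, termination_buckets))
print(histogram_quantile(0.99, termination_buckets))
```

Because the estimate interpolates inside a bucket, its precision depends on how finely the bucket bounds are spaced around the quantile you care about.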

Metrics to watch: karpenter_nodeclaims_created_total, karpenter_nodeclaims_terminated_total

Use these counters to confirm that Karpenter is both adding capacity when needed and removing underutilized nodes when it can. Start by monitoring NodeClaims created, which increments whenever Karpenter creates a NodeClaim in response to scheduling demand. Pair it with the terminated metric to track the raw volume of node churn over time. If Karpenter is not terminating instances even when karpenter_voluntary_disruption_eligible_nodes is above zero, the termination controller is likely blocked. You can investigate this issue by looking at Karpenter logs and the karpenter_nodeclaims_termination_duration_seconds metric.

Metric to alert on: karpenter_nodeclaims_disrupted_total (ALPHA)

To understand why nodes are turning over—for example, whether due to consolidation, drift remediation, or expiration—you can look at this metric’s reason label.

A rise in karpenter_nodeclaims_disrupted_total{reason="registration_timeout"} indicates that nodes were increasingly unable to join the cluster. You’ll see a corresponding increase in the karpenter_nodeclaims_created_total metric but no increase in usable nodes.

You should alert on a rise in these timeouts. When this alert fires, you know the issue isn’t with Karpenter’s scheduling decisions, and you should instead investigate node launch failure modes like IAM or UserData misconfigurations, networking blockages, or bad AMIs. Because this metric is marked as ALPHA, you should verify after each Karpenter upgrade that the metric name and reason label values are unchanged before relying on this alert.
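Since this metric is a counter, "a rise" means an increase over a time window rather than a high absolute value. The Python sketch below shows one way to compute that increase while tolerating counter resets, such as those caused by a controller restart. The counter_increase helper, the sample values, and the alert threshold of 3 are hypothetical.

```python
def counter_increase(samples):
    """Total increase across ordered counter samples, tolerating resets."""
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter went backward, so treat it as a reset and count from zero.
            increase += current
    return increase


# Hypothetical samples of
# karpenter_nodeclaims_disrupted_total{reason="registration_timeout"}
# taken over one alert window, oldest first.
steady = [3, 3, 3, 3]
rising = [3, 3, 5, 9]
with_restart = [3, 5, 1, 4]

ALERT_THRESHOLD = 3  # assumed value, tune for your cluster

for window in (steady, rising, with_restart):
    increase = counter_increase(window)
    print(increase, increase >= ALERT_THRESHOLD)
```

Monitoring systems typically provide this calculation for you, for example through a rate or increase function, so the sketch is mainly useful for reasoning about what the alert evaluates.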

Cloud provider metrics

You may experience Karpenter issues that are actually the effect of cloud provider limits that arise during rapid provisioning. These include capacity shortages, quota limits, throttling, and API latency. The metrics in this section can surface the root cause of Karpenter performance lags by highlighting the specific error rates and latencies within your cloud provider’s provisioning APIs.

Metrics to watch: karpenter_cloudprovider_errors_total, karpenter_cloudprovider_duration_seconds

These metrics surface problems that often look like Karpenter issues—such as pending pods, slow scale-out, or stalled replacements—but are actually driven by API failures or latency in the cloud provider layer.

The karpenter_cloudprovider_errors_total metric counts Karpenter requests to the cloud provider API that result in an error. You can filter on the metric’s error label to understand why requests are failing. For example, to track throttling specifically, look at karpenter_cloudprovider_errors_total{error="RequestLimitExceeded"}. To find requests that failed because the desired instance type was unavailable, look at error="InsufficientInstanceCapacity".

The karpenter_cloudprovider_duration_seconds metric measures the latency of Karpenter’s requests to the cloud provider API. An increase here indicates that scale-out will slow down even when requests are succeeding. The slowdown multiplies as Karpenter has more work to do and the rate of API calls goes up.

You can correlate these two metrics with the end-to-end scale-out metric, karpenter_pods_startup_duration_seconds. If startup is slowing and cloud provider errors are rising, Karpenter is being blocked by cloud API failures. When both startup duration and cloud provider duration increase, the issue is probably with the cloud provider, not with Karpenter’s scheduling performance.
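The correlation logic described above can be written as a small decision function. This Python sketch is only a way to reason about the signals. The function name and the boolean inputs are assumptions, and deciding what counts as "rising" is left to your alerting thresholds.

```python
def diagnose_scale_out(startup_rising, cloud_errors_rising, cloud_duration_rising):
    """Suggest where to look first, given trends in three Karpenter metrics.

    startup_rising        -> karpenter_pods_startup_duration_seconds
    cloud_errors_rising   -> karpenter_cloudprovider_errors_total
    cloud_duration_rising -> karpenter_cloudprovider_duration_seconds
    """
    if not startup_rising:
        return "no end-to-end slowdown observed"
    if cloud_errors_rising:
        return "Karpenter is being blocked by cloud API failures"
    if cloud_duration_rising:
        return "cloud provider latency is the probable cause"
    return "check Karpenter scheduling and node life cycle delays"


print(diagnose_scale_out(True, True, False))
print(diagnose_scale_out(True, False, True))
print(diagnose_scale_out(True, False, False))
```

Checking errors before latency reflects that failing requests usually call for a different response (quota, capacity, or throttling fixes) than slow ones.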

Controller internals and cluster state metrics

These metrics provide critical information about Karpenter’s health to help you see whether Karpenter is keeping up with the rate of change in your cluster. Specifically, the metrics here can help you understand whether the Kubernetes API latency plays a role in any observed Karpenter latency.

Metric to alert on: controller_runtime_reconcile_time_seconds

This metric measures the duration of Karpenter’s reconciliation loops across all controllers, including pod scheduling (detecting pending pods, running scheduling simulations, and creating NodeClaims), NodeClaim life cycle management, and disruption. An increase in the time Karpenter takes to process each unit of work can indicate that the controller is overwhelmed or waiting on slow Kubernetes API calls.

Use the controller label to isolate which reconciler is slow. A sustained increase in the provisioner controller indicates scheduling pressure. In the disruption controller, it may reflect PDB contention or blocked consolidation.

Metrics to watch: controller_runtime_reconcile_total, controller_runtime_reconcile_errors_total

The counts of reconciliations and errors give you a quick understanding of Karpenter’s throughput and health. Watch these alongside reconcile time when you’re troubleshooting slow provisioning. If reconcile time and errors rise together for the controllers that reconcile pods, NodeClaims, and NodePools, Karpenter may be stuck in a retry loop. But if reconcile time rises while error counts remain flat, look for cloud provider latency or other downstream bottlenecks.
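One way to compare these two counters is as a per-controller error ratio over a time window. The Python sketch below assumes you have already computed the increase of each counter per controller label. The controller names and numbers are hypothetical.

```python
def error_ratios(reconciles, errors):
    """Per-controller error ratio from counter increases over the same window.

    Controllers with no reconciliations in the window are skipped to avoid
    dividing by zero.
    """
    return {
        controller: errors.get(controller, 0) / count
        for controller, count in reconciles.items()
        if count > 0
    }


# Hypothetical increases over a 5-minute window, keyed by controller label.
reconcile_increase = {"provisioner": 200, "disruption": 50, "nodeclaim.lifecycle": 0}
error_increase = {"provisioner": 10}

for controller, ratio in error_ratios(reconcile_increase, error_increase).items():
    print(f"{controller}: {ratio:.1%} of reconciliations failed")
```

A ratio that climbs together with reconcile time for the same controller is the pattern to treat as a possible retry loop.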

Metric to watch: workqueue_depth

If the depth of the controller’s work queue is consistently high, it signals that the controller can’t keep up with the pace of changes in the cluster. Karpenter can fall behind if its reconciliation loop is slowing down (see controller_runtime_reconcile_time_seconds) or if the controller is blocked and retrying. As a result, Karpenter is slow to replace nodes and scale out your application, even when cloud-provider capacity is available.

Cost optimization and interruption metrics

Karpenter optimizes your Kubernetes operations in two ways: It cuts costs through instance consolidation and shields your workloads from node churn. The metrics in this section can help you validate that Karpenter is making cost-aware choices and ensure that its interruption frequency doesn’t contribute to instability.

Metric to watch: karpenter_cloudprovider_instance_type_offering_price_estimate

This metric provides the estimated price for instance types that Karpenter considers. Compare this metric to cloud provider prices to verify that you’re launching cost-efficient instances, especially after you change NodePool requirements.

Note, though, that this metric has high cardinality. Its labels can include instance type, region, zone, and capacity type, so enabling it broadly can increase your monitoring costs. You can manage this by filtering aggressively and limiting its use to ad hoc inquiries rather than ongoing dashboard use.
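To see why filtering matters, consider a back-of-the-envelope estimate: the worst-case number of time series is the product of the distinct values of each label. The label keys and counts in this Python sketch are assumptions for illustration. Check the actual labels on your Karpenter version.

```python
import math


def worst_case_series(label_value_counts):
    """Upper bound on time series: product of distinct values per label."""
    return math.prod(label_value_counts.values())


# Hypothetical counts of distinct label values in one region.
unfiltered = {"instance_type": 600, "region": 1, "zone": 3, "capacity_type": 2}
# The same metric after filtering to 20 instance types of interest.
filtered = {"instance_type": 20, "region": 1, "zone": 3, "capacity_type": 2}

print(worst_case_series(unfiltered))
print(worst_case_series(filtered))
```

The real series count is usually lower, because not every instance type is offered in every zone and capacity type, but the product gives a safe upper bound for cost planning.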

Metric to watch: karpenter_interruption_received_messages_total

This metric counts the number of Spot interruption or maintenance events received from the cloud provider. If you see this rise, it means that your cluster is churning more frequently than usual. This increased rate of node replacements can create brief capacity gaps—where nodes are terminated faster than replacement nodes register—so more pods may be stuck pending. That volatility can impact your application’s performance.

A corresponding rise in karpenter_pods_startup_duration_seconds confirms that these interruptions aren’t just transparent cost optimization, but are contributing to measurable workload impact.

Gain visibility into your just-in-time provisioning

Monitoring Karpenter requires more than just tracking CPU usage; it requires visibility into the decisions your autoscaler is making. By correlating scheduling latency with cloud provider errors and consolidation logs, you can ensure your cluster remains both performant and cost efficient. In Part 3 of this series, we’ll look at vendor-agnostic tooling for monitoring Karpenter. In Part 4, we’ll explore how to visualize these costs by using Datadog Cloud Cost Management to see exactly how your NodePool policies impact your bottom line.