惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

博客园 - 司徒正美
大猫的无限游戏
大猫的无限游戏
Scott Helme
Scott Helme
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
S
Secure Thoughts
Google DeepMind News
Google DeepMind News
博客园_首页
Hacker News: Ask HN
Hacker News: Ask HN
量子位
Jina AI
Jina AI
I
InfoQ
V
V2EX
Martin Fowler
Martin Fowler
Y
Y Combinator Blog
H
Hackread – Cybersecurity News, Data Breaches, AI and More
人人都是产品经理
人人都是产品经理
B
Blog
IT之家
IT之家
云风的 BLOG
云风的 BLOG
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
博客园 - Franky
博客园 - 【当耐特】
N
Netflix TechBlog - Medium
Cloudbric
Cloudbric
H
Heimdal Security Blog
TaoSecurity Blog
TaoSecurity Blog
S
Security @ Cisco Blogs
U
Unit 42
Project Zero
Project Zero
Webroot Blog
Webroot Blog
The Register - Security
The Register - Security
N
News | PayPal Newsroom
Microsoft Security Blog
Microsoft Security Blog
H
Help Net Security
Forbes - Security
Forbes - Security
宝玉的分享
宝玉的分享
Last Week in AI
Last Week in AI
C
Check Point Blog
博客园 - 聂微东
M
MIT News - Artificial intelligence
有赞技术团队
有赞技术团队
D
DataBreaches.Net
Cyberwarzone
Cyberwarzone
N
News and Events Feed by Topic
N
News and Events Feed by Topic
Simon Willison's Weblog
Simon Willison's Weblog
J
Java Code Geeks
G
Google Developers Blog
GbyAI
GbyAI
T
Threatpost

Datadog | The Monitor blog

Introducing our open source AI-native SAST Instrument and monitor Boomi integration flows with OpenTelemetry and Datadog Not all index scans are equal: How we cut query latency by over 99% Platform engineering metrics: What to measure and what to ignore Integrate Recorded Future threat intelligence with Datadog Cloud SIEM CI/CD security: threat modeling using a MITRE-style threat matrix CI/CD security: How to secure your GitHub ecosystem Ingress NGINX is EOL: A practical guide for migrating to Kubernetes Gateway API Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA Introducing the Datadog Code Security MCP Capture and analyze custom heatmaps in Session Replay Understand session replays faster with AI summaries and smart chapters Monitor ClickHouse query performance with Datadog Database Monitoring How we designed empathetic alert sounds for on-call engineers Search and act across Datadog to resolve issues faster with Bits Assistant Measure the business impact of every product change with Datadog Experiments Analyzing round trip query latency Configuring JavaScript caches for better performance Introducing Bits AI Dev Agent for Code Security Datadog achieves ISO 42001 certification for responsible AI Monitor Nutanix clusters, hosts, and VMs with Datadog Monitor Juniper Mist in Datadog A new Host Map for modern infrastructure Annotate traces to improve LLM quality with Datadog LLM Observability What’s new in Cloud SIEM: AI-powered investigations, enhanced threat intelligence, and scalable security operations Explore Kubernetes with native OpenTelemetry data Monitor Oracle Fusion Cloud Applications with Datadog Announcing the Datadog Terraform provider v4.0.0 Scaling Kubernetes workloads on custom metrics How to design cloud environments for AI-powered threat analysis Monitor Aruba Central in Datadog How we centralize and remediate risks with Datadog Case Management Accelerate incident response with Datadog and ServiceNow Monitor your application and network load balancer logs Understanding Karpenter architecture for Kubernetes autoscaling Tools for collecting metrics and logs from Karpenter Monitor Karpenter with Datadog What your product data is actually saying Key metrics for monitoring Karpenter Securing Datadog’s platform in the AI age: The role of observability data Four ways engineering teams use the Datadog MCP Server to power AI agents Approaching your observability migration with the right mindset Meet the new Bits AI SRE: Deeper reasoning, twice as fast Key learnings from the 2026 State of DevSecOps study Use plain English to query your multi-cloud infrastructure in Resource Catalog Simplifying troubleshooting across the user journey with Datadog Synthetic Monitoring Protect your OCI resources with Datadog Cloud Security This Month in Datadog - February 2026 Amazon EC2 security: How misconfigured and public AMIs expand your cloud attack surface Enable end-to-end visibility into your Java apps with a single command Measure and improve mobile app startup performance with Datadog RUM Evaluating our AI Guard application to improve quality and control cost Identify untested code across every level of your codebase Make use of guardrail metrics and stop babysitting your releases Monitor Versa Networks SD-WAN performance in Datadog Improve performance and reliability with APM Recommendations Remediate transitive vulnerabilities faster with Datadog Software Composition Analysis Generate audit-ready vulnerability and compliance reports with Datadog Sheets Monitor Fortinet FortiManager performance in Datadog Improve test coverage across codebases with Datadog Code Coverage Move fast, don’t break things: Consistent testing standards at scale Enrich logs with ServiceNow CMDB context before routing to any SIEM or logging tool Monitor Lustre with Datadog Make faster, better product decisions with Datadog Product Analytics Surface and remediate runtime posture issues with Workload Protection Findings Protect agentic AI applications with Datadog AI Guard How to optimize JavaScript code with CSS Trace Google Pub/Sub workloads in Cloud Run with Datadog Detect human names in logs with ML in Sensitive Data Scanner How we cut our NLQ agent debugging time from hours to minutes with LLM Observability Debug PostgreSQL query latency faster with EXPLAIN ANALYZE in Datadog Database Monitoring Datadog acquires Propolis Unify and correlate frontend and backend data with retention filters Scale compliance across global frameworks with Datadog Cloud Security Monitor Arista VeloCloud SD-WAN performance with Datadog Building reliable dashboard agents with Datadog LLM Observability Simplify log collection and aggregation for MSSPs with Datadog Observability Pipelines Mitigation for Node.js denial-of-service vulnerability affecting Datadog APM Automate flaky test fixes with the Bits AI Dev Agent and Test Optimization How we built an AI SRE agent that investigates like a team of engineers Datadog integrations 2025 recap: Observability for AI, security, and hybrid cloud Design effective executive dashboards with Datadog Implement dbt data quality checks with dbt-expectations Bring faster visibility into AWS Lambda functions with remote instrumentation Troubleshoot faster with the GitLab Source Code integration in Datadog How Cambia Health Solutions saved $30,000 monthly with Cloud Cost Management and the Datadog Resource Catalog Normalize any logs for Cloud SIEM with Datadog's OCSF processor Optimizing Datadog at scale: Cost-efficient observability at Zendesk Detect, diagnose, and resolve network issues easily with CNM Network Health Connect engineering errors to user impact in early-stage products Cilium configuration for Kubernetes operations at scale Designing feedback loops for progressive delivery Ship features faster and safer with Datadog Feature Flags Choosing the right OpenTelemetry Collector distribution Route your monitor alerts with Datadog monitor notification rules Automate Cloud SIEM investigations with Bits AI Security Analyst Cloud threat detection: How to identify risky activity across control and data planes Collecting Kafka performance metrics Monitoring Kafka with Datadog Monitoring Kafka performance metrics
Monitor Ceph: From node status to cluster-wide performance
2016-09-08 · via Datadog | The Monitor blog

Ceph is an open source, distributed object store and file system designed for scalability, reliability, and performance. It provides an S3- and Openstack Swift–compatible API for storing, retrieving, and managing objects.

Ceph has native CLI tools that can check the health of a running cluster. But if you rely on Ceph in production, you will need dedicated monitoring to provide historical and infrastructure-wide context. Using Datadog you can correlate Ceph’s throughput and latency with traffic and the operations of any other system or application components in your infrastructure, helping you pinpoint performance bottlenecks and provision resources appropriately.

This blog post covers some of the most important aspects of monitoring and investigating Ceph’s operations and performance with Datadog.

Datadog’s Ceph template screenboard
Monitor Ceph - Datadog’s Ceph template screenboard for monitoring overview
Datadog’s Ceph template screenboard

Monitor Ceph performance at any level of granularity

Cluster-wide metrics at a glance

A Ceph cluster often runs on tens or even hundreds of nodes. When operating high-scale, distributed systems like this, you usually care more about the cluster-wide system performance than a particular node’s downtime.

Datadog gathers cluster-level metrics such as capacity usage, throughput, and more at a glance. If you use more than one cluster, you can easily zero in on one particular cluster by changing the cluster template variable at the top-left corner of the screenboard. This instant filtering is possible due to metric tagging, the importance of which we covered deeply in another post.

Zoom into the pool

Ceph breaks the cluster into units called pools, to better organize millions or billions of objects stored in the distributed system. Different pools are typically used for different purposes, and may be replicated and distributed using different strategies. Some pools are used for storing user authentication information, for example, thus geographically distributed replication across SSD devices may be required for extremely fast and reliable retrieval. Other pools in the same cluster might be used for archived logs, and that data may need just one replica and can be stored on slower but economical devices like magnetic disks.

Instead of simply aggregating metrics from those divergent uses together, Datadog provides read/write throughput, capacity usage, and number of objects stored with pool-level resolution. You can leverage this feature to gather details about each pool’s usage pattern and quickly spot overloaded pools to tune the application for better performance.

Know each node’s status

In a Ceph cluster, Monitor nodes are responsible for maintaining the cluster map, and Object Storage Daemons (OSDs) reside on data nodes and provide storage services for Ceph clients over the network. Datadog’s built-in, customizable Ceph dashboard displays the aggregated status of each node type for a high-level view of cluster health.

Avoid Monitor-quorum deadlock

Your production Ceph cluster is likely deployed with more than one Monitor node for increased reliability and availability. For a Ceph cluster to function properly, at least 51 percent of all available Monitors need to be reachable. If the number falls below this threshold, Ceph will end up in a deadlock.

Datadog can alert you if you lose Monitors and are in danger of hitting this threshold so that you can either deploy more Monitor nodes or repair and bring the existing Monitor nodes back up again before your cluster suffers a service outage.

Monitor Ceph - OSD node status with Datadog

Constantly “down but in” OSDs

Usually, when an OSD is down, it will be automatically marked out by Ceph, and replication is triggered to copy its data to other nodes. This process may take up to 10 minutes. Under some conditions, however, OSDs may be down, but remain in the cluster. It is very likely that Ceph is experiencing some difficulty in recovering from the node loss and manual intervention might be needed. If this happens Datadog can inform you quickly so you can investigate and save the cluster from experiencing degraded performance.

Full and near-full OSDs

To ensure objects are evenly distributed across OSDs, Ceph recommends a minimum number of placement groups. If the balance between the number of OSDs, placement groups, and replicas is broken, either by a sudden change in the available nodes or misconfigured CRUSH map rules, some OSDs may be filled with objects before the overall cluster reaches its full capacity.

Datadog helps you to monitor the storage capacity at every level, including at the node level. So if any of the OSDs are full or nearly full, you can see that immediately on the dashboard.

Watch your apply and commit latency

Monitor Ceph commit and apply latency with Datadog

If caching is enabled, Ceph doesn’t write all updates directly to disk. This speeds up the transaction, but introduces an extra step before data is permanently stored. A transaction is usually marked complete after the operation is successfully committed to the journal—the time required for this step is reported as “commit latency”. The time required to flush the operation to disk is reported as “apply latency.” It is important to keep an eye on both latencies to ensure that your system is performing quickly enough and is sufficiently robust to data loss, should a node go down before data is written to disk.

Getting started

If you are already using Datadog, you can start monitoring Ceph in minutes. If you are not a Datadog user yet, sign up for a full-featured trial account here to get started.