Aligning SRE and security for better incident response
2025-09-19 · via Datadog | The Monitor blog

In this series, we looked at why we combined our SRE and security teams into one cohesive group, and how we made that happen. With this combined approach, we set out to build our internal platform and customer-facing products with a security-first mindset, while still drawing upon the deep expertise of our existing SRE practices. Combining the teams improved the way we build tools for both our engineers and customers and strengthened our ability to mitigate risks. This post focuses on how that approach improved our incident response and how you can apply these lessons to your own organization.

How integrating our teams improved incident response

As we worked to integrate security with SRE, we identified common patterns in our day-to-day operations that showed why alignment would benefit both sides. These included duplicate tooling and processes, frequent incidents that spanned both reliability and security domains, and notable gaps in coverage for audit logging and change control. Before we could improve our incident response, we had to first address some of these central pain points for both groups, such as knowledge siloes and gaps in our monitoring capabilities.

Addressing these issues resulted in platform-wide configuration baselines, comprehensive team documentation, shared runbooks and dashboards, and cross-functional exercises to strengthen incident response. These assets helped us foster a blameless, collaborative approach to managing incidents by giving our engineers the tools they needed to work together efficiently.

Platform-wide baselines as a foundation for response decisions

Configuration baselines, which establish the standards for how we should set up our systems, help us shift controls left to prevent incidents and also guide our response when something goes wrong. Our baselines treat compliance requirements as the bare minimum and then extend beyond them, so that we can focus on the most impactful platform issues and stay aligned with our golden paths.

By establishing clear, platform-wide standards for security and reliability, our teams have a shared reference point that makes it easier to identify whether an issue is urgent enough to be declared an incident or should simply be tracked as a bug or vulnerability instead. This consistency reduces hesitation and ensures that problems aren’t ignored.

We maintain a list of cloud-agnostic baselines, which are routinely reviewed and updated as needed. Our methodology for creating these standards involves evaluating individual rules against the following questions:

  • Can we reasonably enforce this rule with minimal triaging and custom logic?
  • Is the risk that this rule addresses sufficient to warrant expedited attention?
  • Are there legitimate reasons why this rule might frequently generate false positives?
  • Is this rule a standard best practice or a security concern?

Every rule is assigned a severity level, which helps us prioritize findings and ties directly into our criteria for declaring an incident versus creating a bug or vulnerability report. Not every misconfiguration or vulnerability needs an early morning page, so we wanted these baselines to be well-defined up front so engineers aren’t left guessing.

In practice, these baselines function as both preventive guardrails and decision-making tools during incidents. For example, if a baseline requires that all production databases be encrypted, we can immediately classify a discovered unencrypted volume as high severity. On the other hand, a misconfiguration that has existed quietly for two months may not trigger an urgent incident, but it should still be monitored and assessed.
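To make the link between severity and escalation concrete, here is a minimal Python sketch of a triage function. It is not our actual tooling: the `Finding` fields, the rule name, the severity scale, and the three outcomes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale assigned to each baseline rule."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Finding:
    rule_id: str        # hypothetical baseline rule identifier
    severity: Severity  # severity assigned to the rule up front
    production: bool    # whether the affected resource is in production


def triage(finding: Finding) -> str:
    """Map a baseline finding to 'incident', 'ticket', or 'backlog'."""
    # High-severity findings in production are declared as incidents.
    if finding.production and finding.severity >= Severity.HIGH:
        return "incident"
    # Everything else of medium severity or above is tracked as a bug
    # or vulnerability and assessed without paging anyone.
    if finding.severity >= Severity.MEDIUM:
        return "ticket"
    return "backlog"


unencrypted_db = Finding("db-encryption-at-rest", Severity.HIGH, production=True)
print(triage(unencrypted_db))
```

Because the severity is decided when the rule is written, the function needs no judgment call at response time, which is the property the baselines are meant to provide.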

A compliance dashboard showing essential cloud security controls.

This alignment between baselines and escalation paths reduces hesitation in addressing an issue. With it, engineers can confidently declare an incident because they have the data they need to do so. It also helps ensure that we don’t ignore important problems simply because they don’t fit a narrow definition of an incident.

Over time, incidents surface gaps in our baseline configurations as well. For example, if our investigation during a security incident reveals missing audit logs, we will adjust our requirements for logging configurations, such as retention periods and formats, where necessary. We also continually update our threat detections based on the cause of a security incident, such as a threat actor attempting to compromise accounts. These iterative updates ensure our baselines remain effective, and they create a consistent system that helps us mitigate configuration drift, respond efficiently to high-risk issues, and strengthen both the security and reliability of our platform.

Well-defined expectations and guidelines for managing incidents

Merging our SRE and security groups required a shared understanding of expectations during incidents, so we unified guidelines and tooling for both security- and reliability-related events. These steps ensure that security incidents follow the same patterns and timelines as any other operational incident, and that familiarity makes incidents less intimidating to manage.

To set these expectations, we considered the following questions:

  • Who do we bring in during an incident?
  • What should their response time be?
  • What steps are they expected to take for each incident?

These questions let us define role-specific guidelines so that everyone working on an incident is confident in their responsibilities and support. For example, all relevant security teams go through our standard incident management training, and gaps in our response protocols are remediated with approval from security leadership. Within this shared framework, we introduced security leads, a new role that not only drives security incidents but also provides relevant context and direction during other types of critical events.

As part of our standard steps for declaring an incident and establishing a current state, we also conduct a security-focused risk assessment. This is a structured set of questions that a security lead answers when called to investigate a reported security-related issue:

  • If a threat actor is involved, what is their objective? How confident are you in this assessment?
  • Can you determine which stage of the attack the threat actor is in, such as initial access?
  • What are the most likely attack paths for a threat actor to achieve their objectives?
  • How likely is it that a threat actor without internal knowledge could identify these paths?

We encourage security leads to use words of estimative probability (WEPs) to express how likely a specific outcome is, such as a threat actor identifying attack paths within our systems. Because these terms map to agreed probability ranges, they give the team a consistent way to scope and prioritize risk.
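As an illustration of how WEPs tie language to numbers, the sketch below maps a probability estimate to a term. The labels and band boundaries are assumptions made up for this example; a real team would agree on its own vocabulary and ranges.

```python
# Illustrative bands only: (label, inclusive lower bound, exclusive upper bound).
WEP_BANDS = [
    ("almost certainly not", 0.00, 0.10),
    ("unlikely", 0.10, 0.40),
    ("roughly even odds", 0.40, 0.60),
    ("likely", 0.60, 0.90),
    ("almost certain", 0.90, 1.00),
]


def to_wep(probability: float) -> str:
    """Return the estimative term whose band contains the probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for label, low, high in WEP_BANDS:
        if low <= probability < high:
            return label
    # Only probability == 1.0 falls through the half-open bands.
    return WEP_BANDS[-1][0]


print(to_wep(0.75))
```

Writing the bands down in one place means that two responders who both say "likely" are describing the same range.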

Shared runbooks and dashboards

In addition to comprehensive team and process documentation, we also unified our incident runbooks and monitoring dashboards. Our organization maintains a library of detailed runbooks that we use as part of the incident management process. Security runbooks are developed by our Security Incident Response team (SIRT) in close partnership with other relevant system owners, which ensures technical accuracy. This collaboration also serves as a planning exercise by allowing us to clarify how teams will work together during an incident, what information each role will need, and when they’ll need it. Having greater visibility across both reliability and security domains enables us to easily follow a predictable set of remediation steps and resolve issues faster.

Our shared runbooks include high-level graphs that help teams quickly scope the incident to a specific timeframe. They also contain links to relevant logs and traces, along with guidance on what to look for during the review process. For example, in the case of possible DDoS activity, our runbooks include dedicated queries for the following signals, which help teams evaluate how likely it is that an attack is genuinely under way:

  • Spikes in requests to specific routes, IP addresses, or ASNs
  • Surges in authentication or authorization attempts (successful or not)
  • Unusual increases in 2xx responses, which could signal an HTTP flood
  • Spikes in 4xx responses, which may indicate a credential stuffing attack attempting to blend in with DDoS traffic
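The response-code checks in this list can be sketched in a few lines of Python. This is a simplified stand-in for a runbook query, not a Datadog query: the function names, the `ratio` and `min_count` thresholds, and the assumption that both windows cover the same duration are all hypothetical.

```python
from collections import Counter


def status_class_counts(status_codes):
    """Count responses per status class, for example '2xx' or '4xx'."""
    return Counter(f"{code // 100}xx" for code in status_codes)


def spiking_classes(baseline, current, ratio=3.0, min_count=100):
    """Return status classes whose volume grew at least `ratio` times.

    `baseline` and `current` are iterables of HTTP status codes taken from
    two windows of equal length. `min_count` filters out noisy low-volume
    classes.
    """
    base = status_class_counts(baseline)
    now = status_class_counts(current)
    spikes = []
    for status_class, count in now.items():
        # Treat a class that is absent from the baseline as a single
        # response so the division is always defined.
        growth = count / max(base[status_class], 1)
        if count >= min_count and growth >= ratio:
            spikes.append(status_class)
    return sorted(spikes)


baseline_window = [200] * 100 + [401] * 10
current_window = [200] * 120 + [401] * 400
print(spiking_classes(baseline_window, current_window))
```

In this example only the 4xx class is flagged, which matches the pattern the runbook associates with credential stuffing rather than a plain HTTP flood.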

If an incident surfaces a gap in coverage for how to investigate or resolve an issue, we encourage engineers to create the necessary documentation and update runbooks accordingly as part of the postmortem process. Our cross-functional dashboards provide a high-level overview of the critical data we use to investigate incidents so we can quickly connect reliability context to security issues. For example, if we notice an unusual spike in failed login attempts, we can pivot directly to related security signals to investigate further.

Pivoting to a security signal from a performance dashboard.

Each signal includes additional context, such as associated IP addresses or geolocation, that helps us quickly determine whether the activity stems from a platform misconfiguration or a genuine attack, such as a distributed credential stuffing campaign.
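The kind of reasoning that this context supports can be sketched as a simple heuristic over failed login events. The `FailedLogin` record, the thresholds, and the result labels are assumptions for illustration only; a real signal carries far more context, and a responder makes the final call.

```python
from typing import Iterable, NamedTuple


class FailedLogin(NamedTuple):
    ip: str        # source IP address attached to the event
    username: str  # account the login attempt targeted


def classify_failed_logins(events: Iterable[FailedLogin],
                           min_ips: int = 50,
                           min_users: int = 50) -> str:
    """Give a rough first read on a spike of failed logins."""
    events = list(events)
    if not events:
        return "no data"
    ips = {event.ip for event in events}
    users = {event.username for event in events}
    # Many sources hitting many accounts suggests a distributed campaign.
    if len(ips) >= min_ips and len(users) >= min_users:
        return "possible distributed credential stuffing"
    # One or two sources retrying the same account looks more like a
    # client with stale credentials than an attack.
    if len(ips) <= 2 and len(users) <= 2:
        return "possible misconfigured client"
    return "needs manual review"


retrying_job = [FailedLogin("10.0.0.5", "svc-batch")] * 200
print(classify_failed_logins(retrying_job))
```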

Datadog App and API Protection signal

By building dashboards that SRE and security teams use together, every responder works from the same context, which reduces miscommunication and accelerates the decision-making process.

Cross-functional exercises for incident response

Building resilience involves more than just improving the way we remediate issues. We also wanted to find opportunities to practice incident response before an issue occurs. We regularly conduct exercises that simulate both security and reliability incidents so we can refine our incident management processes and documentation.

For security, SIRT participates in purple team exercises alongside our threat detection group. These drills help refine detection logic, improve runbooks, and give engineers the muscle memory to handle incidents as if they were real. Some drills are live simulations, while others are theoretical or whiteboard-based tabletop scenarios that let us explore edge cases without affecting our production environments.

To test platform reliability, we take a similar approach through chaos engineering experiments and both small- and large-scale gamedays. These events deliberately introduce controlled failures into our systems, giving teams the opportunity to diagnose and remediate issues under realistic conditions.
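A controlled failure can be as small as a wrapper that makes a dependency call fail some fraction of the time. The sketch below is a generic illustration, not the tooling we use for gamedays; `with_fault_injection`, `InjectedFault`, and `fetch_profile` are hypothetical names.

```python
import random


class InjectedFault(RuntimeError):
    """Raised in place of the real call during an experiment."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap `func` so that it fails with probability `failure_rate`."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0), so a rate of 0.0
        # never injects a fault and a rate of 1.0 always does.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    """Stand-in for a call to a downstream service."""
    return {"id": user_id}


flaky_fetch = with_fault_injection(fetch_profile, failure_rate=0.2)
```

Keeping the failure rate as an explicit parameter makes the blast radius of the experiment easy to reason about and easy to turn off.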

Our goals are the same regardless of the type of exercise we conduct: Identify weaknesses early, improve processes and tooling in a safe environment, and ensure that our team can respond quickly when a real incident happens.

An improved approach to incident response

In this post, we looked at the outcomes of combining our SRE and security groups as well as how that approach significantly improved the way we manage incidents. While every organization’s structure is different, the principles of shared visibility and platform enablement can apply broadly. If you’re exploring similar changes for your reliability and security teams, start by identifying shared pain points and aligning on team goals. From there, incremental changes in process visibility and ownership will allow you to build the necessary tools for collaborative incident response and more resilient, secure applications.

The workflows we described in this post, such as creating shared runbooks and dashboards, are possible with Datadog. Check out our documentation if you’d like to learn more about our incident management and security capabilities. If you’re new to Datadog, you can sign up for a free 14-day trial.