Observability in the AI age: Datadog's approach

2025-12-02 · via Datadog | The Monitor blog

Yanbing Li

Ten years ago, Datadog was a single-product company focused on breaking down the silos between dev and ops. As the shift towards the cloud accelerated and organizations transitioned to the new DevOps model, we set out to develop an observability platform that would enable these teams to safely scale faster and answer the essential questions about their services: are they available, secure, compliant, performant, and cost-efficient?

Now, the rapid adoption of AI has triggered another shift in how we think about creating and running software products. Those core observability questions remain pertinent, while unpredictable AI tooling that’s yet to be tested at scale introduces new kinds of risk and monitoring challenges. At the same time, LLMs present new ways to harness observability data and resolve issues faster than ever before, in some cases by taking autonomous actions.

Successful organizations must navigate the complexity introduced by the AI shift and innovate at an unprecedented pace to meet the market’s heightened demands. By staying ahead of this curve, Datadog can facilitate the industry’s growth. We’re looking at AI in two ways to serve our customers’ needs: introducing AI-powered features throughout our platform, and building tailored monitoring tools to help organizations observe and improve their own AI systems. By the time the next wave of AI-native companies hits enterprise scale, we need to have mature monitoring tools for every level of the AI stack. And to help all organizations accelerate and improve their monitoring in the face of the new challenges we’re seeing, we’re investing heavily in R&D to advance agentic AI innovations that will bring the industry closer to fully autonomous remediation.

Agentic and embedded AI throughout our platform

Datadog has grown into a comprehensive solution that provides visibility at every layer of your stack: from networking, compute, and storage to platform scaffolding, application logic, and UX. Now, we’re pushing forward to help DevOps engineers meet the demands of the AI age. By using the unique expertise we’ve gathered from ingesting billions of data points a day from customers with highly varied use cases, architecture patterns, and maturity levels, Datadog is creating agentic AI to embed throughout our platform. Our large corpus of data about how companies handle issues in the real world gives us a unique edge as we train and fine-tune these agents.

Our AI agents—Bits AI SRE, Bits AI Dev Agent, and Bits AI Security Analyst—read telemetry from across your environment to power autonomous actions that help you resolve issues faster. Bits AI agents work like teammates, investigating alerts and security signals with correlated telemetry, coordinating incidents, scanning code, and suggesting code fixes and automations to resolve issues they discover using production context provided by Datadog.

At the same time, we are also working on ways to help engineers continue to incorporate observability into their AI workflows, including Model Context Protocol (MCP) and coding agents. Datadog MCP Server enables engineers who work with Codex, Claude Code, Cursor, and other MCP-compatible AI agents to harness Datadog telemetry during development. We’re seeing large organizations use Datadog MCP Server to build automated code change proposals, group and analyze debugging logs, and analyze code in context with telemetry to speed up incident investigations.

Finally, we’re working to close the gap between foundation models that handle text, images, and audiovisual data and those that handle the structured data modalities, including timeseries metrics, needed for predictive monitoring. Toto, Datadog’s state-of-the-art timeseries foundation model, is aimed at improving the AI, ML, anomaly detection, and forecasting algorithms already in use within the Datadog platform and powering products such as Watchdog and Bits AI.
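To make "anomaly detection on timeseries metrics" concrete, here is a minimal sketch of a classical baseline: a trailing-window z-score check. This is not Toto, Watchdog, or any Datadog algorithm. The function name `rolling_zscore_anomalies`, the `window` and `threshold` parameters, and the sample latency values are all assumptions chosen for illustration. A foundation model would aim to replace hand-tuned rules like this one with learned forecasts.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate from the trailing-window mean
    by more than `threshold` sample standard deviations.

    Hypothetical helper for illustration only; not a Datadog API.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # the `window` points before index i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative request latencies in milliseconds (made-up data).
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(rolling_zscore_anomalies(latency_ms))  # → [7]
```

Only the spike at index 7 is flagged. The points that follow it are not, because the spike itself inflates the trailing window's standard deviation. That is one known weakness of simple statistical baselines.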

Delivering AI observability and security

AI applications now run in dynamic, distributed systems where agents and models constantly change: learning, drifting, and evolving. Teams need granular telemetry at production scale to monitor, secure, and refine their systems. Building on the foundation of APM, logs, infrastructure, and security, Datadog is launching products that provide critical visibility across the AI stack and support development, staging, and production environments.

Whether your team’s focus is as low-level as GPU optimization or as high up the stack as sentiment evaluation for a chatbot, Datadog’s end-to-end observability suite covers the bases while enabling your organization to form a bird’s-eye view of your entire system. Datadog LLM Observability, LLM Experiments, GPU Monitoring, Sensitive Data Scanner, and AI Guard provide tailored solutions for these unique observability challenges, covering experimental evaluation and fine-tuning as well as performance, security, cost, and compliance in production. We’re targeting the most critical concerns for organizations running agentic AI at scale:

Troubleshooting complex agentic workflows in production, which involves evaluating models and prompts as well as detecting application errors
Rapidly iterating prompts and application logic with experimentation
Optimizing infrastructure performance and cost
Implementing multi-layered, failsafe protection against jailbreaks, tool misuse, data exfiltration, and other key AI security threats

Bringing AI-native organizations into the Datadog platform helps us form a deeper understanding of how AI applications are being built and deployed today. This cohort includes not only dozens of high-velocity startups, but also 8 of the 10 largest players in the AI space—keeping us on the cutting edge of both scale and go-to-market speed. Our industry needs to build a flywheel that enables AI-native organizations to continue accelerating the pace of innovation while bringing this new value to production more reliably, more securely, and at an increasingly high scale. AI innovation at scale requires observability tools that can detect issues in your environment, choose the appropriate response, and execute remediations with true autonomy. Datadog has set our sights on this horizon as we mature our AI observability suite from a source of passive insight and context to an active collaborator that can help teams handle complex incidents—and eventually, perhaps, a closed system that runs your operations on its own.

Datadog helps solve the complexities of the AI age

As we move forward into the era of AI adoption, it remains uncertain just how much of our digital world will soon be autonomously managed. Datadog is growing our platform to meet this paradigm shift, working toward a future where proactive operations and security management help keep systems running smoothly and safely. Building these innovations takes significant investment, and that’s why we invest 29% of our revenue into R&D initiatives like our AI Research Lab (which produced Toto). This way, Datadog can deliver on the promise offered by the unparalleled data, context, and expertise we’ve garnered over our past decade of leading the observability sector: becoming the primary solution for AI-native organizations to rapidly innovate and scale up while progressing the entire software industry toward a zero-incident future.
