
Create and monitor LLM experiments with Datadog
2025-06-10 · via Datadog | The Monitor blog
Tom Sobolik
Shri Subramanian
Charles Jacquet

To efficiently optimize your LLM application before pushing to production, you need a comprehensive testing and evaluation framework. By running experiments, you can optimize prompts, fine-tune temperature and other key parameters, test complex agent architectures, and understand how your application may respond to atypical, complex, or adversarial inputs. However, it can be difficult to manage your experiment runs and aggregate the results for meaningful analysis. Additionally, without effective governance of your experiment datasets, it’s hard to be confident that your applications are being tested appropriately.

Datadog LLM Observability’s new Experiments feature enables you to quickly test out new prompts and models in the Playground, then curate and manage datasets to run experiments and analyze evaluations and telemetry across multiple runs. With the Experiments SDK, you can build, run, and automatically trace experiments with full span-level visibility into every LLM call, retrieval, tool, or agent step.

In this post, we’ll show how Experiments supports your entire LLM application development lifecycle—starting from quick Playground iterations on production traces, through dataset versioning and structured evaluations, to model comparison and deep output analysis—all from within LLM Observability’s unified environment.

Start with Playground to validate ideas quickly

When developing LLM applications, issues often surface first in traces or flagged evaluations—maybe a prompt produces an incomplete response, or a trace reveals unexpected latency. Playground enables you to immediately investigate these kinds of issues before setting up a full experiment. With a single click, you can import a problematic trace and replay it with alternative prompts, providers, or parameters. Because Playground uses Datadog’s trace spans (LLM spans, tool spans, retrieval spans, and agent spans), you can see exactly how each adjustment impacts latency, cost, and output quality for the specific workflow execution that you first observed an issue with.

For example, let’s say you’re building an LLM application to summarize news articles. You’ve set up online evaluations that flag problematic traces, and one trace shows the model truncating summaries mid-sentence. You can open that trace directly in Playground and test specific adjustments, such as:

  • Lowering the temperature to reduce randomness and improve consistency
  • Increasing the max_tokens parameter to prevent truncation and allow longer outputs
  • Adding or modifying stop sequences to enforce complete, well-formed summaries

By iterating in Playground, you can confirm whether the root cause was prompt design, configuration limits, or model behavior. Once you identify the best-performing configuration, you can export it directly into your codebase or promote it into a dataset for further experimentation. This workflow ensures that improvements validated interactively in Playground are carried forward into structured, repeatable experiments.
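To make the truncation example concrete, here is a minimal offline sketch of how you might check replayed outputs for summaries that stop mid-sentence. It is plain Python, not a Datadog API: the `looks_truncated` helper, the punctuation heuristic, and the sample outputs are all assumptions made for illustration.

```python
# Hypothetical offline check, not a Datadog API: flag summaries that seem to
# stop mid-sentence, as in the truncation example above.
SENTENCE_ENDINGS = (".", "!", "?", '."', '!"', '?"', ".)")


def looks_truncated(summary: str) -> bool:
    """Return True if the text is empty or lacks sentence-final punctuation."""
    text = summary.strip()
    return not text or not text.endswith(SENTENCE_ENDINGS)


# Outputs you might collect after replaying one trace under two settings.
# The labels and texts are made up for illustration.
replays = {
    "max_tokens=64": "The city council voted on Tuesday to approve the",
    "max_tokens=256": "The city council voted on Tuesday to approve the budget.",
}

# Labels of the configurations whose output still looks cut off.
still_truncated = [label for label, text in replays.items() if looks_truncated(text)]
```

A heuristic like this is deliberately crude: it only looks at the final characters, so it would miss a summary that ends cleanly but omits half the article.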

Build datasets for experimentation

Your experiments are only as good as the data you’re feeding them. It’s paramount to have clear visibility into the test data used for your experiments so you can ensure that this data is high-quality and annotated correctly, avoiding false evaluations. Creating test datasets can be tedious and error-prone. For instance, if you want to sample production data for your test datasets, you might find yourself meticulously copying and pasting LLM inputs and outputs from logs or traces into spreadsheets.

Datadog LLM Experiments solves this by letting you import data from your production traces in LLM Observability with a single click, or create test datasets programmatically by using the LLM Experiments SDK. You can pull and push datasets to Datadog to manage them in the UI and share them across your team. You can also seed datasets with cases validated in Playground, so that debugging results are captured and re-tested systematically. Datasets can contain inputs, outputs, evaluation metadata, and span context, and they are version-controlled, so you can pin experiments to specific dataset versions.

The Datasets view enables you to create, view, and manage your experiments’ datasets from the Datadog web UI. The dataset page shows all records, including each record’s input, output, and metadata. You can also manually add new records to the dataset from this view. For example, the following screenshot shows records from a dataset containing question-answer pairs for a personal finance application.

Managing dataset records using the Datasets editor

If you spot an issue with a record, you can click into it and edit any of its fields directly within the web UI. Collaborators can clone your dataset and work on their own copies separately, so your changes stay intact. All previous versions of the dataset are stored, so you can troubleshoot problems with an experiment that used an older version of the dataset.
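The idea of pinning an experiment to a dataset version can be sketched in a few lines of plain Python. This is not the LLM Experiments SDK: `Record`, `VersionedDataset`, and `commit` are hypothetical names, and the sketch assumes that each saved version is an immutable snapshot identified by an integer.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """One dataset row: an input, its expected output, and free-form metadata."""
    input: str
    expected_output: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class VersionedDataset:
    """Hypothetical in-memory dataset that keeps every committed version."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Version 0 is the empty dataset; each commit appends a snapshot.
        self._versions: List[Tuple[Record, ...]] = [()]

    @property
    def latest(self) -> int:
        return len(self._versions) - 1

    def records(self, version: Optional[int] = None) -> Tuple[Record, ...]:
        """Return the records of a pinned version, or of the latest one."""
        return self._versions[self.latest if version is None else version]

    def commit(self, records: List[Record]) -> int:
        """Store a new immutable snapshot and return its version number."""
        self._versions.append(tuple(records))
        return self.latest


dataset = VersionedDataset("personal-finance-qa")
v1 = dataset.commit([Record("What is an index fund?", "A fund that tracks a market index.")])
v2 = dataset.commit(list(dataset.records(v1)) + [Record("What is APR?", "Annual percentage rate.")])

# An experiment pinned to v1 keeps seeing one record even after v2 exists.
pinned = dataset.records(v1)
```

Storing snapshots as tuples means an older version cannot be modified by accident, which is what makes it safe to re-run or debug an experiment against it later.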

Monitor experiment runs and form insights

Datadog LLM Experiments offers a complete set of tools for creating and monitoring experiments and test datasets. The Experiments SDK lets you write and run experiments that are automatically traced by Datadog and annotated with test dataset records for monitoring with LLM Observability. Every experiment run is automatically instrumented to capture span kinds (LLM, retrieval, embedding, agent, and task spans), giving engineers full visibility into how model calls, tools, and control logic are executed.

Experiments consist of tasks and evaluators. Tasks define the business logic of experiments (determining how your application will run on the provided data), while evaluators are used to score and compare all the outputs produced by tasks. Tasks can be as simple as a single LLM call, or as complex as a multi-agent workflow—Datadog ingests and tracks all subtasks for monitoring via traces. For instance, let’s say you’re experimenting with a RAG application. You could write a task that contains an LLM prompt and uses the RAG pipeline to generate additional context for the system prompt. You could then create evaluators to compute faithfulness, contextual precision, and contextual recall scores for the application’s RAG retriever. Finally, you can run the experiment, which will evaluate the app’s output based on the three scores you have defined.
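The split between tasks and evaluators can be illustrated with a small, self-contained harness. The sketch below is plain Python and does not use the Experiments SDK: `run_experiment`, `canned_task`, and both evaluators are hypothetical. The evaluators are simple stand-ins; scores such as faithfulness or contextual precision would slot into the same `(input, output, expected_output) -> score` shape.

```python
from statistics import mean
from typing import Callable, Dict, List

# An evaluator receives (input, output, expected_output) and returns a score.
Evaluator = Callable[[str, str, str], float]


def exact_match(_input: str, output: str, expected: str) -> float:
    """1.0 when the output equals the expected answer, ignoring case and padding."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def token_recall(_input: str, output: str, expected: str) -> float:
    """Fraction of expected tokens that also appear in the output."""
    expected_tokens = set(expected.lower().split())
    if not expected_tokens:
        return 1.0
    return len(expected_tokens & set(output.lower().split())) / len(expected_tokens)


def run_experiment(
    records: List[Dict[str, str]],
    task: Callable[[str], str],
    evaluators: Dict[str, Evaluator],
) -> List[Dict]:
    """Run the task on every record and score each output with every evaluator."""
    rows = []
    for record in records:
        output = task(record["input"])
        scores = {
            name: evaluate(record["input"], output, record["expected_output"])
            for name, evaluate in evaluators.items()
        }
        rows.append({"input": record["input"], "output": output, "scores": scores})
    return rows


def canned_task(question: str) -> str:
    """Stand-in for an LLM call or agent workflow; returns canned answers."""
    answers = {"What is the capital of France?": "Paris"}
    return answers.get(question, "unknown")


records = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Japan?", "expected_output": "Tokyo"},
]
rows = run_experiment(
    records, canned_task, {"exact_match": exact_match, "token_recall": token_recall}
)
average_exact_match = mean(row["scores"]["exact_match"] for row in rows)
```

Keeping the task and the evaluators as separate callables is the useful part of the pattern: you can swap in a different prompt, model, or pipeline as the task while leaving the scoring logic untouched, so runs stay comparable.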

As soon as you run an experiment, traces of the run and details about the experiment become automatically available in Experiments. You can use Experiments to continually monitor evaluations and characterize application performance, comparing this data across multiple experiments, as well as troubleshoot issues with your experiments using traces.

Analyze experiment results to find optimizations

The Experiment Details page gathers telemetry about all of an experiment’s runs, including duration, errors, and evaluation scores. To understand how an experiment performed across all runs, you can view averages of its evaluations and performance metrics in the Summary section. Then, you can use the Evaluation Distribution to further home in on a subset of experiment runs that had evaluations or metrics within a problematic threshold. This can help you find opportunities for optimizations.

Using the Evaluation Distribution to filter experiment runs by evaluation scores

For example, the preceding screenshot shows the Evaluation Distribution filters that highlight all experiment records with a high token count (indicating verbose and potentially costly inputs and outputs) and a suboptimal correctness_query score. By applying these filters, you can view a list of interesting records, examine the records’ inputs and outputs, and then drill into traces to investigate potential root causes of the highlighted performance issues. Additionally, teams can filter records by operational metrics such as latency, cost, or token count, alongside evaluation scores, to pinpoint high-cost or low-accuracy cases. By inspecting traces, you can see whether issues stemmed from prompt content, retriever performance, or external tool calls.
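If you export experiment records for offline analysis, the same kind of threshold filtering can be approximated in a few lines. This is a sketch under assumptions: the row layout (`total_tokens`, a `scores` mapping) and the `flag_records` helper are hypothetical, and the numbers are made up for illustration.

```python
from typing import Dict, List


def flag_records(
    rows: List[Dict],
    min_tokens: int,
    max_score: float,
    score_name: str = "correctness_query",
) -> List[Dict]:
    """Keep rows that are both token-heavy and low-scoring on one evaluation."""
    return [
        row
        for row in rows
        if row["total_tokens"] >= min_tokens and row["scores"][score_name] <= max_score
    ]


# Illustrative rows; the field names are assumptions, not an export format.
rows = [
    {"id": "r1", "total_tokens": 4200, "scores": {"correctness_query": 0.3}},
    {"id": "r2", "total_tokens": 900, "scores": {"correctness_query": 0.2}},
    {"id": "r3", "total_tokens": 5100, "scores": {"correctness_query": 0.9}},
]

flagged_ids = [row["id"] for row in flag_records(rows, min_tokens=4000, max_score=0.5)]
```

Combining an operational metric with an evaluation score narrows the list to records that are both expensive and wrong, which are usually the ones worth opening a trace for first.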

You can inspect each of these records’ traces in the run side panel. This includes tool and task executions and each individual LLM call used to produce the final output. Continuing our previous example, the following screenshot shows a prompt execution trace for a run with an unusually high token count and low correctness_query score.

Inspecting a record's experiment run traces

By using the trace tab to view each action taken by the application in an experiment run, you can more easily troubleshoot and interpret the application’s behavior and find ways to improve the output. In this example, we’re looking at the system prompt to see whether it included the correct context.

Compare models to find your application’s best fit

Datadog LLM Experiments enables you to aggregate and track experiments across multiple models so you can determine which one best suits your application’s tasks. The main Experiments view lets you filter and sort your experiments to quickly surface instances of poor evaluation scores, high duration, and other issues. As with the Experiment Details page, you can also use the Distribution Filter to home in on a subset of experiments with evaluations and performance metrics within certain thresholds.

Filtering experiments by evaluation scores

You can then select a group of experiments and compare their results to evaluate application performance across different prompt versions, models, application code versions, and more. For example, let’s say we have an experiment to test the conversational performance of a personal finance AI agent. We’ve run two experiments comparing Claude Sonnet 4 and GPT-4.1. The following screenshot shows a comparison of these runs.

Comparing experiment results for two different models

We can quickly glean that the GPT-4.1 agent outperformed the Sonnet 4 agent on the accuracy_tool and correctness_query evaluations. We can also use this view to investigate inputs and outputs for individual experiment runs and spot outliers. For instance, we might find that the Claude application spectacularly outperformed GPT-4.1 on questions that invoke a specific task. We can then investigate further to either optimize this task for Claude or opt to accept the performance tradeoff to run the task with GPT.

By running side-by-side model comparisons in Experiments, you can make informed tradeoffs across accuracy, style, latency, and cost, rather than relying on anecdotal testing.
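A side-by-side comparison ultimately comes down to aggregating each evaluation per experiment and looking at the differences. The following sketch shows that arithmetic in plain Python. The helper names and row layout are hypothetical, the evaluation names are borrowed from the example above, and the scores are invented for illustration rather than measured results.

```python
from statistics import mean
from typing import Dict, List


def mean_scores(rows: List[Dict]) -> Dict[str, float]:
    """Average each evaluation over all rows of one experiment."""
    names = rows[0]["scores"].keys()
    return {name: mean(row["scores"][name] for row in rows) for name in names}


def score_deltas(baseline_rows: List[Dict], candidate_rows: List[Dict]) -> Dict[str, float]:
    """Per-evaluation difference: candidate mean minus baseline mean."""
    baseline = mean_scores(baseline_rows)
    candidate = mean_scores(candidate_rows)
    return {name: candidate[name] - baseline[name] for name in baseline}


# Invented scores for two runs over the same two dataset records.
model_a_rows = [
    {"scores": {"accuracy_tool": 1.0, "correctness_query": 0.5}},
    {"scores": {"accuracy_tool": 0.5, "correctness_query": 0.5}},
]
model_b_rows = [
    {"scores": {"accuracy_tool": 1.0, "correctness_query": 1.0}},
    {"scores": {"accuracy_tool": 1.0, "correctness_query": 0.5}},
]

deltas = score_deltas(model_a_rows, model_b_rows)
```

Averages hide outliers, so a positive delta on every evaluation does not rule out individual records where the baseline model did much better. That is why inspecting per-record inputs and outputs remains worthwhile.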

Investigate model outputs to evaluate task performance

LLM Experiments gives you granular visibility into the output of every LLM call in your experiments. When you want to go beyond evaluation scores and performance metrics to investigate how your application performs at a given task, you can examine generated outputs within experiments’ dataset records in the Compare page. This helps you understand LLM performance even in cases where it’s difficult to create effective evaluation metrics.

For instance, if you return to the news summarization example, you might see that both models are producing summaries containing the same key information, but using different phrasing and writing styles. By comparing outputs in the experiments’ dataset records, you can manually determine which model produces more natural-sounding, clear, and clean copy.

Comparing outputs for two different models for subjective evaluation

Engineers can also drill down into span-level traces for those outputs to see which prompts, retrieval steps, or tools contributed to the final generation—helping explain why models diverged in phrasing or accuracy.

Get comprehensive visibility into your LLM experiments

LLM experimentation helps you refine model parameters, evaluate features, and better understand how your application might behave when confronted with different forms of input. With Datadog LLM Experiments, you can start with Playground to debug traces, promote improvements into datasets, run structured experiments with full span-level instrumentation, and analyze results across multiple models and configurations. This end-to-end workflow ensures you can identify regressions early, validate improvements systematically, and develop LLM applications faster and with more confidence.

Experiments is generally available to all LLM Observability customers. For more information about functional and operational LLM application performance monitoring in production with LLM Observability, see the LLM Observability documentation.

If you’re brand new to Datadog, sign up for a free trial.