How we made a SQL query optimization agent 59% more accurate using autoresearch and LLM Observability
2026-05-20 · via Datadog | The Monitor blog

Without experiment infrastructure to help you test your LLM applications, every research session starts with the same questions: What have we tried previously? What were the numbers? Which prompt version produced that result? Why did we discard that approach? The answers live in scattered notes, terminal history, and half-remembered conversations. Each handoff between sessions loses context. In practice, iteration can slow down as teams get bogged down in testing and analysis.

The Datadog team responsible for building and maintaining Database Monitoring (DBM) needed to tackle these challenges in order to explore whether an AI agent could augment DBM’s automated query optimization recommendations. The DBM team used Karpathy’s autoresearch tool to trigger 23 autonomous experiments that brought the query optimization recommendation agent from precision scores of P=0.54 to P=0.86 overnight. Through this iterative process, the team proceeded through three phases:

  1. Optimizing the prompt and tool chain

  2. Rightsizing the model for an appropriate cost-performance tradeoff

  3. Breaking the LLM call into two separate passes to break through a final performance barrier

In this post, we’ll discuss the autoresearch-powered experimentation process in depth, exploring how the team planned and executed rapid iteration of the agent by using LLM Observability Experiments to track, analyze, and act on the experiment results.

Augmenting DBM’s query optimization recommender with agentic AI

DBM’s query optimization recommender currently uses a multi-source heuristic engine (written in Go) that combines SQL parse-tree analysis, real explain plans, schema metadata, and runtime metrics to detect optimization opportunities. It covers six pattern families:

  • Missing index detection with plan-flip analysis (detects when the planner alternates between strategies)

  • SELECT * expansion with schema-aware column enumeration

  • ORDER BY without LIMIT with metrics-based row-count thresholds

  • OFFSET without ORDER BY (pagination correctness)

  • Idle-in-transaction detection via activity event analysis

  • Comprehensive SQL rewrite rules (OR to ANY, CAST normalization, date-to-range, CTE filter pushdown, and more)

This engine is precise by design. Each pattern is validated against actual database context. Explain plans use relative cost filtering. Metrics-based scoring avoids false positives on small result sets. On our evaluation dataset, it achieves a precision score P=0.903.

The DBM team wanted to see if an AI agent could run after the heuristic engine to discover additional optimization patterns. They hypothesized that an agent could discover types of patterns that are harder to encode as individual heuristic rules because they require cross-referencing multiple signals or understanding subtle semantic tradeoffs. For example:

  • A sequential scan on an indexed column might mean stale statistics (needs ANALYZE), not a missing index.

  • A covering index exists but is not being used as an index-only scan, suggesting a stale visibility map (needs VACUUM).

  • An expensive aggregation query running 15,000 times could benefit from a materialized view.

These kinds of rules require reasoning that combines schema knowledge, plan analysis, and performance judgment. They are difficult to express as individual rules but could be more natural for an AI agent that can see all the signals together.
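As a rough illustration of this kind of cross-signal reasoning, the sketch below reads a plan node together with schema knowledge before choosing a recommendation. The function name `diagnose_seq_scan`, the returned labels, and the `max_kept_fraction` threshold are assumptions made for this example; they are not DBM's heuristics or the agent's actual logic. The plan keys mirror the PostgreSQL explain plan fields used elsewhere in this post.

```python
def diagnose_seq_scan(plan, indexed_columns, filter_column, max_kept_fraction=0.05):
    """Return a hypothetical recommendation label for one explain-plan node."""
    if plan.get("Node Type") != "Seq Scan":
        return "none"
    kept = plan.get("Plan Rows", 0)
    removed = plan.get("Rows Removed by Filter", 0)
    total = kept + removed
    if total == 0 or kept / total > max_kept_fraction:
        # The filter keeps a large share of the table, so a sequential scan is reasonable.
        return "none"
    if filter_column in indexed_columns:
        # An index exists but was skipped on a selective filter: suspect stale statistics.
        return "maintenance:analyze"
    return "missing_index"


plan = {"Node Type": "Seq Scan", "Plan Rows": 100, "Rows Removed by Filter": 99900}
print(diagnose_seq_scan(plan, {"status"}, "status"))  # maintenance:analyze
```

Even this toy version needs three inputs at once (plan shape, row counts, and existing indexes), which hints at why such patterns are awkward to encode as isolated rules.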

The team began testing this hypothesis by feeding an LLM a set of queries with a simple zero-shot prompt: no domain rules, just “analyze this SQL.” It surfaced many more patterns (a recall score R=0.90), but nearly half the suggestions were wrong (P=0.54).

System | Precision | Recall
Heuristic engine | 0.903 | 0.633
AI agent (zero-shot) | 0.543 | 0.898

In other words, the heuristic engine was more precise at finding valid optimizations, but the LLM could find a broader set of potential optimizations. In order for the agentic solution to be practical, the team had to figure out if they could teach the agent to be more precise while preserving this greater breadth. Next, we’ll discuss how the team answered this question by creating a rigorous evaluation dataset and an experiment infrastructure that enables fast iteration.

Building the experiment

In this section, we’ll discuss how the team created the data, evaluators, and experiment infrastructure they used to iterate their SQL optimization agent.

The dataset

To build the evaluation dataset, the team created 100 cases across five types: rewrites, missing indexes, anti-patterns, maintenance, and schema changes. Each case includes the SQL query and the telemetry the agent would see in production: schema, explain plans, metrics, and transaction stats. Of these, 30% are negative cases (queries that need no optimization).

The DBM team created these test cases programmatically using the LLM Observability SDK, as shown in the following code snippet:

from ddtrace.llmobs import LLMObs

LLMObs.enable(site="datadoghq.com", api_key="...", project_name="query-optimization")

records = [
    {
        "input": {
            "sql": "SELECT id, user_id FROM sessions WHERE status = 'expired'",
            "telemetry": {
                "schema": {"tables": {"sessions": {
                    "columns": [{"name": "id", ...}, {"name": "status", ...}],
                    "indexes": [{"name": "idx_status", "definition": "CREATE INDEX ... (status)"}]
                }}},
                "events": {"explain_plans": [{
                    "definition": {"Plan": {"Node Type": "Seq Scan", "Total Cost": 35000,
                                            "Plan Rows": 100, "Rows Removed by Filter": 99900}}
                }]},
            },
        },
        "expected_output": {
            "optimizations": [{"type": "Maintenance", "match_key": "maintenance:analyze:sessions"}]
        },
        "metadata": {"case_id": "E08", "category": "plan_analysis", "difficulty": "hard"},
    },
    # ... 99 more cases across 11 optimization types
]

dataset = LLMObs.create_dataset(dataset_name="pg-optimization-v1", records=records)
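Before uploading records like these, it can help to sanity-check their shape and the share of negative cases. The following is a minimal sketch, not part of the LLM Observability SDK: the helper name `summarize_records` is hypothetical, and it assumes that a negative case is represented by an empty `optimizations` list, which the post does not state.

```python
def summarize_records(records):
    """Check required keys and report the share of negative cases."""
    required = ("input", "expected_output", "metadata")
    for index, record in enumerate(records):
        missing = [key for key in required if key not in record]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
    # Assumption: a negative case (no optimization needed) has an empty list.
    negatives = sum(
        1 for record in records
        if not record["expected_output"].get("optimizations")
    )
    share = negatives / len(records) if records else 0.0
    return {"total": len(records), "negative_share": share}
```

Calling `summarize_records(records)` on the full dataset should report a `negative_share` close to the intended 30% if the assumption about negative cases holds.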

The evaluators

Once the dataset was in place, the team configured evaluators to measure the agent’s performance: precision, recall, and F1. This way, they could summarize the precision-recall tradeoff of each agent iteration with a single metric (F1), and still compare precision and recall separately across experiments. The following screenshot shows how these evaluators are displayed for each experiment run in LLM Observability Experiments.

Screenshot of the LLM Observability Experiments list view showing 20 of 23 experiment runs. Each row displays a status badge, experiment name, Judge F1, Judge Precision, Judge Recall scores, dataset name, time since last run, and the experimenter. Experiment names include “blind-haiku-twopass,” “twopass-cross-model-verify,” “twopass-surgical-verifier,” and others. The top result, “blind-haiku-twopass,” shows the highest F1 of 0.803 with precision 0.860 and recall 0.823.
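The evaluators in this post are LLM-judge based, and their implementation is not shown. As a simpler stand-in, the sketch below scores one case by exact match on `match_key` strings and then averages per-case scores. The function names and the convention for empty predictions are assumptions for illustration only.

```python
def score_case(predicted, expected):
    """Precision, recall, and F1 for one case, given sets of match keys."""
    if not predicted and not expected:
        # Assumed convention: staying silent on a negative case is a perfect score.
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def mean_scores(cases):
    """Average per-case scores over (predicted, expected) pairs."""
    if not cases:
        raise ValueError("at least one case is required")
    per_case = [score_case(predicted, expected) for predicted, expected in cases]
    return {
        metric: sum(scores[metric] for scores in per_case) / len(per_case)
        for metric in ("precision", "recall", "f1")
    }
```

Note that a mean of per-case F1 scores is not, in general, the harmonic mean of the mean precision and mean recall, so the three averaged numbers should be read as separate metrics.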

The autoresearch infrastructure

Karpathy’s autoresearch is a setup where you give an AI agent a small but real LLM training codebase and let it experiment autonomously overnight. The agent modifies train.py, trains for five minutes, checks whether the result improved, keeps or discards the change, and repeats. You wake up in the morning to a log of experiments and a better model.

The design is deliberately simple:

  • One GPU

  • One file the agent edits (train.py)

  • One metric (validation bits per byte)

  • One file the human edits (program.md, the instructions that define the research direction). 

The key idea is that humans are not designing individual experiments. The team sets parameters for the research by writing program.md, and the agent does the rest: proposing changes, running experiments, evaluating results, and deciding what to try next. The agent runs about 12 experiments per hour—roughly 100 overnight.

While autoresearch is designed to optimize model training, the DBM team wanted to apply the same methodology to AI agent development, where the “weights” being tuned are prompts, skills, and tools rather than neural network parameters. The DBM team adapted Karpathy’s tool to iterate the SQL optimization agent; 23 experiments produced 17 kept improvements.

In this case, the configured evaluators form the objective function that the autoresearch agent loop optimizes against. First, the team set a concrete target for this function: P>=0.85, R>=0.85 on a small model. Then, they set a fixed time budget of 15 minutes for each experiment run. Finally, they defined the intended agent behavior in a HANDOFF.md document. This document defines the current state, the error analysis, and the next hypotheses. A coding agent running in the autoresearch environment reads the handoff, designs experiments, runs them via LLM Observability Experiments, analyzes per-case failures, and writes the updated handoff for the next session.
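The keep-or-discard loop with a concrete target can be sketched as follows. This is a simplified illustration, not the team's driver: the names `Scores`, `meets_target`, and `research_loop` are hypothetical, the keep rule (F1 must improve) is an assumption, and candidates come from a fixed list here, whereas the real agent derives each next hypothesis from the handoff and the failure analysis.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float


def meets_target(scores, min_precision=0.85, min_recall=0.85):
    """Target from the post: P>=0.85 and R>=0.85."""
    return scores.precision >= min_precision and scores.recall >= min_recall


def research_loop(baseline, candidates, run_experiment, max_runs=23):
    """Run candidates in order, keeping each one only if F1 improves (assumed rule)."""
    best = baseline
    log = []
    for name in candidates[:max_runs]:
        scores = run_experiment(name)
        kept = scores.f1 > best.f1
        log.append((name, "kept" if kept else "discarded"))
        if kept:
            best = scores
        if meets_target(best):
            break  # stop once the current best satisfies the target
    return best, log
```

Here `run_experiment` is any callable that takes a candidate name and returns a `Scores` object; in the team's setup that role is played by an LLM Observability experiment run.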

Experiment code for one of these autoresearch runs is shown in the following snippet:

# run_optimization, dataset, and the judge_* evaluators are defined elsewhere.
def optimization_task(input_data, config=None):
    """Your agent, wrapped as an experiment task."""
    config = config or {}  # avoid calling .get on None when no config is passed
    return run_optimization(
        sql=input_data["sql"],
        telemetry=input_data["telemetry"],
        model=config.get("model", "anthropic/claude-haiku-4-5"),
    )

experiment = LLMObs.experiment(
    name="haiku-self-verify",
    task=optimization_task,
    dataset=dataset,
    evaluators=[judge_precision, judge_recall, judge_f1],
    config={
        "model": "claude-haiku-4-5",
        "prompt_version": "v20h",
        "phase": "distillation",
        "goal": "Add self-verification step to boost precision",
        "expectation": "+2pp P from double-checking before output"
    },
    description="Self-verification: model reviews each suggestion against evidence before including it.",
)

result = experiment.run(jobs=10)

Each experiment is tagged with the hypothesis (goal), the prediction (expectation), and the research phase. LLM Observability Experiments records all of this as structured metadata alongside the per-case results and agent traces. When the automated driver analyzes failures in the next iteration, this metadata is what it reads to decide what to try next.

Running the experiment

The experiment began with two phases of eight experiments each: first, optimizing the agent’s system prompt, tool descriptions, and worked examples on a large model; then, finding the best way to compress those gains into a smaller model while still meeting the evaluation targets. These two phases produced a result just beneath the target precision score of 0.85. A third phase ran seven more experiments to implement a two-pass solution that finally reached the team’s target. In this section, we’ll discuss how the agent was iterated through each of these phases.

Phase 1: Prompt and tool iteration on a large model

In Phase 1, the autoresearch loop ran eight experiments on Claude Sonnet 4.6, starting with a zero-shot prompt (P=0.543, R=0.898) and iterating across three levers: the system prompt, the tool descriptions, and the worked examples.

The agent used seven tools that mirror production telemetry APIs: get_table_schema, get_explain_plans, get_query_metrics, get_idle_in_transaction_stats, and others. Early experiments focused on how the prompt instructs the agent to use these tools and interpret their output.

These runs produced three key turning points:

  • Structured output and evidence rules pushed precision from 0.54 to 0.83 across the first few experiments. Requiring the agent to cite tool evidence (explain plan costs, schema indexes) before suggesting optimizations eliminated most hallucinations.

  • Relaxing rules caused a regression. One experiment loosened the missing-index co-occurrence rules in the hope of recovering recall, but both precision and recall dipped.

  • Worked examples broke through. Adding three examples of what not to optimize (high-selectivity scans, subqueries with OFFSET, stale statistics) pushed precision to 0.878 while holding recall at 0.858. 
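The evidence rule from the first turning point can also be enforced outside the prompt. The sketch below drops any suggestion that cites no evidence or cites a tool that was never called. The suggestion structure (`evidence` entries with a `tool` field) and the function name `filter_by_evidence` are assumptions for illustration; the post only states that the agent was required to cite tool evidence.

```python
def filter_by_evidence(suggestions, tool_outputs):
    """Keep suggestions whose every evidence entry points at a tool that was called."""
    kept = []
    for suggestion in suggestions:
        cited = suggestion.get("evidence", [])
        # Reject uncited suggestions and citations of tools with no recorded output.
        if cited and all(entry["tool"] in tool_outputs for entry in cited):
            kept.append(suggestion)
    return kept
```

A check like this only confirms that cited tools were called; it does not verify that the cited output actually supports the suggestion, which still requires a judge or a human review.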

After these iterations, blind evaluation on 50 more unseen cases confirmed no overfitting: P=0.870, R=0.830, as shown in the following screenshot:

Screenshot of the LLM Observability Experiments Compare view showing a side-by-side comparison of two experiments: “sonnet-zero-shot” (baseline) and “sonnet-worked-examples” (variant). A results table shows the variant reduced average duration from 22.5s to 15.7s (38% faster), improved F1 from 0.55 to 0.776 (41.2% increase), improved precision from 0.543 to 0.878 (61.4% increase), and slightly reduced recall from 0.898 to 0.858 (4.5% decrease), across 108 experiment runs.

This view in LLM Observability Experiments lets you compare two experiments side by side. Here, we compare the initial zero-shot starting point against the final result of Phase 1. The precision gain is clear: precision rose from 0.543 to 0.878, an increase of more than 60%.

The team could also review full traces of this experiment run within LLM Observability’s trace visualization. In the following screenshot, we can see how the agent called resolve_sql, get_explain_plans, get_query_metrics, and get_table_schema before producing its recommendation.

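Screenshot of the LLM Observability trace view for a Phase 1 experiment run, showing the agent’s tool calls before its recommendation.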

Phase 2: Compressing to a small model

Claude Sonnet 4.6 worked well for Phase 1, but at three times the cost of Haiku 4.5 ($3 input/$15 output per MTok versus $1 input/$5 output per MTok), it made sense to see if the quality gains could be compressed to the smaller model. The autoresearch driver explored two approaches for this.

First, it tried to directly transfer the Sonnet prompt to Haiku and find optimizations. Iterations that streamlined the prompt and added more worked examples failed to make up the response quality deficit introduced by running the original prompt on Haiku instead of Sonnet.

Applying a more rigorous, knowledge distillation–style approach broke through the challenge. The agent compared Sonnet and Haiku traces on the same cases in LLM Observability. In cases where Haiku got the wrong answer, the agent could directly compare with Sonnet for the same input and see exactly how it reasoned: which tools it called, what evidence it weighed, and how it arrived at the correct optimization type. The traces revealed that Haiku was confusing missing indexes with stale statistics and schema changes. The agent extracted four examples from Sonnet’s correct reasoning and added them to Haiku’s prompt. Both precision and recall improved.
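Selecting which cases to mine for distilled examples can be sketched as a simple comparison of per-case outcomes. The function name and the flat mapping from case ID to predicted label are hypothetical; in the post, the comparison was done on full traces in LLM Observability rather than on single labels.

```python
def distillation_candidates(large_results, small_results, expected):
    """Return case IDs where the large model was right and the small model was wrong."""
    return sorted(
        case_id
        for case_id, label in expected.items()
        if large_results.get(case_id) == label and small_results.get(case_id) != label
    )
```

These are the cases where the larger model's trace shows reasoning the smaller model lacks, so they are natural sources for worked examples.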

The loop also experimented with on-demand skills: reusable instructions the agent can invoke for specific tasks like evidence gathering for missing index recommendations. Combining all hypotheses (distilled examples, skills, tool call hints) at once was unstable, but selective combinations worked better. The best single-pass Haiku version used distilled examples plus a self-verification step.

After another blind evaluation on 50 unseen cases, the agent confirmed that the new Haiku prompts generalize. Filtering by model name in LLM Observability surfaces just the Haiku experiments, making it easy to track progress within a single model family. The following screenshot shows the results of this test: P=0.837, R=0.823.

Screenshot of the LLM Observability Experiments view filtered by the search term “haiku,” comparing 11 experiments across 3 fields. A timeline chart plots judge_f1 (blue), judge_precision (pink/orange), and judge_recall (yellow) scores over time. The chart shows generally upward trends for precision and F1 across the session, with recall staying relatively stable, ending with precision and recall both near 0.8.
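The cost gap that motivated Phase 2 can be made concrete with a small calculation. The per-MTok prices below are the ones quoted in this post; the model labels used as dictionary keys and the token counts in the usage line are made up for illustration.

```python
# (input, output) price in USD per million tokens, as quoted in this post.
PRICES_PER_MTOK = {
    "sonnet-4.6": (3.0, 15.0),
    "haiku-4.5": (1.0, 5.0),
}


def run_cost(model, input_tokens, output_tokens):
    """Cost in USD of one run with the given token usage."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical usage for a full dataset run: 1M input tokens, 100k output tokens.
print(run_cost("sonnet-4.6", 1_000_000, 100_000) / run_cost("haiku-4.5", 1_000_000, 100_000))
```

Because both the input and output prices differ by the same factor, the ratio stays at three for any token mix, as long as both models consume the same number of tokens.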

Breaking through the single-pass ceiling

These results were strong, but just shy of the P=0.85 target the team had set. However, the autoresearch driver couldn’t find a way to improve them any further while sticking to a single Haiku call. The driver proposed splitting the problem into two passes: a high-recall detector followed by a surgical verifier.

The first iteration of the verifier was too aggressive and significantly reduced recall (P=0.921, R=0.588). The second was too soft to bring precision above the bar. The third struck the best balance by checking only five specific false-positive patterns identified through per-case error analysis.
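The shape of the two-pass design can be sketched with plain functions standing in for the two LLM calls. Everything named here is hypothetical: `two_pass`, the `index_already_exists` pattern, and the case and suggestion fields are illustrations, and the post does not list the five false-positive patterns the final verifier checked.

```python
def index_already_exists(case, suggestion):
    """Hypothetical false-positive pattern: a missing-index suggestion on an indexed column."""
    return (
        suggestion["type"] == "missing_index"
        and suggestion.get("column") in case.get("indexed_columns", set())
    )


def two_pass(case, detect, false_positive_patterns):
    """Pass 1 favors recall; pass 2 removes only suggestions matching a known pattern."""
    candidates = detect(case)
    return [
        suggestion
        for suggestion in candidates
        if not any(pattern(case, suggestion) for pattern in false_positive_patterns)
    ]
```

Restricting the second pass to a short, explicit list of patterns is what keeps it "surgical": anything that does not match a known false-positive pattern passes through untouched, so recall is preserved.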

The autoresearch agent also tested cross-model verification (Sonnet as a verifier for Haiku) and distillation to GPT-5.4 nano, but the Haiku-only two-pass approach described above worked best. A final blind evaluation on 50 unseen cases produced P=0.860, R=0.823, F1=0.803, as shown in the following screenshot.

Screenshot of the LLM Observability Experiments Compare view contrasting “blind-haiku-single-pass” (baseline) against “blind-haiku-twopass” (variant) across 50 unseen cases. The two-pass variant increased average duration from 11.9s to 19.1s (60.5% longer), improved F1 from 0.779 to 0.803 (3.0% increase), improved precision from 0.837 to 0.860 (2.8% increase), and held recall steady at 0.823 (0.0% change).

23 experiments later

The following graph shows the full journey taken by the autoresearch agent. It performed 23 experiments across three phases. Each discarded experiment narrowed the search space and informed the next hypothesis. 17 improvements were kept, while 6 were discarded. F1 progressed from 0.550 (zero-shot) to 0.803 (two-pass Haiku).

Screenshot of the LLM Observability Experiments timeline view comparing all 23 experiments across 3 metric fields, split visually into two sections: “Sonnet optimization” on the left and “Haiku optimization” on the right. Three colored metrics—judge_f1, judge_precision, and judge_recall—are plotted as dots over time, showing an overall upward trend from low precision scores in early Sonnet experiments to converging high scores in the final Haiku two-pass experiments. The final highlighted data point shows scores reaching approximately 0.8 or above across all three metrics.

At each step, the autoresearch reasoning and analysis output was saved to the corresponding experiment in LLM Observability Experiments as an audit log. This experiment infrastructure made it easy for the DBM team to track and analyze each step of this process, so they could find key learnings and understand what the autoresearch system had produced. LLM Observability Experiments enabled this by making every experiment a first-class object with:

A single source of truth

Every experiment records its configuration (model, prompt version, variables changed), its hypothesis (goal and expectation tags), and its results (per-case precision, recall, and F1 from the LLM judge). There is no “I think we tried that,” because the experiment list shows exactly what was tried and what happened. It’s also easy to surface experiments with common attributes (model or prompt version, tool path, etc.) and compare their evaluator scores. This makes validating experiments’ performance gains much simpler and more reliable.

Per-case trace inspection

When an experiment regresses, you need to understand why at the case level. LLM Observability Experiments captures the full agent trace for every case: which tools were called, what the model reasoned about, and what it produced. We used this to discover that Haiku was recommending new indexes when the real problem was stale statistics, which directly informed the distillation examples.

Filtering and grouping

Each experiment is tagged with phase, model family, and the variable that was changed (prompt, example, architecture). Filtering by haiku surfaces just the 11 Haiku experiments. Grouping by variable type reveals that architecture changes produced the biggest gains. These queries let you ask, “What have we tried on this model?” and get an answer in seconds.
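The same filtering and grouping questions can be asked of any exported list of experiment records. The sketch below is not the LLM Observability query interface; the record fields (`model`, `variable`, `f1_delta`) and both function names are assumptions chosen for illustration.

```python
def filter_by_model(experiments, term):
    """Return experiments whose model name contains the search term."""
    return [exp for exp in experiments if term in exp["model"]]


def best_delta_by_variable(experiments):
    """For each changed variable, report the largest F1 change it produced."""
    best = {}
    for exp in experiments:
        variable = exp["variable"]
        if variable not in best or exp["f1_delta"] > best[variable]:
            best[variable] = exp["f1_delta"]
    return best
```

With records tagged this way, "What have we tried on this model?" becomes `filter_by_model(experiments, "haiku")`, and comparing levers becomes a single call to `best_delta_by_variable`.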

Reproducibility

Every experiment command is fully specified: the same dataset, the same model, the same prompt version. If a result looks surprising, you can rerun the experiment and compare. The loop ran blind evals after each phase specifically because the experiment infrastructure made it cheap to do so.

The autoresearch loop produces experiments at a pace that overwhelms manual tracking. At four to eight experiments per session, the research history becomes unmanageable within a week. By supporting this process with LLM Observability Experiments, the DBM team was able to make the system practical and sustainable.

Try it yourself

This agentic experimentation methodology works for any AI agent, not just query optimization. The ingredients:

  1. An evaluation dataset with real inputs, expected outputs, and metadata

  2. A task function that wraps your agent

  3. Evaluators that score output quality

  4. The loop: hypothesize, experiment, measure, keep or discard

To learn more about running your own experiments, see our guide for building offline evaluations, and dive into the LLM Observability Experiments documentation. LLM Observability now has a free tier for your first 40,000 LLM spans. If you’re new to Datadog, sign up for a 14-day free trial.