We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage.
Marcus Chen · 2026-05-26 · via DEV Community

TL;DR: We pulled 41,000 production agent traces at Nexus Labs to build a fine-tuning dataset. After a manual audit of 1,200 of them, ~48% were unusable: tool calls that "succeeded" but returned wrong data, retries masking provider failures, and silent fallbacks that changed which model answered. Putting Bifrost in front of the agent fleet fixed the trace problem more than any sampling strategy we tried.

We run an enterprise agent product. Sales-ops automations mostly. Each user task ends up as a chain of 8-40 tool calls across a planner model, a worker model, and roughly 12 internal tools.

For the last quarter my team has been building a fine-tune dataset from real traces. The plan was straightforward. Pull successful task completions. Filter by user thumbs-up. Use the trace as the training signal.

It did not work.

What "successful" actually meant in our traces

The first audit pass was 1,200 traces, two engineers, three weeks. We tagged each trace as "clean", "noisy", or "corrupted".

| Category | % of traces | What it meant |
| --- | --- | --- |
| Clean | 52% | Tool calls returned correct data, model picked the right next step |
| Noisy | 31% | Right answer eventually, but with hidden retries, fallback to a different model, or stale cache hits |
| Corrupted | 17% | Trace claimed success, output was wrong. User had not noticed yet. |

The noisy category is the one that broke me. We had been treating these as gold-standard data. A typical example: the planner called crm_lookup, got a 500, retried twice, then succeeded on a fallback Anthropic key while the original trace span still pointed at OpenAI gpt-4o. The training pair we would have generated: "given this user input, output this tool call sequence." But the sequence was the result of three providers and two model versions stitched together. No reproducibility.

Worse: nothing in our trace told us which model actually produced the final answer. We had a model field. It logged whichever provider was configured at request start.

Why we ended up putting a gateway in front of everything

We tried two things first. Both partial fixes.

The first was logging at the application layer. Wrap every provider call, log model, latency, retry count, fallback path. This works until you have four services calling four SDKs with four retry policies. Our Python service used the official openai client. Our Go service used a hand-rolled HTTP client. The TypeScript planner used Vercel AI SDK. Three different definitions of "retry".
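
A minimal sketch of what that application-layer wrapper can look like in Python. The function name `call_with_audit`, the record fields, and the stub providers are all hypothetical, not code from our services. The point it illustrates is that `served` is written only after a call succeeds, instead of being copied from the configuration at request start.

```python
import time
from typing import Any, Callable


def call_with_audit(
    chain: list[tuple[str, Callable[[str], Any]]],
    prompt: str,
    max_retries: int = 2,
) -> tuple[Any, dict]:
    """Try each (model_name, call) pair in order; return the result and an audit record."""
    record = {
        "requested": chain[0][0],  # what the caller asked for
        "served": None,            # filled in only after a call succeeds
        "failed_attempts": 0,      # every failed call, across all models
        "fallback_path": [],       # every model tried, in order
    }
    start = time.monotonic()
    for name, call in chain:
        record["fallback_path"].append(name)
        for _ in range(max_retries + 1):  # first attempt plus retries
            try:
                result = call(prompt)
            except Exception:
                record["failed_attempts"] += 1
                continue
            record["served"] = name
            record["latency_s"] = time.monotonic() - start
            return result, record
    record["latency_s"] = time.monotonic() - start
    raise RuntimeError(f"all models failed: {record}")


# Stub providers standing in for real SDK calls.
def always_500(prompt: str) -> str:
    raise ConnectionError("HTTP 500")


def echo(prompt: str) -> str:
    return f"ok: {prompt}"


result, record = call_with_audit(
    [("openai/gpt-4o", always_500), ("anthropic/claude-sonnet-4-6", echo)],
    "look up account 42",
)
print(record["requested"], "->", record["served"], record["failed_attempts"])
```

The sketch leaves out backoff and error classification on purpose. Those are exactly the details that diverge once each service brings its own SDK and retry policy.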

The second was forcing all traffic through LiteLLM. It got us to a unified call surface but the observability was thin for our needs, and the failover behaviour was harder to reason about under load. Not a knock on LiteLLM, it just was not the shape we wanted.

We migrated the fleet behind Bifrost about five months ago. Two reasons specific to our problem:

  1. The Automatic Fallbacks config makes the fallback chain a first-class object. When a request fails over from Anthropic to Bedrock, that is in the response metadata. Not in three different log lines you have to join.
  2. Native Prometheus metrics (observability docs) meant bifrost_requests_total is tagged by the actual provider that served the request, not the one we asked for.
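
To make the second point concrete, here is a rough sketch that sums `bifrost_requests_total` per provider from a Prometheus text-format scrape. The metric name comes from the text above; the label names (`provider`, `model`) and the sample numbers are assumptions made up for illustration, so check the real label set on your own `/metrics` output before relying on it.

```python
import re
from collections import defaultdict

# Made-up scrape output in the Prometheus text exposition format.
SAMPLE = '''\
# TYPE bifrost_requests_total counter
bifrost_requests_total{provider="openai",model="gpt-4o"} 900
bifrost_requests_total{provider="anthropic",model="claude-sonnet-4-6"} 80
bifrost_requests_total{provider="openai",model="gpt-4o-mini"} 20
'''

LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def requests_by_provider(text: str, metric: str = "bifrost_requests_total") -> dict:
    """Sum one counter per value of the (assumed) 'provider' label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m or m["name"] != metric:
            continue  # skips comments, blank lines, and other metrics
        labels = dict(LABEL.findall(m["labels"]))
        totals[labels.get("provider", "unknown")] += float(m["value"])
    return dict(totals)


totals = requests_by_provider(SAMPLE)
served_elsewhere = totals.get("anthropic", 0.0) / sum(totals.values())
print(totals, f"{served_elsewhere:.0%} served by the fallback provider")
```

The regex handles only simple label values without escaped quotes, which is enough for a sketch but not a full exposition-format parser.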

Here is a chunk of the config that mattered for trace cleanup:

providers:
  openai:
    keys:
      - value: env.OPENAI_API_KEY_1
        weight: 0.7
      - value: env.OPENAI_API_KEY_2
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_API_KEY

fallbacks:
  - model: openai/gpt-4o
    fallback_to:
      - anthropic/claude-sonnet-4-6
      - openai/gpt-4o-mini

logging:
  include_fallback_chain: true
  include_provider_actual: true


The two include_* flags meant every trace span we emitted downstream had a deterministic answer to "who served this token". Our corrupted-trace rate on the next 5,000 sampled traces dropped from 17% to under 3%.

What the audit actually changed about our fine-tuning

We stopped using user thumbs-up as the primary filter. Thumbs-up correlates with "user got what they wanted eventually", not "the model made the right call". Now the filter is:

  • Single-provider, single-model trace (no fallback fired)
  • No retry on any tool call
  • Tool call result schemas validated post-hoc against a recorded ground truth
  • Span timing within 1.5x median for that task class

That filter throws away about 71% of our raw traces. Painful. But the 29% that survives is data we can actually train on.
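
The four criteria above can be sketched as a single predicate. The `Trace` fields and function names below are hypothetical, and the timing check is simplified to whole-trace duration rather than per-span timing, so treat it as an outline of the filter and not our production code.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Trace:
    task_class: str
    models_served: frozenset   # distinct provider/model pairs that answered
    fallback_fired: bool
    tool_retries: int          # retries summed over all tool calls
    tool_results_valid: bool   # outcome of the post-hoc ground-truth check
    duration_s: float


def is_trainable(trace: Trace, median_duration_s: float, max_slowdown: float = 1.5) -> bool:
    """Apply the four filter criteria to one trace."""
    return (
        len(trace.models_served) == 1
        and not trace.fallback_fired
        and trace.tool_retries == 0
        and trace.tool_results_valid
        and trace.duration_s <= max_slowdown * median_duration_s
    )


def filter_traces(traces: list) -> list:
    """Keep only trainable traces, using the median duration of each task class."""
    durations = {}
    for t in traces:
        durations.setdefault(t.task_class, []).append(t.duration_s)
    medians = {cls: median(vals) for cls, vals in durations.items()}
    return [t for t in traces if is_trainable(t, medians[t.task_class])]


gpt4o = frozenset({"openai/gpt-4o"})
traces = [
    Trace("crm_update", gpt4o, False, 0, True, 10.0),  # clean
    Trace("crm_update", gpt4o | {"anthropic/claude-sonnet-4-6"}, True, 0, True, 12.0),  # fallback fired
    Trace("crm_update", gpt4o, False, 0, True, 40.0),  # over 1.5x the class median
]
kept = filter_traces(traces)
print(len(kept), "of", len(traces), "traces kept")
```

Note that the median here is computed over the raw traces, slow ones included. Computing it over a known-good baseline instead is a reasonable variation.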

Trade-offs and limitations

Honest take on what this did not solve.

  • Bifrost is not a debugger. It tells you which provider served the request and whether a fallback fired. It does not tell you whether the tool result was correct. We still need the post-hoc schema validation pass.
  • Semantic caching (docs) made the corruption worse before it got better. Cache hits looked like fresh model calls in our old logging. We had to explicitly tag cached responses in the trace pipeline. Once tagged, fine, but the default was confusing.
  • LiteLLM has a larger provider list in the long tail. If you need niche providers, check both before committing.
  • Portkey's prompt management UI is nicer. We do prompt management elsewhere so it did not matter for us. If you want one tool for both, Portkey is worth a look.
  • The MCP gateway feature (docs) is interesting but we have not put it in production. Cannot vouch for it yet.
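
On the semantic caching point, the explicit tagging step can be as small as the sketch below. The `cache_hit` metadata key and the span shape are placeholders invented for illustration; they are not Bifrost's actual response format, so map them to whatever your gateway really returns.

```python
def tag_span_source(span: dict, response_meta: dict) -> dict:
    """Return a copy of the span with an explicit 'source' of 'cache' or 'model'."""
    tagged = dict(span)
    # 'cache_hit' is a hypothetical key standing in for the gateway's real signal.
    tagged["source"] = "cache" if response_meta.get("cache_hit") else "model"
    return tagged


def fresh_only(spans: list) -> list:
    """Keep spans known to be fresh model calls; untagged spans are dropped."""
    return [s for s in spans if s.get("source") == "model"]


spans = [
    tag_span_source({"id": "a"}, {"cache_hit": True}),
    tag_span_source({"id": "b"}, {"cache_hit": False}),
    {"id": "c"},  # legacy span that never went through tagging
]
print([s["id"] for s in fresh_only(spans)])
```

Dropping untagged spans is the conservative choice: a span that cannot prove it was a fresh model call is treated as if it were a cache hit.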

The model is the easy part. The infrastructure around the trace is where your eval dataset lives or dies.

Further Reading

  • Bifrost retries and fallbacks docs
  • Bifrost observability defaults
  • LiteLLM proxy docs for honest comparison
  • Anthropic's tool use guide — the trace structure section is the relevant one
  • OpenTelemetry GenAI semantic conventions — what we wish our old logging had matched