LLM Trace Storage Cost: Why Your S3 Bill Exploded, and 3 Fixes
Gabriel Anha · 2026-05-24 · via DEV Community

The Slack message lands on a Tuesday. Finance: "S3 spend tripled last quarter. What changed?" Engineering: "Nothing." Both are correct. Two months ago, someone added LLM tracing (prompts in, responses out, full payload on every span). Nobody set a retention policy. The bucket grew at 14 GB a day, then 22, then 31.

This isn't a corner case. It's the default shape of every LLM tracing pipeline shipped without a payload strategy. The good news: three fixes, ordered by how cheap they are to deploy, each cuts the bill without breaking the workflow you actually use traces for.

Where the bytes go

A normal OTel span is small. Trace ID, span ID, parent, attributes, events, status. Maybe 2 KB if you've decorated it with HTTP method, status code, user ID, region. The kind of span your APM has stored for a decade at a price you stopped looking at.

An LLM span is not normal. It carries the prompt: system message, full chat history, retrieved context, tool definitions, response schema. Then the response. Then sometimes the reasoning trace if you turned that on. A single span on a long agent turn runs 80 KB. A 47-turn agent run hits 4 MB. At 200 requests per second, the payload bytes outweigh the span metadata roughly 50 to 1.

So when you look at S3 and see the bucket growing 30 GB a day, that's not a span explosion. It's text. Text you wrote into the trace because the SDK said "set gen_ai.prompt" and you obliged.
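
To sanity-check a growth figure like that, a back-of-the-envelope estimate helps. The sketch below is illustrative only: the function name and the example inputs (5 payload-carrying spans per second at 80 KB each) are assumptions, not measurements from a real system.

```python
SECONDS_PER_DAY = 86_400


def daily_payload_gb(spans_per_second: float, avg_payload_kb: float) -> float:
    """Estimate payload text written per day, in decimal GB (1 GB = 1e9 bytes)."""
    bytes_per_day = spans_per_second * avg_payload_kb * 1_000 * SECONDS_PER_DAY
    return bytes_per_day / 1e9


# Hypothetical workload: 5 spans/s, each carrying 80 KB of prompt + response text.
print(round(daily_payload_gb(5, 80), 2))  # 34.56
```

Even a modest span rate lands in the tens of GB per day once every span carries its full prompt and response.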

The first instinct is to turn down sampling. That's the wrong instinct. The right one is to think about which payloads earn their storage.

Fix 1: sample on success, full retention on error

Production LLM traffic has two populations. The 99% that worked and look identical to the last 10,000 working traces. And the 1% that failed, errored, timed out, returned nonsense, got flagged by an eval, or generated a complaint. Storing the second population is the whole reason you have tracing. Storing all of the first one is the bill.

A sane default samples the success population aggressively and keeps every failure intact. Tail-based sampling makes this trivial because the decision happens after the span finishes. By then you know whether it errored, whether it tripped a hallucination detector, whether latency went over your SLO.

import random
from opentelemetry.trace import StatusCode

def should_keep_payload(span):
    # Always keep payloads for failures, eval-flagged spans, and slow requests.
    if span.status.status_code == StatusCode.ERROR:
        return True
    if span.attributes.get("llm.eval.flagged"):
        return True
    if span.attributes.get("llm.latency_ms", 0) > 5000:
        return True
    # success path: keep 1 in 50
    return random.random() < 0.02

That's it. You still emit the span with metadata, token counts, latency, model name, so dashboards and cost reports stay accurate. You just drop the prompt and response bytes on 98% of the boring traffic.

This single change usually cuts payload storage by 90%+. If you do nothing else from this post, do this.
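
A rough way to see where the saving comes from, sketched under stated assumptions: `expected_kept_fraction` and `strip_payload` are hypothetical helpers, and `always_kept_rate` stands for the share of spans that hit any always-keep rule (errors, eval flags, slow requests).

```python
def expected_kept_fraction(always_kept_rate: float, success_sample_rate: float) -> float:
    """Share of spans whose payload survives: every always-kept span plus a sample of the rest."""
    return always_kept_rate + (1.0 - always_kept_rate) * success_sample_rate


def strip_payload(attributes: dict, keep: bool,
                  payload_keys=("gen_ai.prompt", "gen_ai.response")) -> dict:
    """Return a copy of the attributes; payload text is replaced when the span is sampled out."""
    if keep:
        return dict(attributes)
    return {k: ("[sampled-out]" if k in payload_keys else v) for k, v in attributes.items()}


# 1% of spans always kept, 1 in 50 of the remainder sampled.
print(round(expected_kept_fraction(0.01, 0.02), 4))  # 0.0298
```

With those inputs about 3% of payloads are stored, while metadata such as token counts and latency passes through `strip_payload` untouched.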

Fix 2: tiered retention

The second pattern teams under-use. Hot, warm, cold. Three buckets, three lifecycle rules, three price points.

Recent traces are the ones engineers actually open. Last 24 hours, definitely. Last 7 days, often. After that, the access pattern collapses. Somebody pulls a 30-day-old trace once a month during a postmortem. Paying S3 Standard prices for that traffic is theatre.

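# S3 lifecycle rule (CloudFormation syntax, under the bucket's Properties)
LifecycleConfiguration:
  Rules:
    - Id: llm-traces-tiering
      Status: Enabled
      Prefix: traces/
      Transitions:
        # S3 does not allow lifecycle transitions to STANDARD_IA before day 30,
        # so the first hop is day 30 even though access drops off after day 7.
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
        # GLACIER_IR has a 90-day minimum storage duration; day 90 -> 180 satisfies it.
        - TransitionInDays: 90
          StorageClass: GLACIER_IR
        - TransitionInDays: 180
          StorageClass: DEEP_ARCHIVE
      ExpirationInDays: 730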

S3 Standard runs ~$0.023 per GB-month. Standard-IA drops to ~$0.0125. Glacier Instant Retrieval lands around $0.004. Deep Archive bottoms at ~$0.00099. The lifecycle transitions cost a few cents per 1k objects, but you make that back inside a day on a busy bucket.
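
To put those prices side by side, here is a minimal steady-state estimate. It is a sketch with loud assumptions: it counts storage only (no request, transition, retrieval, or minimum-duration charges), uses the approximate per-GB prices quoted above, and the function name, the 30 GB/day ingest, and the day boundaries are illustrative inputs.

```python
# Approximate prices in USD per GB-month, as quoted in the text above.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def steady_state_monthly_cost(daily_gb, tiers, expire_day):
    """tiers: [(storage_class, first_day_in_class), ...] ordered by day, starting at day 0.

    At steady state each tier holds daily_gb for every day an object spends in it.
    """
    total = 0.0
    for i, (storage_class, start_day) in enumerate(tiers):
        end_day = tiers[i + 1][1] if i + 1 < len(tiers) else expire_day
        resident_gb = daily_gb * (end_day - start_day)
        total += resident_gb * PRICE_PER_GB_MONTH[storage_class]
    return total


tiered = steady_state_monthly_cost(
    30,
    [("STANDARD", 0), ("STANDARD_IA", 30), ("GLACIER_IR", 90), ("DEEP_ARCHIVE", 180)],
    expire_day=730,
)
flat = steady_state_monthly_cost(30, [("STANDARD", 0)], expire_day=730)
print(f"tiered ~${tiered:.2f}/month vs all-Standard ~${flat:.2f}/month")
```

Swap in your own ingest rate and boundaries; the point is the shape of the comparison, not the exact dollar figures.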

One gotcha worth surfacing: Standard-IA has a 128 KB minimum billing size per object. If you're writing one-span-per-object at small payload sizes, you'll pay for 128 KB even when the object is 4 KB. Batch your trace writes (one object per minute per trace stream, or roll up by trace ID) so each object is at least a few hundred KB. The teams that skip this step end up with IA bills that look like Standard bills and write angry blog posts about how tiering doesn't work.
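
One way to batch writes so objects clear the 128 KB floor is sketched below. `BatchedObjectWriter` and its `flush` callback are hypothetical names; in a real pipeline `flush` would wrap whatever uploads one object, and you would also flush on a timer so quiet streams don't hold data indefinitely.

```python
from typing import Callable, List


class BatchedObjectWriter:
    """Buffer serialized spans and emit one newline-delimited object per batch."""

    def __init__(self, flush: Callable[[bytes], None], min_object_bytes: int = 256 * 1024):
        self._flush = flush
        self._min_object_bytes = min_object_bytes
        self._records: List[bytes] = []
        self._size = 0

    def add(self, record: bytes) -> None:
        self._records.append(record)
        self._size += len(record) + 1  # +1 for the newline written after each record
        if self._size >= self._min_object_bytes:
            self.close()

    def close(self) -> None:
        """Write out whatever is buffered, even if it is below the threshold."""
        if self._records:
            self._flush(b"\n".join(self._records) + b"\n")
            self._records, self._size = [], 0


objects = []  # stand-in for uploaded objects
writer = BatchedObjectWriter(objects.append)
for _ in range(100):
    writer.add(b"x" * 4095)  # 4 KB per record including the newline
writer.close()
print([len(o) for o in objects])  # [262144, 147456]
```

The first object is flushed as soon as it reaches 256 KB; the final `close()` writes the remainder, which here is still above 128 KB.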

Fix 3: payload truncation with rehydration tokens

The third fix targets the long tail. A 4 MB agent transcript is the outlier that wrecks averages. You don't want to drop it (the engineer debugging the agent loop needs it) but you also don't want it inlined on the span in your hot trace store.

The pattern: truncate the payload on the span itself, write the full version to object storage under a content-addressed key, and store the key as an attribute. The trace UI shows the truncation up front and offers a rehydrate-on-click button.

import hashlib

def truncate_with_token(payload: str, span, max_inline: int = 2048):
    # Assumes `s3` (a boto3 S3 client) and PAYLOAD_BUCKET are defined at module level.
    if len(payload) <= max_inline:
        return payload
    body = payload.encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()
    key = f"traces/payloads/{digest[:2]}/{digest}.txt"
    s3.put_object(Bucket=PAYLOAD_BUCKET, Key=key, Body=body)
    span.set_attribute("llm.payload.s3_key", key)
    span.set_attribute("llm.payload.full_bytes", len(body))  # byte length, not character count
    return payload[:max_inline] + f"\n…[truncated, rehydrate: {digest[:12]}]"

Content-addressed keys mean identical payloads (system prompts, common tool definitions, repeated user queries) dedupe for free. On a real agent workload that's another 40-60% storage win because system prompts are the same on every span and you stop paying to store the same 8 KB block a million times.
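
The deduplication effect is easy to demonstrate without touching S3. The class below is a hypothetical in-memory stand-in (a dict instead of a bucket) that uses the same key scheme as the function above; it only illustrates why identical payloads collapse to one stored object.

```python
import hashlib


class ContentAddressedStore:
    """In-memory stand-in for a payload bucket keyed by SHA-256 of the content."""

    def __init__(self):
        self._objects = {}
        self.writes_skipped = 0

    def put(self, payload: str) -> str:
        body = payload.encode("utf-8")
        digest = hashlib.sha256(body).hexdigest()
        key = f"traces/payloads/{digest[:2]}/{digest}.txt"
        if key in self._objects:
            self.writes_skipped += 1  # same bytes already stored under this key
        else:
            self._objects[key] = body
        return key

    def get(self, key: str) -> str:
        return self._objects[key].decode("utf-8")

    def __len__(self) -> int:
        return len(self._objects)


store = ContentAddressedStore()
system_prompt = "You are a helpful assistant. " * 300
keys = [store.put(system_prompt) for _ in range(1000)]
print(len(store), store.writes_skipped)  # 1 999
```

Against real S3, writing the same key twice simply overwrites identical bytes, so storage is deduplicated either way; skipping the second write only saves the PUT request.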

A 40-line OTel SpanProcessor that does all three

This is the version that ships. It's a BatchSpanProcessor wrapper that runs sampling, truncation, redaction, and rehydration-token rewriting before the span hits the exporter. Drop it in front of whatever exporter you use (Tempo, Honeycomb's OTLP endpoint, an S3-backed pipeline).

import random, hashlib, re
from opentelemetry.sdk.trace import SpanProcessor
from opentelemetry.trace import StatusCode

PII = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[email]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        "[ssn]"),
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"),      "[card]"),
]
PAYLOAD_ATTRS = ("gen_ai.prompt", "gen_ai.response", "llm.input", "llm.output")

class LLMPayloadProcessor(SpanProcessor):
    def __init__(self, downstream, s3, bucket, inline_limit=2048):
        self.downstream, self.s3, self.bucket = downstream, s3, bucket
        self.inline_limit = inline_limit

    def _keep_full(self, span) -> bool:
        if span.status.status_code == StatusCode.ERROR: return True
        if span.attributes.get("llm.eval.flagged"):     return True
        if span.attributes.get("llm.latency_ms", 0) > 5000: return True
        return random.random() < 0.02

    def _redact(self, text: str) -> str:
        for pattern, repl in PII:
            text = pattern.sub(repl, text)
        return text

    def on_end(self, span):
        keep = self._keep_full(span)
        for key in PAYLOAD_ATTRS:
            raw = span.attributes.get(key)
            if not isinstance(raw, str) or not raw: continue  # skip missing or non-string values
            clean = self._redact(raw)              # PII out before anything else
            if not keep:
                span._attributes[key] = "[sampled-out]"
                continue
            if len(clean) > self.inline_limit:
                # Oversized payload: store the full text under a content-addressed key
                body = clean.encode("utf-8")
                digest = hashlib.sha256(body).hexdigest()
                obj_key = f"traces/payloads/{digest[:2]}/{digest}.txt"
                self.s3.put_object(Bucket=self.bucket, Key=obj_key, Body=body)
                span._attributes[key] = clean[:self.inline_limit] + f"\n…[rehydrate:{digest[:12]}]"
                span._attributes[f"{key}.s3_key"] = obj_key
            else:
                span._attributes[key] = clean
        self.downstream.on_end(span)

    def shutdown(self): self.downstream.shutdown()
    def force_flush(self, timeout_millis=30000):
        return self.downstream.force_flush(timeout_millis)

Wire it up like this:

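import boto3
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# gRPC exporter shown; an HTTP variant lives under ...otlp.proto.http.trace_exporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    LLMPayloadProcessor(
        downstream=BatchSpanProcessor(OTLPSpanExporter()),
        s3=boto3.client("s3"),
        bucket="acme-llm-payloads",
    )
)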

A few things that aren't accidents in the code above. PII redaction runs on every payload before the sampling decision is applied, so even the dropped payloads have been cleaned in case downstream logging picks them up. Content-addressed S3 keys give you free deduplication. The s3_key attribute is what the trace UI uses to offer rehydration, and you can write a tiny Lambda that hands out a signed URL to serve it. The sampling thresholds are tunable per environment: if the error rate in dev is 30%, relax the "always keep errors" rule there so it doesn't bury you.
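
The rehydration side can be as small as the sketch below. `rehydration_urls` is a hypothetical helper; it assumes the `<attribute>.s3_key` naming used by the processor above and relies on boto3's `generate_presigned_url`, so the only thing the trace UI needs is a short-lived link.

```python
def rehydration_urls(s3, bucket: str, span_attributes: dict, expires_in: int = 300) -> dict:
    """Map each truncated payload attribute to a short-lived download URL.

    `s3` is expected to be a boto3 S3 client (anything exposing generate_presigned_url).
    """
    suffix = ".s3_key"
    urls = {}
    for name, value in span_attributes.items():
        if name.endswith(suffix):
            urls[name[: -len(suffix)]] = s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": bucket, "Key": value},
                ExpiresIn=expires_in,  # seconds the link stays valid
            )
    return urls
```

Called from a Lambda or any small backend with `boto3.client("s3")`, this returns something like `{"gen_ai.prompt": "https://..."}` for the UI's "load full transcript" button. Keep the expiry short and put the endpoint behind the same access checks as the trace store.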

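The mutation of span._attributes is the one rough edge. OTel's public API treats a span as read-only once it has ended, and _attributes is a private field of the Python SDK that can change between releases. on_end runs synchronously on the thread that ended the span, after the span has stopped accepting writes and before BatchSpanProcessor queues it for export on its worker thread, so in practice the mutation is safe. The same synchronous call means the S3 put_object adds latency to the request path, so move the upload to a background queue if that matters. If you want to be strictly correct, wrap the span and re-emit it through a custom exporter instead.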

The gotcha: PII redaction before storage, not on retrieval

The instinct that bites teams hardest is "store the raw payload, redact on read." It seems reasonable. You keep the original in case you need it. You serve a redacted version to engineers without a need-to-know. You move on.

Then GDPR shows up, or SOC 2, or a customer data subject request (DSR). Now you owe an auditable answer to "how was personal data stored, who accessed it, how do we delete it?" The answer "we redact on read" means raw PII is sitting in S3 indefinitely. That's the storage event regulators care about, not the display event.

Redact in the span processor, before the export ever happens. The truncated payload that goes to S3 should already have emails, SSNs, card numbers, phone numbers, and any domain-specific identifiers (customer IDs, internal account numbers) replaced with tokens. Keep a separate, access-controlled, encrypted-at-rest pathway for the rare cases where the raw payload is needed for an incident. Make that pathway opt-in per request, not the default state of your trace store.

The redaction step in the processor above is the minimum. Add a per-tenant rule layer if you serve regulated industries. And run the redactor against your own eval set monthly, because the day someone adds a new entity type and forgets to update the regex is the day raw card numbers start hitting your trace store again.
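
A redactor regression check can be very small. The sketch below reuses the three patterns from the processor; the eval set and the `find_leaks` helper are hypothetical, and the phone-number case is included on purpose to show what a gap looks like.

```python
import re

# Same patterns as the processor above; extend the list as new entity types appear.
PII = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[email]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[ssn]"),
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"), "[card]"),
]


def redact(text: str) -> str:
    for pattern, repl in PII:
        text = pattern.sub(repl, text)
    return text


# Hypothetical eval set: (input text, secret that must not survive redaction).
EVAL_SET = [
    ("contact alice@example.com now", "alice@example.com"),
    ("ssn 123-45-6789 on file", "123-45-6789"),
    ("card 4111 1111 1111 1111", "4111 1111 1111 1111"),
    ("call +1 415 555 0100", "415 555 0100"),  # no phone pattern yet, so this leaks
]


def find_leaks(cases):
    """Return the inputs whose secret is still present after redaction."""
    return [text for text, secret in cases if secret in redact(text)]


print(find_leaks(EVAL_SET))  # ['call +1 415 555 0100']
```

Run it on a schedule or in CI and fail the job when `find_leaks` returns anything. Here it reports the phone number, which is the cue to add a phone pattern before that data reaches the trace store.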

What this gets you

Done together (sample-on-success, tiered retention, truncate-with-rehydration, redact-before-store) the same workload that was costing $9k/month in S3 lands somewhere between $400 and $900. Debugging stays intact because errors and slow paths keep full payloads. Compliance posture improves because PII isn't sitting in object storage waiting to be discovered. Engineers don't notice the change in the trace UI except that long agent traces now have a "load full transcript" button.

The thing nobody tells you when you ship LLM tracing: payload economics are the design decision. Spans are free. Prompts and responses are not.

What's the worst LLM observability bill surprise you've seen, and which of these three fixes would have caught it? Drop it in the comments.


If this was useful

The trade-offs in this post (sampling shape, retention tiers, payload handling, where redaction belongs in the pipeline) are exactly what my LLM Observability Pocket Guide walks through. The chapter on trace pipeline design covers the SpanProcessor patterns above plus the eval-flagging and self-consistency detectors that make sample-on-success actually targeted instead of random. Worth a read if you're picking a tracing tool or trying to make the one you have less expensive.

LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team