Your LLM Bill Is Exploding Because of Architecture, Not Pricing -- Here's the Fix
Ismail Haddo · 2026-05-23 · via DEV Community

LLM per-token prices fell between 9x and 900x over the past year. Yet most teams running agentic AI in production are seeing their API bills go up, not down. Here is exactly why, and the three code-level interventions that cut spend 60-80% without touching quality.

Why Agentic Workloads Break Your Token Budget

A chatbot interaction: 1 LLM call, ~3,000-10,000 tokens. Done.

An agentic task: plan the approach, call a tool, process results, decide next step, call another tool, validate output, loop if needed. That is 10-20 LLM calls, each carrying the growing context window from all previous steps. By step 8, you may be passing 60,000 tokens into every call -- most of it noise.

The math: agentic workflows burn 5-30x more tokens per completed task than a standard chatbot exchange. A 10x price drop combined with a 20x token increase means your bill doubled.
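To make that arithmetic concrete, here is a minimal sketch. The workload numbers (a 3,000-token prompt, 500 tokens of new context appended per step, 10 steps) are illustrative assumptions, not measurements, and both helper names are hypothetical.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, added_per_step: int) -> int:
    """Total input tokens when every call resends all context gathered so far."""
    return sum(base_tokens + i * added_per_step for i in range(steps))


def relative_bill(price_drop: float, token_multiplier: float) -> float:
    """New bill as a multiple of the old one after a price drop and a token increase."""
    return token_multiplier / price_drop


# Assumed workload: one chatbot call vs. a 10-step agent that grows its context each step.
chat_tokens = cumulative_input_tokens(steps=1, base_tokens=3_000, added_per_step=0)
agent_tokens = cumulative_input_tokens(steps=10, base_tokens=3_000, added_per_step=500)

print(agent_tokens / chat_tokens)                         # 17.5
print(relative_bill(price_drop=10, token_multiplier=20))  # 2.0
```

Under these assumptions the agent uses 17.5x the input tokens of the single chatbot call, and a 10x cheaper price with 20x the tokens still doubles the bill.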

There are three places the money leaks.


Leak 1: Context Bloat -- Fix with Compression

Most agentic pipelines append every step's output to a running context that gets passed to every subsequent LLM call. By step 6, you are paying full price to send the model information from step 1 that is no longer relevant.

Before passing context to any LLM call, compress it:

from anthropic import Anthropic
client = Anthropic()

def compress_context(conversation_history: list[dict], current_task: str,
                     token_budget: int = 20000) -> list[dict]:
    """
    Compress older turns if context exceeds budget.
    Keeps recent turns intact, summarizes older ones.
    """
    # Rough estimate: about 4 characters per token. Use a real tokenizer for exact counts.
    raw_tokens = sum(len(str(m)) // 4 for m in conversation_history)
    if raw_tokens <= token_budget:
        return conversation_history

    recent = conversation_history[-3:]
    older = conversation_history[:-3]
    if not older:
        return recent

    summary_prompt = f"""Summarize the following conversation history into 2-3 sentences,
keeping only information relevant to: {current_task}

History: {older}"""

    summary = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for summarization
        max_tokens=300,
        messages=[{"role": "user", "content": summary_prompt}]
    ).content[0].text

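    # The Messages API only accepts "user" and "assistant" roles inside `messages`,
    # so the summary travels as a user turn rather than a "system" message.
    compressed = [{"role": "user", "content": f"[Earlier context summary]: {summary}"}]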
    compressed.extend(recent)
    return compressed

# Before any LLM call in your agent loop:
context = compress_context(conversation_history, current_task="validate invoice fields")
response = client.messages.create(model="claude-sonnet-4-6", messages=context, max_tokens=1000)


This alone typically reduces context size by 50-70% in long-running agentic workflows.
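If you want to check what this policy would save on your own traces before wiring in an API call, the same keep-recent, summarize-older logic can be run offline. The sketch below is a simplified variant under stated assumptions: it uses the 4-characters-per-token heuristic, and `fake_summarize` is a hypothetical stand-in for the cheap-model call.

```python
from typing import Callable


def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return sum(len(str(m)) // 4 for m in messages)


def compress(history: list[dict], summarize: Callable[[list[dict]], str],
             token_budget: int, keep_recent: int = 3) -> list[dict]:
    """Replace all but the last `keep_recent` turns with a single summary message."""
    if estimate_tokens(history) <= token_budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"role": "user", "content": f"[Earlier context summary]: {summarize(older)}"}
    return [summary, *recent]


def fake_summarize(older: list[dict]) -> str:
    """Hypothetical stand-in for the summarization call; returns a short fixed-size note."""
    return f"{len(older)} earlier turns condensed."


# Synthetic 12-turn history with 2,000 characters per turn.
history = [{"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 2_000}
           for i in range(12)]
compressed = compress(history, fake_summarize, token_budget=1_000)
before, after = estimate_tokens(history), estimate_tokens(compressed)
print(f"{before} -> {after} estimated tokens ({1 - after / before:.0%} smaller)")
```

Swap the synthetic history for a recorded agent run to see where your own pipeline lands relative to the 50-70% figure.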


Leak 2: Frontier Model Overuse -- Fix with Model Routing

Using a frontier model for every step in your pipeline is like hiring a principal engineer to sort your email. Most agent steps -- classification, format conversion, simple lookups, routing decisions -- work fine with a small, fast, cheap model.

from enum import Enum
from dataclasses import dataclass

class TaskComplexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"

@dataclass
class ModelConfig:
    model: str
    cost_per_1k_input: float
    cost_per_1k_output: float

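# Prices are USD per 1K tokens and are illustrative; check the provider's current
# price list before relying on the computed cost figures.
MODEL_TIERS = {
    TaskComplexity.SIMPLE: ModelConfig("claude-haiku-4-5-20251001", 0.00025, 0.00125),
    TaskComplexity.MEDIUM: ModelConfig("claude-sonnet-4-6", 0.003, 0.015),
    TaskComplexity.COMPLEX: ModelConfig("claude-opus-4-6", 0.015, 0.075),
}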

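import re

def classify_task(task_description: str) -> TaskComplexity:
    """Keyword heuristic. SIMPLE keywords are checked first and take precedence."""
    simple_keywords = ["classify", "categorize", "is this", "format", "convert", "route", "label"]
    complex_keywords = ["analyze", "reason", "debug", "design", "plan", "evaluate", "compare"]
    task_lower = task_description.lower()

    def mentions(keywords: list[str]) -> bool:
        # Anchor each keyword at a word start so "format" does not fire on
        # "information" and "plan" does not fire on "explanation".
        return any(re.search(rf"\b{re.escape(kw)}", task_lower) for kw in keywords)

    if mentions(simple_keywords):
        return TaskComplexity.SIMPLE
    elif mentions(complex_keywords):
        return TaskComplexity.COMPLEX
    return TaskComplexity.MEDIUM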

def routed_llm_call(task: str, messages: list[dict]) -> tuple[str, float]:
    complexity = classify_task(task)
    config = MODEL_TIERS[complexity]
    response = client.messages.create(model=config.model, max_tokens=1000, messages=messages)
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    cost = (input_tokens / 1000 * config.cost_per_1k_input +
            output_tokens / 1000 * config.cost_per_1k_output)
    return response.content[0].text, cost


In most production pipelines, 70-80% of steps classify as SIMPLE or MEDIUM. Routing those to cheaper models cuts your average cost per task by 60-70%.
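A quick way to sanity-check that claim against your own traffic is to compute the blended price. This is a sketch, not a benchmark: the input prices are copied from MODEL_TIERS above, and the 50/25/25 traffic mix is an assumed example.

```python
def blended_cost(shares: dict[str, float], price_per_1k: dict[str, float]) -> float:
    """Average input price per 1K tokens, given the share of tokens sent to each tier."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(shares[tier] * price_per_1k[tier] for tier in shares)


# Input prices per 1K tokens, as listed in MODEL_TIERS.
prices = {"simple": 0.00025, "medium": 0.003, "complex": 0.015}

# Assumed traffic mix: 50% simple, 25% medium, 25% complex.
mix = {"simple": 0.50, "medium": 0.25, "complex": 0.25}

all_frontier = prices["complex"]
routed = blended_cost(mix, prices)
print(f"savings vs. all-frontier: {1 - routed / all_frontier:.0%}")  # 69%
```

This only covers input tokens. Output tokens follow the same pattern with the output prices, and the real saving depends on how many tokens each tier actually handles.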


Leak 3: Redundant Calls -- Fix with Semantic Caching

Your agentic system is probably making the same LLM calls repeatedly. Different phrasing, same semantic content. Standard caching misses these. Semantic caching embeds the query and retrieves cached results for near-matches.

import hashlib
from datetime import datetime, timedelta, timezone

import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92, ttl_hours: int = 24):
        self.cache: list[dict] = []
        self.threshold = similarity_threshold
        self.ttl = timedelta(hours=ttl_hours)

    def _embed(self, text: str) -> list[float]:
        # Placeholder: a deterministic pseudo-random vector per string, so only
        # identical text will match. Replace with a real embedding model in production.
        seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
        return np.random.RandomState(seed).randn(1536).tolist()

    def _cosine_similarity(self, a, b) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        query_embedding = self._embed(query)
        now = datetime.now(timezone.utc)
        # Drop expired entries so the cache does not grow without bound.
        self.cache = [e for e in self.cache if now - e["timestamp"] <= self.ttl]
        for entry in self.cache:
            if self._cosine_similarity(query_embedding, entry["embedding"]) >= self.threshold:
                return entry["response"]
        return None

    def set(self, query: str, response: str):
        self.cache.append({
            "query": query,
            "embedding": self._embed(query),
            "response": response,
            "timestamp": datetime.now(timezone.utc)
        })


Production deployments with repetitive enterprise workloads typically see 30-50% cache hit rates -- eliminating a third to half your API calls entirely.
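To see why the similarity threshold matters, it helps to play with vectors that actually respond to wording. The sketch below uses a hashed bag-of-words vector as a toy embedding. That is an assumption for illustration only: a production cache would call a real embedding model, and `toy_embed` is a hypothetical helper.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hashed bag-of-words vector, normalized to unit length (a crude stand-in)."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


stored = toy_embed("What is the shipping status of order 1182?")
reworded = toy_embed("what is the shipping status of order 1182")
different_order = toy_embed("What is the shipping status of order 9907?")
unrelated = toy_embed("Summarize the quarterly revenue report")

print("reworded:       ", round(cosine(stored, reworded), 3))
print("different order:", round(cosine(stored, different_order), 3))
print("unrelated:      ", round(cosine(stored, unrelated), 3))
```

The reworded query differs only in case and punctuation, so it matches the stored one exactly. The query about a different order shares most of its words and still scores high, which is the case to watch: set the threshold too low and the cache returns one order's answer for another.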


Putting It Together: Cost Tracking Per Step

None of this works without measurement. Add per-step cost tracking to your agent loop:

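from dataclasses import dataclass
import time

@dataclass
class AgentStep:
    name: str
    model: str
    cache_hit: bool
    cost_usd: float
    duration_ms: float

class CostAwareAgentRunner:
    def __init__(self):
        self.steps: list[AgentStep] = []
        self.cache = SemanticCache()

    def run_step(self, name: str, task: str, messages: list[dict]) -> str:
        start = time.perf_counter()
        # Key the cache on the task plus the latest message. The task string alone
        # would return one input's answer for a different input.
        cache_key = f"{task}\n{messages[-1]['content']}" if messages else task
        cached = self.cache.get(cache_key)
        if cached is not None:  # an empty string is still a valid cached response
            self.steps.append(AgentStep(name, "cache", True, 0.0, (time.perf_counter()-start)*1000))
            return cached

        response_text, cost = routed_llm_call(task, messages)
        self.cache.set(cache_key, response_text)
        # Record the model that routed_llm_call selects for this task.
        model_name = MODEL_TIERS[classify_task(task)].model
        self.steps.append(AgentStep(
            name, model_name, False, cost, (time.perf_counter()-start)*1000
        ))
        return response_text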

    def cost_report(self) -> dict:
        total = sum(s.cost_usd for s in self.steps)
        hits = sum(1 for s in self.steps if s.cache_hit)
        return {
            "total_cost_usd": round(total, 6),
            "steps": len(self.steps),
            "cache_hit_rate": hits / len(self.steps) if self.steps else 0,
            "by_step": [{"name": s.name, "cost": s.cost_usd, "model": s.model} for s in self.steps]
        }


Once you have this instrumentation, the top three steps by token consumption almost always account for 60-70% of total spend. That tells you exactly where to focus.
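Finding those top steps takes only a few lines once the report exists. The sketch below ranks by cost, since that is what the report records, rather than by raw token count. `top_n_share` is a hypothetical helper, the rows are made-up data, and only the `name` and `cost` keys of each `by_step` entry are used.

```python
def top_n_share(by_step: list[dict], n: int = 3) -> tuple[list[str], float]:
    """Return the n most expensive step names and their share of total cost."""
    totals: dict[str, float] = {}
    for step in by_step:
        # Repeated runs of the same step are summed under one name.
        totals[step["name"]] = totals.get(step["name"], 0.0) + step["cost"]
    grand_total = sum(totals.values())
    ranked = sorted(totals, key=totals.get, reverse=True)[:n]
    share = sum(totals[name] for name in ranked) / grand_total if grand_total else 0.0
    return ranked, share


# Made-up rows in the shape of cost_report()["by_step"].
rows = [
    {"name": "plan", "cost": 0.040},
    {"name": "extract", "cost": 0.004},
    {"name": "validate", "cost": 0.020},
    {"name": "plan", "cost": 0.030},
    {"name": "format", "cost": 0.001},
    {"name": "lookup", "cost": 0.005},
]
names, share = top_n_share(rows)
print(names, f"{share:.0%}")  # ['plan', 'validate', 'lookup'] 95%
```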

One logistics client went from $40K/month in LLM API costs to under $12K after adding model routing, semantic caching, and context compression. Same volume, same quality. The frontier model also performed better on complex steps because it was receiving cleaner, more focused context.


If you are hitting this in production and want a second set of eyes, feel free to DM me -- happy to dig in.