Feeding Raw HTML to Your LLM Is a Token Tax. I Measured It on 10 Real Pages: Median 7.4×, and It Hits Every Scheduled Run
Alex Spinov · 2026-05-29 · via DEV Community

There's a whole wave of posts right now telling you the same thing: don't feed raw HTML to your LLM, convert it to markdown first, it's more token-efficient. AlterLab has five of them. There's a popular one called "Raw HTML is where LLM context goes to die." A blog called searchcans says converting HTML to markdown cuts tokens by "20-30%."

I read most of them. Here's what bugged me.

Not one shows a number you can reproduce. "More efficient" — how much more? "Our benchmarks show" — which benchmark, which pages, which tokenizer? searchcans says 20-30%. The "context goes to die" post is a vibe, not a measurement. AlterLab's posts say markdown wins and then sell you a managed extraction API. None of them hand you a script and say run it on your own pages and see.

So I did the boring thing. I wrote a 30-line meter, pointed it at ten real public pages, and counted tokens. Raw HTML versus the text you actually wanted. The median multiplier was 7.4×. The spread was wild — from 1.1× on a stripped-down test page to 47.8× on a news homepage. And the part nobody talks about: that multiplier doesn't hit you once. It hits every single scheduled run.

I've got 2,190 production scraper runs behind me. So this isn't an abstraction for me — a multiplier on a recurring job is a recurring bill.

Let me show you the whole thing. Code, numbers, the math, and the one place where "just use markdown" is only half true.

The meter (copy it, run it on your pages)

No magic. requests to fetch, BeautifulSoup to pull text out, tiktoken to count. The only thing that matters is that you count tokens with the same encoder for both sides, so the ratio is honest.

#!/usr/bin/env python3
# token-cost meter: raw HTML vs extracted text, in tokens.
# stdlib + requests + bs4 + tiktoken. cl100k_base (GPT-4-era tokenizer; GPT-4o uses o200k_base).
import statistics
import requests
from bs4 import BeautifulSoup
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

URLS = [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://docs.python.org/3/library/json.html",
    "https://www.rfc-editor.org/rfc/rfc9110.html",
    "https://news.ycombinator.com/",
    "https://www.gnu.org/licenses/gpl-3.0.en.html",
    "https://httpbin.org/html",
    "https://blog.python.org/",
    "https://developer.mozilla.org/en-US/docs/Web/HTTP/Status",
    "https://www.bbc.com/news",
    "https://quotes.toscrape.com/",
]
H = {"User-Agent": "token-cost-meter/1.0 (+research; bs4+tiktoken)"}

def n(s):  # tokens
    return len(enc.encode(s))

rows = []
print(f"{'url':<50} {'tok_raw':>9} {'tok_text':>9} {'mult':>6}")
for url in URLS:
    r = requests.get(url, headers=H, timeout=20)
    if r.status_code != 200 or not r.text:
        print(f"{url[:48]:<50} {'SKIP — ' + str(r.status_code):>26}")
        continue
    raw = r.text
    text = BeautifulSoup(raw, "html.parser").get_text(" ", strip=True)
    t_raw, t_txt = n(raw), max(1, n(text))
    rows.append((t_raw, t_txt, t_raw / t_txt))
    print(f"{url[:48]:<50} {t_raw:>9} {t_txt:>9} {t_raw/t_txt:>6.1f}")

mults = [m for *_, m in rows]
print(f"\nmedian multiplier: {statistics.median(mults):.1f}x")
print(f"min / max:         {min(mults):.1f}x / {max(mults):.1f}x")

That's it. Thirty-odd lines. You can swap in your own URLs and have your own number in five minutes, which is more than any of the "markdown beats HTML" posts gave you.

What it actually said

Here's the raw output from my run. Every one of these ten URLs returned HTTP 200 the day I ran it (2026-05-28); none were skipped.

url                                                  tok_raw  tok_text   mult
https://en.wikipedia.org/wiki/Web_scraping             48975      6658    7.4
https://docs.python.org/3/library/json.html            32245      6271    5.1
https://www.rfc-editor.org/rfc/rfc9110.html           378840    110723    3.4
https://news.ycombinator.com/                          11794      1150   10.3
https://www.gnu.org/licenses/gpl-3.0.en.html           12502      7882    1.6
https://httpbin.org/html                                 886       836    1.1
https://blog.python.org/                               11279       465   24.3
https://developer.mozilla.org/.../HTTP/Status          62411      4649   13.4
https://www.bbc.com/news                              112721      2356   47.8
https://quotes.toscrape.com/                            2880       388    7.4

median multiplier: 7.4x
min / max:         1.1x / 47.8x

Median 7.4×. So the headline number I'll stand behind is: feeding raw HTML to a model costs you, on a typical page, around seven times what feeding the extracted text would.

But honestly, the median is the least interesting line in that table. Look at the spread.

The httpbin.org/html page is 1.1×. It's a deliberately minimal test page — almost all text, almost no markup. The wikipedia article is 7.4×, right on the median. And then the BBC news homepage is 47.8×. Forty-seven times. That page is 112,721 tokens of HTML wrapping 2,356 tokens of text. Ninety-eight percent of what you'd send the model is scaffolding — scripts, inline JSON state blobs, tracking, navigation, the same boilerplate stamped around every card.
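
Those figures can be re-derived from the table with nothing but the standard library. A minimal sketch, with the token counts copied from the run above; the `ROWS` name, the short page labels, and the two helper functions are mine, for illustration only:

```python
import statistics

# (page label, raw-HTML tokens, extracted-text tokens), copied from the table above.
ROWS = [
    ("wikipedia", 48975, 6658),
    ("python-docs", 32245, 6271),
    ("rfc9110", 378840, 110723),
    ("hackernews", 11794, 1150),
    ("gpl-3.0", 12502, 7882),
    ("httpbin", 886, 836),
    ("python-blog", 11279, 465),
    ("mdn-status", 62411, 4649),
    ("bbc-news", 112721, 2356),
    ("quotes", 2880, 388),
]


def multiplier(raw, text):
    """Raw-HTML tokens paid per token of extracted text."""
    return raw / text


def scaffolding_share(raw, text):
    """Fraction of the raw-HTML tokens that is not extracted text."""
    return 1 - text / raw


mults = {name: multiplier(raw, text) for name, raw, text in ROWS}
median = statistics.median(mults.values())

print(f"median multiplier: {median:.1f}x")
for name, raw, text in ROWS:
    share = scaffolding_share(raw, text)
    print(f"{name:<12} {mults[name]:>5.1f}x  {share:>6.1%} markup")
```

Swap in the rows from your own run and the spread is visible at a glance, which matters more than the single median.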

So when someone says "HTML is roughly N× more tokens," the honest answer is: N depends entirely on what you're scraping. A clean documentation page? Maybe 3–5×. A modern, ad-heavy, JavaScript-state-dumped news or e-commerce page? You're easily in double digits. Don't trust a single number — and especially don't trust mine. Run the meter on the actual pages you scrape. That's the whole point.

One caveat I want to be upfront about, because it's the kind of thing the vendor posts gloss over. I counted with cl100k_base, the GPT-4-era tokenizer. GPT-4o and newer models use a different encoder (o200k_base for GPT-4o). Does that move the multiplier? Barely, because the multiplier is a ratio, and both the numerator (HTML) and the denominator (text) get encoded by the same tokenizer. Switching encoders shifts both sides together. The absolute token counts move; the ratio mostly doesn't. I'd put it at ±15% across tokenizers, but I'll be straight with you: I couldn't load o200k_base in my environment to prove that side by side, so treat the ±15% as my estimate, not a measurement.
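
For anyone who can load both encoders, the side-by-side check is short. A minimal sketch, not a measurement: `multiplier_by_counter`, `spread`, and `tiktoken_counter` are hypothetical helper names, and the dry run at the bottom uses toy counters (characters and whitespace-separated words) only so the snippet runs without downloading any encoder. Its numbers say nothing about real tokenizers.

```python
from typing import Callable, Dict

Counter = Callable[[str], int]


def multiplier_by_counter(raw_html: str, text: str,
                          counters: Dict[str, Counter]) -> Dict[str, float]:
    """Raw/text ratio under each counter; max(1, ...) avoids dividing by zero."""
    return {name: count(raw_html) / max(1, count(text))
            for name, count in counters.items()}


def spread(ratios: Dict[str, float]) -> float:
    """Relative gap between the largest and smallest ratio (0.15 means 15%)."""
    lo, hi = min(ratios.values()), max(ratios.values())
    return (hi - lo) / lo


def tiktoken_counter(encoding_name: str) -> Counter:
    """Token counter backed by a tiktoken encoding (third-party, imported lazily)."""
    import tiktoken
    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats strings like "<|endoftext|>" in a page as plain text
    # instead of raising an error.
    return lambda s: len(enc.encode(s, disallowed_special=()))


# Dry run with toy counters (assumptions, NOT real tokenizers).
TOY_COUNTERS: Dict[str, Counter] = {
    "chars": len,
    "words": lambda s: len(s.split()),
}

raw = "<div class='card'><p>Hello world, this is the text.</p></div>"
text = "Hello world, this is the text."
ratios = multiplier_by_counter(raw, text, TOY_COUNTERS)
for name, ratio in ratios.items():
    print(f"{name:<6} {ratio:.2f}x")
print(f"spread between counters: {spread(ratios):.0%}")
```

To run the real comparison, pass `{"cl100k_base": tiktoken_counter("cl100k_base"), "o200k_base": tiktoken_counter("o200k_base")}` as `counters`, with `raw` and `text` taken from the meter, and compare the resulting `spread` against the ±15% estimate.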

The part the "use markdown" crowd skips

Here's where I think the whole wave gets it half-right and stops.

The pitch is "convert HTML to markdown and your tokens shrink." True. But it quietly implies the hard part is the format. It isn't. The hard part is what you keep.

BeautifulSoup.get_text() already strips every tag. So in a sense it's already doing what markdown extraction does for token count — tags gone, text out. But that text still drags along the menu, the cookie banner copy, the footer links, the "related articles" rail, the newsletter nag. All of it is real text. All of it costs tokens. None of it is the thing you scraped the page for.

So I ran a second measurement. Same pages, but this time I dropped the obvious boilerplate tags (script, style, nav, header, footer, aside, noscript, svg, form) before extracting text.

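from bs4 import BeautifulSoup

# Tags whose contents are almost never what you scraped the page for.
BOILER = ["script", "style", "nav", "header",
          "footer", "aside", "noscript", "svg", "form"]
soup = BeautifulSoup(raw, "html.parser")  # raw: the HTML string fetched by the meter
for tag in soup(BOILER):   # soup(...) is shorthand for soup.find_all(...)
    tag.decompose()        # remove the tag and everything inside it
clean = soup.get_text(" ", strip=True)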

The result:

page                         get_text()   + drop boilerplate
wikipedia /Web_scraping       6,658 tok    6,234 tok   (-6%)
bbc.com/news                  2,356 tok    1,607 tok   (-32%)
mozilla /HTTP/Status          4,649 tok    2,703 tok   (-42%)

On the MDN docs page, naive text extraction left 42% extra tokens on the table. Forty-two percent — after I'd already thrown away the HTML. The boilerplate that survives the tags is the expensive part. On the BBC homepage, another 32% gone.

That's the bit the markdown posts don't mention, because it complicates the sales pitch. "Just convert to markdown" gets you the 7× win. Getting the rest is about deciding what counts as content on your pages, and that's site-specific work no managed API does perfectly for you. The format was never the hard problem. The judgment is.
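
To see the "what you keep" effect without installing anything, here is a rough sketch that swaps BeautifulSoup for the standard library's `html.parser`. It is not the code behind the table: `visible_text`, `NAIVE`, `SAMPLE`, and the word-count stand-in for tokens are all assumptions made for illustration, and real pages will give different percentages.

```python
from html.parser import HTMLParser

# Same tag list as the BOILER snippet earlier in this section.
BOILER = frozenset({"script", "style", "nav", "header",
                    "footer", "aside", "noscript", "svg", "form"})
# "Naive" baseline: strip tags, skipping only script/style bodies (code, not text).
NAIVE = frozenset({"script", "style"})


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring everything inside the dropped tags."""

    def __init__(self, drop):
        super().__init__(convert_charrefs=True)
        self.drop = drop
        self.depth = 0      # > 0 while inside at least one dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html, drop):
    collector = _TextCollector(drop)
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def words(s):
    # Crude stand-in for a token count; use a real tokenizer for real numbers.
    return len(s.split())


SAMPLE = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a> <a href='/about'>About</a></nav>"
    "<main><h1>Title</h1><p>Body text here.</p></main>"
    "<footer>Copyright notice</footer>"
    "<script>var x = 1;</script></body></html>"
)

naive = visible_text(SAMPLE, NAIVE)
clean = visible_text(SAMPLE, BOILER)
saved = 1 - words(clean) / words(naive)
print("naive:", naive)
print("clean:", clean)
print(f"dropped by the boilerplate filter: {saved:.0%}")
```

One limitation of this sketch: an unclosed `<nav>` or `<footer>` would swallow the rest of the page, because the depth counter never returns to zero. BeautifulSoup is more forgiving with messy markup, which is a fair reason to keep it in the real pipeline.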

Now multiply it by every run

Here's why I care more than someone running this once as a curiosity.

A multiplier on a single page is a rounding error. You think: it's a few thousand tokens, who cares. That's exactly what I'd have thought before I'd watched a scraper run on a schedule for a year.

I've put 2,190 runs through 32 published actors on Apify — that's a raw lifetime counter off my own dashboard (apify.com/knotless_cadence, as of May 2026, not a sample). My Trustpilot review scraper alone is at 962 runs. The thing about production scraping that nobody warns you about: the work isn't the first run. It's the same job firing again tomorrow, and the day after, and the day after that. A scheduled scraper is a tax that compounds.

So let's do the arithmetic with the median 7.4× and real prices.

Take a modest pipeline: 500 pages per run, one run a day. Say the extracted text is about 4,000 tokens per page (a round figure near the roughly 3,500-token median text size in my run). At 7.4×, the raw HTML is ~29,600 tokens, so you're shipping ~25,600 wasted tokens per page if you feed HTML instead of text.

  • 25,600 wasted tokens/page × 500 pages × 30 runs = 384 million wasted input tokens a month.

Now price it. From OpenAI's API pricing page (developers.openai.com/api/docs/pricing, retrieved 2026-05-28 — note GPT-4o is no longer listed there; the current lineup is the GPT-5.x family), input token prices are: gpt-5.4 at $2.50 / 1M, gpt-5.4-mini at $0.75 / 1M, gpt-5.4-nano at $0.20 / 1M. (Prices change — re-check before you quote them.)

  • gpt-5.4: 384M × $2.50/1M = $960/month burned on markup = ~$11,500/year.
  • gpt-5.4-mini: $288/month = ~$3,460/year.
  • gpt-5.4-nano: $77/month = ~$922/year.

And to be clear about what I'm not claiming: this is the token cost only. I'm not telling you what a scraper run costs me in compute or storage — I'm not going to make up infrastructure numbers I can't back. The multiplier is measured. The prices are public. The pages-per-run and runs-per-month are yours to plug in. I gave you a calculator, not a verdict on your bill.
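
That plug-in-your-own-numbers arithmetic fits in a few lines. A minimal sketch under the assumptions used in this section (4,000 text tokens per page, 7.4× multiplier, 500 pages per run, 30 runs a month); the `Pipeline` class and `cost_usd` helper are hypothetical names, and the prices are the ones quoted above, which may have changed since:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    text_tokens_per_page: int   # tokens of text you actually want per page
    multiplier: float           # raw-HTML tokens divided by text tokens
    pages_per_run: int
    runs_per_month: int

    def wasted_tokens_per_page(self) -> float:
        # Raw HTML costs multiplier x text; everything above 1x is markup.
        return self.text_tokens_per_page * (self.multiplier - 1)

    def wasted_tokens_per_month(self) -> float:
        return (self.wasted_tokens_per_page()
                * self.pages_per_run * self.runs_per_month)


def cost_usd(tokens: float, usd_per_million: float) -> float:
    return tokens / 1_000_000 * usd_per_million


# Input prices per 1M tokens as quoted in this post (retrieved 2026-05-28).
# Re-check them before relying on the output.
PRICES = {"gpt-5.4": 2.50, "gpt-5.4-mini": 0.75, "gpt-5.4-nano": 0.20}

pipeline = Pipeline(text_tokens_per_page=4_000, multiplier=7.4,
                    pages_per_run=500, runs_per_month=30)
monthly = pipeline.wasted_tokens_per_month()

print(f"wasted tokens/page:  {pipeline.wasted_tokens_per_page():,.0f}")
print(f"wasted tokens/month: {monthly:,.0f}")
for model, price in PRICES.items():
    per_month = cost_usd(monthly, price)
    print(f"{model:<13} ${per_month:>7,.0f}/month  ${per_month * 12:>8,.0f}/year")
```

Change the four `Pipeline` fields to your measured multiplier and your real schedule; the token cost is the only thing this estimates.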

But notice the shape of it. The per-page waste — $0.06 on gpt-5.4 — feels like nothing. Multiply by your real volume and your real schedule and it's a salary line. That's the gap between "I ran a token counter once" and "I run scrapers in production." A one-time multiplier is trivia. A multiplier on a cron schedule is a budget.

What I'd actually do

Not a framework, not a managed API. Three habits.

Strip before you tokenize, not after. get_text() on raw HTML is the lazy default and it leaves 6–42% on the table depending on the page. Drop script/style/nav/footer first. It's four lines.

Measure your own pages, not someone's blog claim. My median is 7.4×; yours could be 3× or 30× depending on whether you scrape clean docs or bloated SPAs. The meter is right there. Spend the five minutes.

Watch the schedule, not the page. The decision that matters isn't "HTML or markdown for this one page." It's "what am I sending the model, times how many pages, times how many runs." Optimize the thing that repeats.

I ran the meter expecting a tidy 10× to confirm the headline I had in my head. I got a 7.4× median and a 47.8× outlier that taught me more than the median did. The number isn't the lesson. The spread is — and the fact that whatever your number is, you pay it again every night.

Written by Alexey Spinov. I run production scrapers — 2,190 lifetime runs across 32 actors, Trustpilot scraper at 962 of them (apify.com/knotless_cadence). The meter and both measurements in this post were run live on 2026-05-28; every URL returned 200, and the output blocks are copied straight from the run, not cleaned up.

Disclosure: I drafted this with help from an AI writing assistant. The code, the measurements, and the numbers are mine — run on real pages, not generated.

Paying an LLM bill that's mostly HTML markup, or running a scraper on a schedule that's quietly compounding? I've put 2,190 runs through production and I count the tokens before they hit the model. If you want a scraping-to-LLM pipeline that ships clean text instead of scaffolding, tell me what you're scraping and I'll scope it — spinov001@gmail.com.