Your recurring scraper is re-downloading data that didn't change. Here's the 15-line fix (conditional GET)
Alex Spinov · 2026-05-26 · via DEV Community

Note: This is a cross-post. Canonical version (full long-form) lives on my blog: https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/

TL;DR

The "ethical scraping" debate keeps arguing about robots.txt and ToS. Those are real, but they're decisions you make once, before the first request. They tell you nothing about run 200, 600, or 900 — and that's where you actually load someone's server and where you actually get banned. (Good prompt for this post: Federico Trotta's "How to Scrape Open-Source Datasets Ethically" on The Web Scraping Club, May 24, 2026 — his line that a scraper "that would barely register as noise on Amazon's servers could genuinely degrade performance for a public data portal" is the part the robots.txt debate keeps skipping.)

After 2,190 production scrapes across 32 scrapers (the busiest, a Trustpilot review scraper, has 962 runs on its own), I'm convinced of one thing: on a real schedule, "polite to the source" and "doesn't get banned" stop being two questions and become one. And the answer is mostly conditional GET plus a sane rate limit — not a robots.txt checkbox.

Where those numbers come from: my own Apify dashboard (apify.com/knotless_cadence), as of May 2026. 2,190 = total runs summed across my 32 published actors; 962 = the Trustpilot scraper's own lifetime counter. Raw platform numbers, not sampled or extrapolated.

This is the practical, code-first version. The long-form reasoning (and what 962 runs against one site actually taught me) is on the canonical post above.

The mechanism most scrapers skip: conditional GET

It's not a hack. It's in the HTTP standard: RFC 9110 §13 (Conditional Requests), which replaced the older, focused RFC 7232. A server that emits validators will tell you whether a page changed before sending the body, for free, if you ask right:

  • Server sends ETag and/or Last-Modified on the response.
  • You send them back as If-None-Match / If-Modified-Since on the next request.
  • Nothing changed → server replies 304 Not Modified with an empty body. You skip parsing. The source barely does any work.

A 304 is the most considerate response you can get: you confirmed there's no new data without making the server render and ship a page you already have. You also stop feeding duplicate rows into your pipeline.
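The exchange above can be sketched without any network at all. This is a toy model, not a real server: `toy_server` and `conditional_headers` are hypothetical names introduced here, and a real server also handles weak validators, lists of ETags, and `If-Modified-Since`, which this sketch ignores.

```python
def conditional_headers(validators):
    """Build request headers from validators saved on a previous response."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers


def toy_server(request_headers, current_etag, body):
    """Toy stand-in for a server: 304 and no body if the ETag still matches."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag}, body


# First fetch: nothing cached, so no conditional headers are sent.
status1, resp_headers, payload1 = toy_server(
    conditional_headers({}), '"v1"', b"<html>page</html>")
saved = {"etag": resp_headers["ETag"]}

# Second fetch: same content, the validator matches, the body is skipped.
status2, _, payload2 = toy_server(
    conditional_headers(saved), '"v1"', b"<html>page</html>")

# Third fetch: the page changed, so the saved ETag no longer matches.
status3, _, payload3 = toy_server(
    conditional_headers(saved), '"v2"', b"<html>new</html>")

print(status1, status2, status3)  # 200 304 200
```

The only state the client carries between runs is the validator itself, which is why a small on-disk cache is enough.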

The fetcher (runnable, ~15 lines of logic)

Plain httpx. Persists its cache to disk so it survives across runs. Throttles itself so it doesn't hammer one host. requests works identically — same header names, same 304.

import time
import json
import os
import hashlib
import httpx


class PoliteFetcher:
    """Conditional-GET fetcher.

    Stores each URL's ETag / Last-Modified, sends them back as
    If-None-Match / If-Modified-Since on the next fetch, and sleeps
    `min_interval` seconds between hits to keep load on the source low.

    A 304 response means: nothing changed, no body sent, skip parsing.
    """

    def __init__(self, cache_path="cache.json", min_interval=1.0,
                 user_agent="polite-scraper/1.0 (+you@example.com)"):
        self.cache_path = cache_path
        self.min_interval = min_interval
        self.user_agent = user_agent
        self._last_hit = 0.0
        self.cache = {}
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                self.cache = json.load(f)

    def _throttle(self):
        wait = self.min_interval - (time.monotonic() - self._last_hit)
        if wait > 0:
            time.sleep(wait)
        self._last_hit = time.monotonic()

    def get(self, url):
        meta = self.cache.get(url, {})
        headers = {"User-Agent": self.user_agent}
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]

        self._throttle()
        r = httpx.get(url, headers=headers, timeout=20)

        if r.status_code == 304:
            # No new data. The server did almost no work. Reuse what we have.
            return {"status": 304, "changed": False,
                    "body_hash": meta.get("body_hash")}

        if r.status_code == 200:
            body_hash = hashlib.sha256(r.content).hexdigest()
            self.cache[url] = {
                "etag": r.headers.get("etag"),
                "last_modified": r.headers.get("last-modified"),
                "body_hash": body_hash,
            }
            with open(self.cache_path, "w") as f:
                json.dump(self.cache, f)
            return {"status": 200, "changed": True,
                    "body_hash": body_hash, "content": r.content}

        # Anything else (3xx, 4xx, 5xx): let the caller decide on retry/backoff.
        # Note: httpx does not follow redirects unless follow_redirects=True.
        return {"status": r.status_code, "changed": None, "body_hash": None}


Verify it in 5 minutes

httpbingo.org has an /etag/{tag} endpoint that hands back an ETag and honors If-None-Match:

f = PoliteFetcher(min_interval=0.5)
url = "https://httpbingo.org/etag/demo123"

print(f.get(url)["status"])   # 200  -> first time, full download
print(f.get(url)["status"])   # 304  -> server says "you already have it"
print(f.get(url)["status"])   # 304  -> still nothing new


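Output when I ran it (shown here as the full result dict, minus content, rather than the bare status the snippet above prints):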

run 1: {'status': 200, 'changed': True,  'body_hash': '<your-hash>'}
run 2: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}
run 3: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}


Your body_hash will differ — httpbingo echoes your request headers (User-Agent, timestamps) into the body, so the hex is yours, not mine. What's reproducible is the status sequence 200 → 304 → 304, not the hash.
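The fetcher stores body_hash but reports changed=True on every 200, even when the bytes are identical to last time. For sources that send no ETag or Last-Modified, one possible fallback is to compare the new hash against the stored one. This is a sketch under that assumption, not part of the fetcher above, and `content_changed` is a hypothetical helper name.

```python
import hashlib


def content_changed(previous_hash, content):
    """Return (changed, new_hash) by comparing SHA-256 digests of the body."""
    new_hash = hashlib.sha256(content).hexdigest()
    return new_hash != previous_hash, new_hash


# No previous hash on the first run, so the content counts as changed.
changed_first, h1 = content_changed(None, b"row-1,row-2")
# Same bytes on the next run: nothing to parse or insert.
changed_again, h2 = content_changed(h1, b"row-1,row-2")
# New bytes: parse and store the new hash.
changed_later, h3 = content_changed(h2, b"row-1,row-2,row-3")

print(changed_first, changed_again, changed_later)  # True False True
```

Unlike a 304, this saves the source nothing: the full body was still rendered and shipped. It only keeps duplicate rows out of your pipeline. It also misfires on pages that embed per-request data in the body, which is exactly why the httpbingo hash above differs from run to run.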

The other half: rate limit as courtesy, not config

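The _throttle() above is deliberately dumb: one fixed delay between consecutive requests, with no per-host bookkeeping, so it is a per-host delay only when a run targets a single host. You usually don't need clever. You need a delay a human reading the access log wouldn't flinch at. Three rules I actually follow: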

  • One host at a time, or close to it. Concurrency across different domains is fine. Twenty workers on one domain is the anomaly that gets a rule written about you. My longest-surviving runs were low-concurrency-per-host. Boring wins.
  • Honor 429 / Retry-After. That header is the source literally telling you the polite interval. Ignoring it escalates a soft throttle into a hard ban.
  • Spread scheduled runs out. It's a cron job — spreading the budget over an hour costs you nothing and flattens the load spike on their side.

None of these live in robots.txt. The ethical rate limit lives in your code.
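The three rules can be sketched as small standard-library helpers. Everything here is illustrative and not from the fetcher above: `HostThrottle`, `retry_after_seconds`, and `spread_offset` are hypothetical names, and the one-hour window is an assumed budget.

```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit


class HostThrottle:
    """Track the last hit per hostname so each host gets its own delay."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_hit = {}

    def wait_time(self, url, now=None):
        """Seconds to sleep before hitting this URL's host (0.0 if none)."""
        now = time.monotonic() if now is None else now
        last = self._last_hit.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - last))

    def mark(self, url, now=None):
        """Record that this URL's host was just hit."""
        now = time.monotonic() if now is None else now
        self._last_hit[urlsplit(url).hostname] = now


def retry_after_seconds(value, now=None):
    """Parse Retry-After (delta-seconds or HTTP-date) into seconds.

    Returns None when the header is missing or unparseable.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def spread_offset(window_seconds=3600.0):
    """Random start offset so scheduled runs don't all land on the same second."""
    return random.uniform(0.0, window_seconds)


throttle = HostThrottle(min_interval=2.0)
throttle.mark("https://example.com/a", now=100.0)
print(throttle.wait_time("https://example.com/b", now=100.5))  # 1.5
print(throttle.wait_time("https://example.org/a", now=100.5))  # 0.0
print(retry_after_seconds("120"))                              # 120.0
```

A caller would sleep for `wait_time(url)` before each request, call `mark(url)` after it, and on a 429 sleep for at least `retry_after_seconds(r.headers.get("retry-after"))` when that is not None.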

Honesty about "which sources stay up"

I can't hand you a ranked uptime table of named sites — I don't have clean enough per-source numbers to publish one without inventing it, and inventing numbers is the fastest way to make a scraping post worthless. What I can say from 2,190 runs: the sources that kept working were the ones where my scraper behaved like a considerate guest (conditional GET, a delay, an honest User-Agent). The ones I lost were usually the ones where I got greedy with concurrency or skipped conditional GET because "it's just a few thousand pages."

I know that last one from getting it wrong. The first version of one of my recurring scrapers had no conditional-GET layer. I skipped it thinking "it's a couple thousand pages, I'll add caching later." Around run 200 (rough memory, not a logged number) it started getting throttled in ways it hadn't been before. I blamed the site for a week. Then I added the ETag / If-None-Match layer, the per-run request count dropped, and the throttling stopped. The bug was me.

That's a correlation, not a controlled experiment. Some of those lost-access incidents were probably the site changing its own defenses, nothing to do with me — and I can't cleanly separate those out, so I won't pretend the politeness caused the uptime. I'm not going to inflate it into an industry trend with a percentage either. But the direction isn't subtle: politeness and persistence track together. The scraper that's kind to the source is the one still running next quarter.


Full long-form (the reasoning, the 962-runs story, the Monday checklist): https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/

I've run 2,190 production scrapes across 32 scrapers (profile: https://apify.com/knotless_cadence). If you need a recurring scraper that stays up instead of getting throttled on run 200, I build those — tell me the source and the schedule: spinov001@gmail.com.

Drafted with AI assistance, edited and fact-checked by me.