Your recurring scraper is re-downloading data that didn't change. Here's the 15-line fix (conditional GET)

Note: This is a cross-post. Canonical version (full long-form) lives on my blog: https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/

TL;DR

The "ethical scraping" debate keeps arguing about robots.txt and ToS. Those are real, but they're decisions you make once, before the first request. They tell you nothing about run 200, 600, or 900 — and that's where you actually load someone's server and where you actually get banned. (Good prompt for this post: Federico Trotta's "How to Scrape Open-Source Datasets Ethically" on The Web Scraping Club, May 24, 2026 — his line that a scraper "that would barely register as noise on Amazon's servers could genuinely degrade performance for a public data portal" is the part the robots.txt debate keeps skipping.)

After 2,190 production scrapes across 32 scrapers (the busiest, a Trustpilot review scraper, has 962 runs on its own), I'm convinced of one thing: on a real schedule, "polite to the source" and "doesn't get banned" stop being two questions and become one. And the answer is mostly conditional GET plus a sane rate limit — not a robots.txt checkbox.

Where those numbers come from: my own Apify dashboard (apify.com/knotless_cadence), as of May 2026. 2,190 = total runs summed across my 32 published actors; 962 = the Trustpilot scraper's own lifetime counter. Raw platform numbers, not sampled or extrapolated.

This is the practical, code-first version. The long-form reasoning (and what 962 runs against one site actually taught me) is in the canonical post linked above.

The mechanism most scrapers skip: conditional GET

It's not a hack. It's in the HTTP standard (RFC 9110 §13, which obsoleted the older, narrower RFC 7232, "Conditional Requests"). Most servers will tell you whether a page changed before sending the body, for free, if you ask right:

1. The server sends ETag and/or Last-Modified on the response.
2. You send them back as If-None-Match / If-Modified-Since on the next request.
3. Nothing changed → the server replies 304 Not Modified with no body. You skip parsing. The source barely does any work.

A 304 is the most considerate response you can get: you confirmed there's no new data without making the server render and ship a page you already have. You also stop feeding duplicate rows into your pipeline.
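
To make the exchange concrete, here is a minimal sketch of the decision from the server's side, following the precedence in RFC 9110 §13 (If-None-Match is checked first, and If-Modified-Since only when it is absent). The function name decide and its arguments are hypothetical, the header dict is assumed to use canonical capitalization, and the comma split is a simplification. This is an illustration of the rule, not how any particular server implements it.

```python
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop a W/ prefix: If-None-Match uses weak comparison."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def decide(request_headers, etag, last_modified):
    """Return 304 or 200 for a GET, the way a validator-aware server might.

    etag: the resource's current entity tag, quotes included, e.g. '"v1"'.
    last_modified: timezone-aware datetime of the last change.
    """
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        # If-None-Match takes precedence; If-Modified-Since is then ignored.
        # Simplification: splitting on commas assumes no comma inside a tag.
        candidates = [_strip_weak(t) for t in inm.split(",")]
        if "*" in candidates or _strip_weak(etag) in candidates:
            return 304
        return 200

    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        try:
            since = parsedate_to_datetime(ims)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304
        except (TypeError, ValueError):
            pass  # unparseable or naive date: ignore the condition
    return 200
```

The point of the sketch is the cost asymmetry: both branches that end in 304 compare a short string or a timestamp and never touch the body.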

The fetcher (runnable, ~15 lines of logic)

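Plain httpx. Persists its cache to disk so it survives across runs. Throttles itself so it doesn't hammer one host. requests works the same way here (same header names, same 304); the one difference to watch is that requests follows redirects by default and httpx does not.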

import time
import json
import os
import hashlib
import httpx


class PoliteFetcher:
    """Conditional-GET fetcher.

    Stores each URL's ETag / Last-Modified, sends them back as
    If-None-Match / If-Modified-Since on the next fetch, and sleeps
    `min_interval` seconds between hits to keep load on the source low.

    A 304 response means: nothing changed, no body sent, skip parsing.
    """

    def __init__(self, cache_path="cache.json", min_interval=1.0,
                 user_agent="polite-scraper/1.0 (+you@example.com)"):
        self.cache_path = cache_path
        self.min_interval = min_interval
        self.user_agent = user_agent
        self._last_hit = 0.0
        self.cache = {}
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                self.cache = json.load(f)

    def _throttle(self):
        wait = self.min_interval - (time.monotonic() - self._last_hit)
        if wait > 0:
            time.sleep(wait)
        self._last_hit = time.monotonic()

    def get(self, url):
        meta = self.cache.get(url, {})
        headers = {"User-Agent": self.user_agent}
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]

        self._throttle()
        r = httpx.get(url, headers=headers, timeout=20)

        if r.status_code == 304:
            # No new data. The server did almost no work. Reuse what we have.
            return {"status": 304, "changed": False,
                    "body_hash": meta.get("body_hash")}

        if r.status_code == 200:
            body_hash = hashlib.sha256(r.content).hexdigest()
            self.cache[url] = {
                "etag": r.headers.get("etag"),
                "last_modified": r.headers.get("last-modified"),
                "body_hash": body_hash,
            }
            with open(self.cache_path, "w") as f:
                json.dump(self.cache, f)
            return {"status": 200, "changed": True,
                    "body_hash": body_hash, "content": r.content}

        # Anything else (4xx, 5xx, or a 3xx redirect, which httpx does not
        # follow by default): let the caller decide on retry/backoff.
        return {"status": r.status_code, "changed": None, "body_hash": None}
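
One way to wire a fetcher like this into a pipeline, as a sketch rather than the author's actual setup: parse only when something new arrived. The names run, changed_since, fetch, parse, and hashes are hypothetical. fetch is assumed to return the same dict shape as PoliteFetcher.get, and hashes is a separate url-to-hash dict the caller owns. The hash comparison is an added fallback for sources that send no validators and so answer 200 with a body you already have.

```python
def changed_since(previous_hash, result):
    """True only when the fetch returned a body we have not stored before."""
    if result["status"] != 200:
        # 304 means unchanged; errors carry nothing new to parse.
        return False
    return result["body_hash"] != previous_hash


def run(urls, fetch, parse, hashes):
    """Fetch every URL and parse only the bodies that actually changed.

    fetch(url) -> dict shaped like PoliteFetcher.get's return value.
    parse(content_bytes) -> iterable of rows.
    hashes: dict of url -> last parsed body hash, updated in place.
    """
    rows = []
    for url in urls:
        result = fetch(url)
        if changed_since(hashes.get(url), result):
            rows.extend(parse(result["content"]))
            hashes[url] = result["body_hash"]
    return rows
```

With the fetcher above this would be called as run(urls, PoliteFetcher().get, my_parse, hashes), where my_parse is whatever parser you already have.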

Verify it in 5 minutes

httpbingo.org has an /etag/{tag} endpoint that hands back an ETag and honors If-None-Match:

f = PoliteFetcher(min_interval=0.5)
url = "https://httpbingo.org/etag/demo123"

print(f.get(url)["status"])   # 200  -> first time, full download
print(f.get(url)["status"])   # 304  -> server says "you already have it"
print(f.get(url)["status"])   # 304  -> still nothing new

Output when I ran it, printing each full result dict (minus content) rather than just the status:

run 1: {'status': 200, 'changed': True,  'body_hash': '<your-hash>'}
run 2: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}
run 3: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}

Your body_hash will differ: httpbingo echoes your request headers (User-Agent, timestamps) into the body, so the hex is yours, not mine. What's reproducible is the status sequence 200 → 304 → 304, not the hash, and only on a fresh cache: once cache.json exists from an earlier run, the first call comes back 304 too.

The other half: rate limit as courtesy, not config

The _throttle() above is deliberately dumb: one fixed delay between consecutive requests, whichever host they go to. You usually don't need clever. You need a delay a human reading the access log wouldn't flinch at. Three rules I actually follow:

1. One host at a time, or close to it. Concurrency across different domains is fine. Twenty workers on one domain is the anomaly that gets a rule written about you. My longest-surviving runs were low-concurrency-per-host. Boring wins.
2. Honor 429 / Retry-After. That header is the source literally telling you the polite interval. Ignoring it escalates a soft throttle into a hard ban.
3. Spread scheduled runs out. It's a cron job, so spreading the budget over an hour costs you nothing and flattens the load spike on their side.
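
The first two rules can be sketched as one small helper: a throttle that keeps a separate timer per host and pushes that timer out when the source sends Retry-After. HostThrottle, retry_after_seconds, and penalize are hypothetical names, and the 10x fallback when the header is missing or unparseable is an assumption to tune, not a recommendation from the source. Retry-After can be either a number of seconds or an HTTP date, so the parser handles both.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit


def retry_after_seconds(value, now=None):
    """Parse Retry-After (delta-seconds or HTTP-date) into seconds, or None."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
        now = now or datetime.now(timezone.utc)
        return max(0.0, (when - now).total_seconds())
    except (TypeError, ValueError):
        return None  # unparseable: the caller falls back to its own backoff


class HostThrottle:
    """Tracks an 'earliest next request' time separately for each host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_ok = {}  # host -> clock value of the next allowed request

    def wait(self, url):
        """Block until this URL's host may be hit again, then book the slot."""
        host = urlsplit(url).netloc
        delay = self._next_ok.get(host, 0.0) - self._clock()
        if delay > 0:
            self._sleep(delay)
        self._next_ok[host] = self._clock() + self.min_interval

    def penalize(self, url, retry_after):
        """After a 429 or 503, push the host's next slot out by Retry-After."""
        seconds = retry_after_seconds(retry_after)
        if seconds is None:
            seconds = self.min_interval * 10  # assumed fallback, tune to taste
        host = urlsplit(url).netloc
        self._next_ok[host] = max(self._next_ok.get(host, 0.0),
                                  self._clock() + seconds)
```

Usage would be throttle.wait(url) before each request and throttle.penalize(url, r.headers.get("retry-after")) whenever the status is 429. The clock and sleep parameters exist only so the behavior can be checked without real waiting.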

None of these live in robots.txt. The ethical rate limit lives in your code.
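
Spreading a run across its window can be as simple as computing a start offset per request up front. This is a sketch under assumptions: spread_offsets is a hypothetical helper, the one-hour window mirrors the example in the rule above, and the 20% jitter is an arbitrary default chosen so the requests are evenly spaced without being perfectly periodic.

```python
import random


def spread_offsets(n_requests, window_seconds=3600.0, jitter=0.2, rng=random):
    """Return n start offsets (seconds from run start) spread over the window.

    The window is cut into n equal slots. Each request goes near the middle
    of its slot, nudged by up to +/- jitter * slot. Keep jitter below 0.5 so
    offsets stay inside their slots and come out in increasing order.
    """
    if n_requests <= 0:
        return []
    slot = window_seconds / n_requests
    offsets = []
    for i in range(n_requests):
        nudge = rng.uniform(-jitter, jitter) * slot
        offsets.append((i + 0.5) * slot + nudge)
    return offsets
```

A scheduled job would then sleep until each offset before issuing the matching request, instead of firing everything in the first minute of the hour.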

Honesty about "which sources stay up"

I can't hand you a ranked uptime table of named sites — I don't have clean enough per-source numbers to publish one without inventing it, and inventing numbers is the fastest way to make a scraping post worthless. What I can say from 2,190 runs: the sources that kept working were the ones where my scraper behaved like a considerate guest (conditional GET, a delay, an honest User-Agent). The ones I lost were usually the ones where I got greedy with concurrency or skipped conditional GET because "it's just a few thousand pages."

I know that last one from getting it wrong. The first version of one of my recurring scrapers had no conditional-GET layer — I skipped it thinking "it's a couple thousand pages, I'll add caching later." Around run 200 (rough memory, not a logged number) it started catching throttling it hadn't before. I blamed the site for a week. Then I added the ETag / If-None-Match layer, the per-run request count dropped, and the throttling stopped. The bug was me.

That's a correlation, not a controlled experiment. Some of those lost-access incidents were probably the site changing its own defenses, nothing to do with me — and I can't cleanly separate those out, so I won't pretend the politeness caused the uptime. I'm not going to inflate it into an industry trend with a percentage either. But the direction isn't subtle: politeness and persistence track together. The scraper that's kind to the source is the one still running next quarter.

Full long-form (the reasoning, the 962-runs story, the Monday checklist): https://blog.spinov.online/blog/ethical-scraping-is-a-rate-limit-question/

I've run 2,190 production scrapes across 32 scrapers (profile: https://apify.com/knotless_cadence). If you need a recurring scraper that stays up instead of getting throttled on run 200, I build those — tell me the source and the schedule: spinov001@gmail.com.

Drafted with AI assistance, edited and fact-checked by me.
