Stop Building Fragile Scrapers — Build Actors Instead

TL;DR — A "scraper" is a script that ran once. An "actor" is a unit of work with an input contract, an output schema, observability, and a billing model. Same code, completely different operational surface. We migrated our Bayut property pipeline from the first to the second this quarter and the support load dropped 70%.

I get sent a lot of scraper repos to "review" — usually after they've broken in production. They look surprisingly similar:

One Python file, 300–600 lines.
A main() that loops over URLs.
requests.get() plus BeautifulSoup plus a try/except: pass that swallows everything.
Output written to a CSV called output.csv in the working directory.
A cron job that triggers it nightly. Sometimes a Slack webhook on failure that stopped working six months ago.

This is what I call a script that ran once. The fact that it ran in production doesn't make it production code.

The teardown is always the same.

The five failure modes you inherit when you ship a script

No input contract. The script reads URLs from a hardcoded list or a file path that only exists on your laptop. New requirement → edit the file → redeploy → hope.
No output schema. Whatever fields happened to be present this run get written. When the source site adds a column, the CSV silently widens. When the source site removes a column, downstream breaks at parse time, three hops away from the cause.
No observability. "Did it run last night?" is answered by SSH-ing to the box and ls -la output.csv. Run history is the file's mtime. Failure mode is "the file is older than expected."
No retries with backoff. A 503 from the target site at 02:14 kills the run. There is no second attempt. The next run is in 24 hours.
No billing surface. The cost of running it is your time and your server. There is no per-unit price, so there is no signal that the unit economics are bad until you check the AWS bill.
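
The fourth failure mode is the easiest to picture in code. What follows is a minimal sketch of what "retries with backoff" means, not the implementation of any particular runtime. `TransientError` and `retry_with_backoff` are hypothetical names, and the delays are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. an HTTP 503."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying TransientError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure instead of swallowing it
            sleep(base_delay * 2 ** (attempt - 1))  # base, 2x base, 4x base, ...
    raise AssertionError("unreachable")
```

The contrast with try/except: pass is the last branch: when the attempts run out, the error propagates so that something upstream can see it. A real runtime would typically add jitter and a cap on the delay.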

A script is fine for "I need this data once." It is not fine for "we need this data nightly for the next two years." But teams keep shipping the first to fulfill the second.

What an actor is

Strip the marketing word and an actor is just: a containerised job with a declared input schema, a declared output schema, and a runtime that handles scheduling, retries, logs, persistent storage, and billing. Apify is one implementation — there are others. The shape matters more than the vendor.

When we rebuilt our Bayut property scraper as an actor, four things changed at the level of code:

// 1. Input is validated against a schema before main() runs.
//    Bad input fails fast with a useful error, not a silent miss.
const input = await Actor.getInput(); // INPUT_SCHEMA.json enforces shape

// 2. Output goes to a typed dataset. New fields require a schema
//    change — not a silent CSV widening.
await Dataset.pushData({
  listingId, price, currency, address, lat, lng, scrapedAt
});

// 3. Failures retry with backoff at the platform level.
//    Our code throws; the runtime decides what to do.
throw new ScrapeFailure('listing-blocked', { url, status: 429 });

// 4. Logs are structured, queryable, and indexed by run.
log.warning('rate-limit', { url, retryAfter: 60 });

That's it. Same Playwright, same selectors, same scraping logic. The difference is that all the boring infrastructure — input validation, output typing, retries, logs, scheduling, billing — is no longer your problem.

Result

For Bayut specifically, three months after the migration:

Mean time to detect a breakage went from ~36 hours (next-day stakeholder complaint) to under 15 minutes (failed runs alert with the offending URL and HTTP status).
Support tickets dropped 70%. Most of the volume was "the data is missing" — invisible failures from the cron-script era. With per-run datasets, failed runs surface themselves.
Cost per 1000 listings went down, not up. Concurrency at the runtime level is cheaper than spinning up your own queue.

The migration itself took about a week. Most of the time was not the scraping logic — that was already there. It was deciding what the input schema should be, what the output schema should be, and which fields were "nice to have" vs "the dataset is broken without this."
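
The "nice to have" versus "broken without this" decision can be written down as data. Here is a rough sketch using the field names from the pushData call earlier. Which fields land in which bucket is an assumption made for illustration, and `REQUIRED`, `OPTIONAL` and `validate_row` are hypothetical names.

```python
from typing import Any

# Assumed split for illustration; the real decision is per project.
REQUIRED = {
    "listingId": (str,),
    "price": (int, float),
    "currency": (str,),
    "scrapedAt": (str,),
}
OPTIONAL = {
    "address": (str,),
    "lat": (int, float),
    "lng": (int, float),
}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of schema violations; an empty list means the row is valid."""
    errors: list[str] = []
    for name, types in REQUIRED.items():
        if name not in row or row[name] is None:
            errors.append(f"missing required field: {name}")
        elif not isinstance(row[name], types):
            errors.append(f"wrong type for {name}: {type(row[name]).__name__}")
    for name, types in OPTIONAL.items():
        value = row.get(name)
        if value is not None and not isinstance(value, types):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    # Unknown fields are rejected so the dataset cannot widen silently.
    for name in sorted(set(row) - set(REQUIRED) - set(OPTIONAL)):
        errors.append(f"unexpected field: {name}")
    return errors
```

A row that fails validation is rejected at write time, next to the cause, instead of breaking a consumer three hops downstream.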

The replacement pattern

If you're sitting on a script-shaped scraper right now, the migration order is:

1. Write the input schema. Force every run to declare what it's scraping.
2. Write the output schema. Force every row to validate before it gets persisted.
3. Move retries from try/except: pass to the runtime.
4. Replace print() with structured logs.
5. Containerise. Whatever runs in python main.py should run in docker run.
6. Pick a runtime: Apify, your own k8s cron, whatever. The schema work is portable.

You do steps 1–5 inside your existing repo. You haven't committed to a vendor yet. By the time you reach step 6, the actor exists — the runtime is just a deployment target.
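
Steps 1 and 4 need nothing beyond the standard library. The following is a minimal sketch, assuming an input made of a list of start URLs and a listing cap. `RunInput`, `parse_input`, `log_event` and both field names are hypothetical, chosen only for illustration.

```python
import json
import logging
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class RunInput:
    """Hypothetical input contract: what a run must declare before it starts."""
    start_urls: tuple[str, ...]
    max_listings: int = 100


def parse_input(raw: dict[str, Any]) -> RunInput:
    """Validate raw input and fail fast with a specific error message."""
    unknown = sorted(set(raw) - {"start_urls", "max_listings"})
    if unknown:
        raise ValueError(f"unknown input fields: {unknown}")
    urls = raw.get("start_urls")
    if not isinstance(urls, list) or not urls:
        raise ValueError("start_urls must be a non-empty list")
    for url in urls:
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url!r}")
    max_listings = raw.get("max_listings", 100)
    if isinstance(max_listings, bool) or not isinstance(max_listings, int) or max_listings < 1:
        raise ValueError("max_listings must be a positive integer")
    return RunInput(start_urls=tuple(urls), max_listings=max_listings)


def log_event(logger: logging.Logger, level: int, event: str, **fields: Any) -> str:
    """Emit one JSON object per log line so runs can be filtered by field."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.log(level, line)
    return line
```

Calling `parse_input` at the top of main() turns "the file path only exists on my laptop" into an immediate, named error. `log_event(logger, logging.WARNING, "rate-limit", url=url, retry_after=60)` is the print() replacement.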

We packaged this migration shape into a starter we use for every new client engagement: the same six steps that produced the Bayut property scraper above, every time.

Which of the five failure modes is currently shipping in your stack? Drop it in the comments — I'll point at the smallest change that fixes it.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.
