Stop Building Fragile Scrapers — Build Actors Instead
SIÁN Agency · 2026-05-18 · via DEV Community

TL;DR — A "scraper" is a script that ran once. An "actor" is a unit of work with an input contract, an output schema, observability, and a billing model. Same code, completely different operational surface. We migrated our Bayut property pipeline from the first to the second this quarter and the support load dropped 70%.

I get sent a lot of scraper repos to "review" — usually after they've broken in production. They look surprisingly similar:

  • One Python file, 300–600 lines.
  • A main() that loops over URLs.
  • requests.get() plus BeautifulSoup plus a try/except: pass that swallows everything.
  • Output written to a CSV called output.csv in the working directory.
  • A cron job that triggers it nightly. Sometimes a Slack webhook on failure that stopped working six months ago.

This is what I call a script that ran once. The fact that it ran in production doesn't make it production code.

The teardown is always the same.

The five failure modes you inherit when you ship a script

  1. No input contract. The script reads URLs from a hardcoded list or a file path that only exists on your laptop. New requirement → edit the file → redeploy → hope.
  2. No output schema. Whatever fields happened to be present this run get written. When the source site adds a column, the CSV silently widens. When the source site removes a column, downstream breaks at parse time, three hops away from the cause.
  3. No observability. "Did it run last night?" is answered by SSH-ing to the box and ls -la output.csv. Run history is the file's mtime. Failure mode is "the file is older than expected."
  4. No retries with backoff. A 503 from the target site at 02:14 kills the run. There is no second attempt. The next run is in 24 hours.
  5. No billing surface. The cost of running it is your time and your server. There is no per-unit price, so there is no signal that the unit economics are bad until you check the AWS bill.

A script is fine for "I need this data once." It is not fine for "we need this data nightly for the next two years." But teams keep shipping the one-off script to meet the nightly requirement.
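To make failure mode 4 concrete, here is a minimal sketch of retrying with exponential backoff, in Python because that is what these scripts are usually written in. The names `TransientError` and `retry_with_backoff`, the attempt count, and the delays are illustrative assumptions, not how any particular runtime implements it.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. an HTTP 503 or 429)."""


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the failure instead of swallowing it.
                raise
            # Exponential backoff: 1x, 2x, 4x the base delay, plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the policy can be tested without waiting. The point is that the scraping code only raises; the decision to retry lives in one place.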

What an actor is

Strip the marketing word and an actor is just: a containerised job with a declared input schema, a declared output schema, and a runtime that handles scheduling, retries, logs, persistent storage, and billing. Apify is one implementation — there are others. The shape matters more than the vendor.

When we rebuilt our Bayut property scraper as an actor, four things changed at the level of code:

// 1. Input is validated against a schema before main() runs.
//    Bad input fails fast with a useful error, not a silent miss.
const input = await Actor.getInput(); // INPUT_SCHEMA.json enforces shape

// 2. Output goes to a typed dataset. New fields require a schema
//    change — not a silent CSV widening.
await Dataset.pushData({
  listingId, price, currency, address, lat, lng, scrapedAt
});

// 3. Failures retry with backoff at the platform level.
//    Our code throws; the runtime decides what to do.
throw new ScrapeFailure('listing-blocked', { url, status: 429 });

// 4. Logs are structured, queryable, and indexed by run.
log.warning('rate-limit', { url, retryAfter: 60 });


That's it. Same Playwright, same selectors, same scraping logic. The difference is that all the boring infrastructure — input validation, output typing, retries, logs, scheduling, billing — is no longer your problem.

Fig. 1 — Concerns owned by the developer (script) vs. concerns owned by the runtime (actor).

Result

For Bayut specifically, three months after the migration:

  • Mean time to detect a breakage went from ~36 hours (next-day stakeholder complaint) to under 15 minutes (failed runs alert with the offending URL and HTTP status).
  • Support tickets dropped 70%. Most of the volume was "the data is missing" — invisible failures from the cron-script era. With per-run datasets, failed runs surface themselves.
  • Cost per 1000 listings went down, not up. Concurrency at the runtime level is cheaper than spinning up your own queue.

The migration itself took about a week. Most of the time was not the scraping logic — that was already there. It was deciding what the input schema should be, what the output schema should be, and which fields were "nice to have" vs "the dataset is broken without this."
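The required-versus-optional decision can be written down as code. This is a sketch only: the field names come from the dataset example above, but which ones count as required is an assumption made for illustration, and `Listing`, `SchemaError`, and `validate_row` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


class SchemaError(ValueError):
    """Raised when a scraped row does not satisfy the output schema."""


@dataclass(frozen=True)
class Listing:
    # Assumed required: a row without these is treated as broken.
    listing_id: str
    price: float
    currency: str
    scraped_at: str
    # Assumed nice to have: may be absent without failing the row.
    address: Optional[str] = None
    lat: Optional[float] = None
    lng: Optional[float] = None


REQUIRED = ("listingId", "price", "currency", "scrapedAt")
OPTIONAL = ("address", "lat", "lng")


def validate_row(raw: Mapping[str, Any]) -> Listing:
    """Reject rows with missing required fields or unknown fields."""
    missing = [key for key in REQUIRED if raw.get(key) is None]
    if missing:
        raise SchemaError(f"missing required fields: {missing}")
    unknown = sorted(set(raw) - set(REQUIRED) - set(OPTIONAL))
    if unknown:
        # A new upstream column must be a deliberate schema change,
        # not a silent widening of the output.
        raise SchemaError(f"unknown fields: {unknown}")
    return Listing(
        listing_id=str(raw["listingId"]),
        price=float(raw["price"]),
        currency=str(raw["currency"]),
        scraped_at=str(raw["scrapedAt"]),
        address=raw.get("address"),
        lat=None if raw.get("lat") is None else float(raw["lat"]),
        lng=None if raw.get("lng") is None else float(raw["lng"]),
    )
```

Calling `validate_row` on every row before it is persisted means a missing price fails at the scraper, not three hops downstream.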

The replacement pattern

If you're sitting on a script-shaped scraper right now, the migration order is:

  1. Write the input schema. Force every run to declare what it's scraping.
  2. Write the output schema. Force every row to validate before it gets persisted.
  3. Move retries from try/except: pass to the runtime.
  4. Replace print() with structured logs.
  5. Containerise. Whatever runs in python main.py should run in docker run.
  6. Pick a runtime — Apify, your own k8s cron, whatever. The schema work is portable.
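Step 1 needs nothing beyond the standard library to get started. A minimal sketch, assuming the run input arrives as a JSON document; the keys `startUrls` and `maxListings` and the names `RunInput`, `InputError`, and `parse_input` are hypothetical and would be replaced by whatever your pipeline actually needs.

```python
import json
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


class InputError(ValueError):
    """Raised when a run's input does not match the declared contract."""


@dataclass(frozen=True)
class RunInput:
    start_urls: List[str]
    max_listings: int = 100


def parse_input(raw_json: str) -> RunInput:
    """Validate the run input before any scraping starts."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise InputError(f"input is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InputError("input must be a JSON object")

    urls = data.get("startUrls")
    if not isinstance(urls, list) or not urls:
        raise InputError("startUrls must be a non-empty list")
    for url in urls:
        parsed = urlparse(url) if isinstance(url, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise InputError(f"not an http(s) URL: {url!r}")

    max_listings = data.get("maxListings", 100)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_listings, bool) or not isinstance(max_listings, int) or max_listings < 1:
        raise InputError("maxListings must be a positive integer")

    return RunInput(start_urls=list(urls), max_listings=max_listings)
```

Because the contract is plain JSON in and a typed object out, it ports to any runtime that can hand the process a JSON input.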

You do steps 1–5 inside your existing repo. You haven't committed to a vendor yet. By the time you reach step 6, the actor exists — the runtime is just a deployment target.
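Step 4 also fits in the standard library. The sketch below emits one JSON object per log line, stamped with a run identifier; `JsonFormatter`, `make_logger`, the `fields` key, and the `run-1` identifier are assumptions for illustration, not the API of any runtime.

```python
import io
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, run_id: str) -> None:
        super().__init__()
        self.run_id = run_id

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "run_id": self.run_id,
        }
        # Context passed by the caller via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(run_id: str, stream: TextIO) -> logging.Logger:
    """Return a logger whose every line carries the run it belongs to."""
    logger = logging.getLogger(f"scraper.{run_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(run_id))
    logger.handlers = [handler]
    return logger


# Usage: captured in a buffer here; a real run would pass sys.stderr.
buffer = io.StringIO()
log = make_logger("run-1", buffer)
log.warning("rate-limit", extra={"fields": {"url": "https://example.com/x", "retry_after": 60}})
```

Each line can then be filtered by `run_id` or `event` in whatever log store the chosen runtime provides, which is what replaces checking a file's mtime.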

We packaged this migration shape into a starter we use for every new client engagement: the same six steps that produced the Bayut property scraper above, every time.

Which of the five failure modes is currently shipping in your stack? Drop it in the comments — I'll point at the smallest change that fixes it.


Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.