A 10-Line Playwright Trick That Saves Me Hours on Every Sephora Scrape
SIÁN Agency · 2026-05-22 · via DEV Community

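Most Playwright tutorials teach you to scrape a single page. Real scrapers have to get through thousands. What kills you isn't the selectors; it's everything Playwright does before it ever reaches a selector.

By default, Playwright loads a page the way a human visitor would. It downloads CSS, fonts, analytics scripts, A/B-testing pixels, hero images, lazy-loaded carousels, and three different chat widgets. On a product catalog page, that is 4-6 MB of content you don't need. Multiply by 10,000 pages and it's the difference between a 20-minute run and a 3-hour run.

Here is the 10-line route handler I drop into every actor: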

const BLOCKED = ['image', 'media', 'font', 'stylesheet'];

await context.route('**/*', (route) => {
  const type = route.request().resourceType();
  const url  = route.request().url();
  if (BLOCKED.includes(type)) return route.abort();
  if (/google-analytics|doubleclick|hotjar|segment|gtm/.test(url)) {
    return route.abort();
  }
  route.continue();
});


That's it. Two lists: the resource types you don't need, and the tracking domains you definitely don't need.
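The two lists can also be pulled into a pure function, so the blocking decision can be checked without launching a browser. This is a minimal sketch rather than the author's code: `shouldBlock` and `TRACKER_PATTERN` are hypothetical names, and the list contents are copied from the handler above.

```javascript
// Resource types the scraper never reads (copied from the handler above).
const BLOCKED = ['image', 'media', 'font', 'stylesheet'];

// Tracker URL fragments to drop (same pattern as the handler above).
const TRACKER_PATTERN = /google-analytics|doubleclick|hotjar|segment|gtm/;

// Hypothetical helper: returns true when a request should be aborted.
function shouldBlock(resourceType, url) {
  return BLOCKED.includes(resourceType) || TRACKER_PATTERN.test(url);
}

// Possible wiring into Playwright (context is a BrowserContext created elsewhere):
// await context.route('**/*', (route) => {
//   const req = route.request();
//   return shouldBlock(req.resourceType(), req.url())
//     ? route.abort()
//     : route.continue();
// });
```

Keeping the decision separate from the route callback makes it easy to unit-test new entries before adding them to either list.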

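A 3-item checklist before you ship

  1. Test that your data is still there. Some sites lazy-load product information into data- attributes on images. Aborting images sometimes breaks extraction. Run from the command line with and without the route handler and compare the output.
  2. Don't block scripts. Modern sites build the DOM with JS. Abort scripts and you get an empty page. (CSS and fonts are safe: Playwright doesn't need them to locate selectors.)
  3. Watch for sites that detect this. Some anti-bot scripts check whether you fetched the analytics pixel. If your success rate drops after enabling the handler, let the analytics domains through again.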
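The comparison in item 1 could be sketched as a small diff over the two runs' outputs. This assumes each run produces an array of flat records and that every record carries a unique `url` field; both the record shape and the name `diffRuns` are hypothetical, not taken from the article.

```javascript
// Hypothetical helper: compare records scraped WITHOUT the route handler
// (baseline) against records scraped WITH it (blocked), and report what
// went missing once the block list was enabled.
function diffRuns(baseline, blocked, key = 'url') {
  const blockedByKey = new Map(blocked.map((rec) => [rec[key], rec]));
  const missingRecords = [];
  const missingFields = [];

  for (const rec of baseline) {
    const other = blockedByKey.get(rec[key]);
    if (!other) {
      // The whole record disappeared in the blocked run.
      missingRecords.push(rec[key]);
      continue;
    }
    // A field counts as lost if the baseline had a value and the blocked run did not.
    const lost = Object.keys(rec).filter(
      (f) => rec[f] != null && rec[f] !== '' && (other[f] == null || other[f] === '')
    );
    if (lost.length > 0) missingFields.push({ key: rec[key], lost });
  }
  return { missingRecords, missingFields };
}
```

If both arrays in the result come back empty for a sample of pages, the block list is not costing you data on that sample.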

Fig. 1 — Page weight before vs after the block list. Same DOM, less network.

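Quick case study

On our Sephora product info scraper, this one change cut the average page load time from 4.8 seconds to 1.3 seconds. On a 5,000-product catalog scrape, that is the difference between 6.5 hours and 1.8 hours. Same CSS selectors, same data, same success rate. We just stopped downloading hero images of moisturizers we never look at.

It also cut our Apify compute units per run by about 60%, which directly affects what we charge clients. Faster, cheaper, same output. The route handler now ships with the Sephora Product Info actor, and with every new scraper since.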

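The call to action you didn't ask for

This route handler ships with our starter actor template. New scrapers get it on day one. Old scrapers got it retrofitted the first time we noticed a run time over 1 hour.

The pattern applies to any browser-based scraper: Playwright, Puppeteer, Selenium with CDP. The shape is always the same: tell the browser what not to load before you tell it what to look for.

A note for the JS-heavy folks: the same pattern works with Puppeteer's page.setRequestInterception(true). Same idea, slightly different API. It works just as well.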
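A rough Puppeteer version might look like the sketch below; it is an illustration under stated assumptions, not the author's code. `page.setRequestInterception`, the `request` event, and the request methods `resourceType()`, `url()`, `abort()`, and `continue()` are Puppeteer APIs. The names `makeRequestHandler` and `attachBlocker` are hypothetical, and `page` is assumed to be a Puppeteer `Page` created elsewhere.

```javascript
// Same two lists as the Playwright handler earlier in the post.
const BLOCKED = ['image', 'media', 'font', 'stylesheet'];
const TRACKERS = /google-analytics|doubleclick|hotjar|segment|gtm/;

// Hypothetical factory: builds a listener for Puppeteer's 'request' event.
function makeRequestHandler(blockedTypes = BLOCKED, trackers = TRACKERS) {
  return (request) => {
    const drop =
      blockedTypes.includes(request.resourceType()) ||
      trackers.test(request.url());
    // Every intercepted request must be either aborted or continued.
    return drop ? request.abort() : request.continue();
  };
}

// Hypothetical wiring; `page` is a Puppeteer Page created elsewhere.
async function attachBlocker(page) {
  // Interception has to be enabled before abort()/continue() may be called.
  await page.setRequestInterception(true);
  page.on('request', makeRequestHandler());
}
```

The main difference from Playwright is that interception is switched on per page and handled through an event listener, rather than registered as a route on the context.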

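Drop your slowest scraper's run time in the comments. I'll guess what's eating your minutes. (Hint: it's probably hero images.)

Agree, disagree, or have a site where blocking images breaks something subtle? Reply.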


Written by Nova Chen, automation developer advocate at SIÁN Agency. Learn more about Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.