Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export
Sana Hassan · 2026-06-21 · via MarkTechPost

In this tutorial, we build a full Crawlee-for-Python workflow that covers environment setup, local website generation, static crawling, dynamic crawling, structured extraction, and downstream data processing. We begin by configuring a compatible Crawlee runtime with pinned Pydantic support, Playwright browser installation, persistent storage directories, and Colab-safe execution handling. We then generate a realistic local demo website containing product pages, documentation pages, blog content, internal links, robots.txt rules, JSON-LD metadata, and JavaScript-rendered catalog items. Using BeautifulSoupCrawler, we perform fast recursive HTML crawling and extract page titles, metadata, text previews, outgoing links, product attributes, documentation headings, code blocks, and blog tags. With ParselCrawler, we run precise CSS- and XPath-based extraction on product detail pages. With PlaywrightCrawler, we render JavaScript content in a headless Chromium browser, wait for dynamic DOM elements to appear, extract client-side data, and capture full-page screenshots.
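The robots.txt rules mentioned above use the standard allow/disallow format. As a minimal sketch, independent of Crawlee itself, Python's built-in `urllib.robotparser` can show how such rules classify URLs. The rule text mirrors the file the demo site writes later in the tutorial; the base URL and port are placeholders (an assumption), since the local server picks a free port at runtime.

```python
from urllib.robotparser import RobotFileParser

# Same rules the demo site writes to robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def build_robots_checker(robots_text):
    """Parse robots.txt text in memory, without any network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


# Placeholder base URL; the real port is chosen when the server starts.
BASE_URL = "http://127.0.0.1:8000"

checker = build_robots_checker(ROBOTS_TXT)
docs_allowed = checker.can_fetch("*", BASE_URL + "/docs/getting-started.html")
admin_allowed = checker.can_fetch("*", BASE_URL + "/admin/hidden.html")

print("docs page allowed:", docs_allowed)
print("admin page allowed:", admin_allowed)
```

The parser applies the first matching rule, so the `Disallow: /admin/` line takes effect before the catch-all `Allow: /`.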

Setting Up the Crawlee Python Runtime and Helpers

import os
import sys
import re
import csv
import json
import time
import math
import shutil
import socket
import hashlib
import asyncio
import textwrap
import subprocess
import threading
from pathlib import Path
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler
from importlib.metadata import version, PackageNotFoundError
SETUP_SENTINEL = "/content/.crawlee_python_tutorial_setup_done_v2"
def sh(command, check=True, quiet=False):
   print(f"\n$ {command}")
   result = subprocess.run(
       command,
       shell=True,
       text=True,
       stdout=subprocess.PIPE,
       stderr=subprocess.STDOUT,
   )
   if not quiet and result.stdout:
       print(result.stdout[-5000:])
   if check and result.returncode != 0:
       raise RuntimeError(f"Command failed with exit code {result.returncode}: {command}")
   return result.returncode == 0
def package_version(package_name):
   try:
       return version(package_name)
   except PackageNotFoundError:
       return None
def is_good_pydantic_version(v):
   if not v:
       return False
   m = re.match(r"^(\d+)\.(\d+)", v)
   if not m:
       return False
   major, minor = int(m.group(1)), int(m.group(2))
   return major == 2 and minor == 11
current_crawlee = package_version("crawlee")
current_pydantic = package_version("pydantic")
needs_setup = (
   not os.path.exists(SETUP_SENTINEL)
   or current_crawlee is None
   or not is_good_pydantic_version(current_pydantic)
)
if needs_setup:
   print("PHASE 1: Installing compatible Crawlee + Pydantic + Playwright dependencies.")
   print("After this finishes, Colab will restart automatically. Then run this same cell again.")
   sh(f'{sys.executable} -m pip uninstall -y crawlee pydantic pydantic-core', check=False)
   sh(
       f'{sys.executable} -m pip install -q -U '
       f'"pydantic>=2.11,<2.12" '
       f'"crawlee[all]" '
       f'pandas matplotlib networkx nest_asyncio beautifulsoup4 parsel'
   )
   sh(f'{sys.executable} -m playwright install --with-deps chromium', check=False)
   Path(SETUP_SENTINEL).write_text("done", encoding="utf-8")
   print("\nInstalled versions:")
   sh(f'{sys.executable} -m pip show crawlee pydantic pydantic-core', check=False)
   try:
       import google.colab
       print("\nRestarting Colab runtime now. After it reconnects, run this same cell again.")
       os.kill(os.getpid(), 9)
   except Exception:
       raise SystemExit("Setup complete. Restart the runtime/kernel manually, then run this cell again.")
print("PHASE 2: Dependencies are ready. Running the Crawlee tutorial.")
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import nest_asyncio
nest_asyncio.apply()
TUTORIAL_ROOT = Path("/content/crawlee_python_advanced_tutorial")
SITE_DIR = TUTORIAL_ROOT / "demo_site"
OUTPUT_DIR = TUTORIAL_ROOT / "outputs"
STORAGE_DIR = TUTORIAL_ROOT / "crawlee_storage"
SCREENSHOT_DIR = OUTPUT_DIR / "screenshots"
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR]:
   if path.exists():
       shutil.rmtree(path)
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR, SCREENSHOT_DIR]:
   path.mkdir(parents=True, exist_ok=True)
os.environ["CRAWLEE_STORAGE_DIR"] = str(STORAGE_DIR)
os.environ["CRAWLEE_LOG_LEVEL"] = "INFO"
os.environ["CRAWLEE_PURGE_ON_START"] = "true"
from crawlee import Glob, ConcurrencySettings
from crawlee.crawlers import (
   BeautifulSoupCrawler,
   BeautifulSoupCrawlingContext,
   ParselCrawler,
   ParselCrawlingContext,
   PlaywrightCrawler,
   PlaywrightCrawlingContext,
)
try:
   import crawlee
   print("Crawlee version:", crawlee.__version__)
except Exception:
   print("Crawlee imported successfully.")
print("Pydantic version:", package_version("pydantic"))
def safe_slug(value):
   value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
   return value or "item"
def money_to_float(value):
   if value is None:
       return None
   cleaned = re.sub(r"[^0-9.]", "", str(value))
   return float(cleaned) if cleaned else None
def normalize_text(value, max_len=None):
   value = re.sub(r"\s+", " ", value or "").strip()
   return value[:max_len] if max_len else value
def write_file(path, content):
   path = Path(path)
   path.parent.mkdir(parents=True, exist_ok=True)
   path.write_text(textwrap.dedent(content).strip() + "\n", encoding="utf-8")

We begin by preparing the Colab runtime. The setup phase pins Pydantic to the 2.11 series, installs Crawlee with all extras, the analysis libraries, and Playwright's Chromium build, then writes a sentinel file and restarts the runtime so the new package versions are loaded. On the second run, we create clean site, output, storage, and screenshot directories, point Crawlee at the storage directory through environment variables, import the three crawler classes, and define small helpers for slug generation, price parsing, text normalization, and file writing.
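To make the helper behavior concrete, here is a small standalone sketch that repeats three of the helpers from the listing above and runs them on sample inputs. The sample strings are illustrative assumptions, not values taken from a real crawl.

```python
import re


def safe_slug(value):
    # Collapse every run of non-alphanumeric characters into one hyphen.
    value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
    return value or "item"


def money_to_float(value):
    # Keep digits and the decimal point, drop currency symbols and commas.
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.]", "", str(value))
    return float(cleaned) if cleaned else None


def normalize_text(value, max_len=None):
    # Squash all whitespace runs to single spaces, then optionally truncate.
    value = re.sub(r"\s+", " ", value or "").strip()
    return value[:max_len] if max_len else value


slug = safe_slug("CRW-101")
price = money_to_float("$1,249.50")
missing_price = money_to_float("n/a")
preview = normalize_text("  Crawlee \n for   Python ", max_len=7)

print(slug, price, missing_price, repr(preview))
```

The slug is what the site generator uses for product file names, so `CRW-101` ends up at `/products/product-crw-101.html`.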

Generating the Demo Website and Product Catalog

PRODUCTS = [
   {
       "sku": "CRW-101",
       "name": "Crawler Reliability Kit",
       "category": "automation",
       "price": 149.0,
       "rating": 4.8,
       "stock": 18,
       "features": ["retry policy", "queue replay", "structured logs"],
       "related": ["CRW-202", "CRW-303"],
   },
   {
       "sku": "CRW-202",
       "name": "Playwright Rendering Pack",
       "category": "browser",
       "price": 249.0,
       "rating": 4.7,
       "stock": 9,
       "features": ["headless chromium", "screenshots", "dynamic DOM extraction"],
       "related": ["CRW-101", "CRW-404"],
   },
   {
       "sku": "CRW-303",
       "name": "RAG Extraction Bundle",
       "category": "ai-data",
       "price": 199.0,
       "rating": 4.9,
       "stock": 13,
       "features": ["clean text chunks", "metadata capture", "JSONL export"],
       "related": ["CRW-101", "CRW-505"],
   },
   {
       "sku": "CRW-404",
       "name": "Anti-Fragile Session Toolkit",
       "category": "resilience",
       "price": 299.0,
       "rating": 4.6,
       "stock": 5,
       "features": ["session rotation", "state recovery", "graceful failures"],
       "related": ["CRW-202", "CRW-505"],
   },
   {
       "sku": "CRW-505",
       "name": "Data Export Control Plane",
       "category": "storage",
       "price": 179.0,
       "rating": 4.5,
       "stock": 21,
       "features": ["datasets", "key-value store", "CSV and JSON export"],
       "related": ["CRW-303", "CRW-404"],
   },
]
def layout(title, body, extra_head="", extra_script=""):
   css = """
   <style>
     body {
       font-family: Inter, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
       margin: 0;
       background: #f7f7fb;
       color: #1f2430;
     }
     header {
       background: #202638;
       color: white;
       padding: 28px 40px;
     }
     nav a {
       color: #dbe7ff;
       margin-right: 18px;
       text-decoration: none;
       font-weight: 600;
     }
     main {
       max-width: 1050px;
       margin: 0 auto;
       padding: 32px;
     }
     .grid {
       display: grid;
       grid-template-columns: repeat(auto-fit, minmax(230px, 1fr));
       gap: 18px;
     }
     .card, article, .panel {
       background: white;
       border: 1px solid #e5e7ef;
       border-radius: 16px;
       padding: 20px;
       box-shadow: 0 8px 25px rgba(20, 30, 60, 0.05);
     }
     .price {
       font-size: 1.3rem;
       font-weight: 800;
     }
     .tag {
       display: inline-block;
       background: #edf2ff;
       border: 1px solid #d6e0ff;
       border-radius: 999px;
       padding: 4px 10px;
       margin: 3px;
       font-size: 0.82rem;
     }
     .stock-low {
       color: #b42318;
       font-weight: 700;
     }
     .stock-ok {
       color: #067647;
       font-weight: 700;
     }
     code, pre {
       background: #111827;
       color: #d1fae5;
       border-radius: 10px;
     }
     pre {
       padding: 16px;
       overflow-x: auto;
     }
     footer {
       padding: 30px 40px;
       color: #606779;
     }
   </style>
   """
   return f"""
   <!doctype html>
   <html lang="en">
     <head>
       <meta charset="utf-8">
       <meta name="viewport" content="width=device-width, initial-scale=1">
       <meta name="description" content="{title} page for a Crawlee Python tutorial demo website.">
       <title>{title}</title>
       {css}
       {extra_head}
     </head>
     <body>
       <header>
         <h1>{title}</h1>
         <nav>
           <a href="/index.html">Home</a>
           <a href="/products/product-crw-101.html">Products</a>
           <a href="/docs/getting-started.html">Docs</a>
           <a href="/blog/crawling-at-scale.html">Blog</a>
           <a href="/dynamic.html">Dynamic JS Page</a>
           <a href="/admin/hidden.html">Admin</a>
         </nav>
       </header>
       <main>{body}</main>
       <footer>Local demo website generated for Crawlee Python advanced tutorial.</footer>
       {extra_script}
     </body>
   </html>
   """
def build_demo_site():
   write_file(
       SITE_DIR / "robots.txt",
       """
       User-agent: *
       Disallow: /admin/
       Allow: /
       """,
   )
   product_cards = []
   for product in PRODUCTS:
       product_cards.append(
           f"""
           <div class="card product-teaser" data-sku="{product['sku']}" data-category="{product['category']}">
             <h2><a href="/products/product-{safe_slug(product['sku'])}.html">{product['name']}</a></h2>
             <p>{product['category']} crawler module with rating {product['rating']}.</p>
             <p class="price" data-price="{product['price']}">${product['price']:.2f}</p>
             <p class="{'stock-low' if product['stock'] < 10 else 'stock-ok'}">Stock: {product['stock']}</p>
           </div>
           """
       )
   write_file(
       SITE_DIR / "index.html",
       layout(
           "Crawlee Demo Commerce + Docs Hub",
           f"""
           <section class="panel">
             <h2>Why this site exists</h2>
             <p>
               This local website gives us predictable pages for testing Crawlee without scraping a third-party website.
               We include static HTML pages, documentation pages, product detail pages, a blog article, robots.txt,
               and a JavaScript-rendered page.
             </p>
           </section>
           <h2>Featured crawler modules</h2>
           <section class="grid">
             {''.join(product_cards)}
           </section>
           <section class="panel">
             <h2>Internal links for recursive crawling</h2>
             <ul>
               <li><a href="/docs/getting-started.html">Getting started guide</a></li>
               <li><a href="/docs/advanced-routing.html">Advanced routing guide</a></li>
               <li><a href="/blog/crawling-at-scale.html">Crawling at scale article</a></li>
               <li><a href="/dynamic.html">JavaScript-rendered catalog</a></li>
               <li><a href="/admin/hidden.html">Admin page blocked by robots and crawler filters</a></li>
             </ul>
           </section>
           """,
       ),
   )
   for product in PRODUCTS:
       related_links = "\n".join(
           f'<li><a class="related-link" href="/products/product-{safe_slug(sku)}.html">{sku}</a></li>'
           for sku in product["related"]
       )
       feature_list = "\n".join(f"<li>{feature}</li>" for feature in product["features"])
       json_ld = json.dumps(
           {
               "@context": "https://schema.org",
               "@type": "Product",
               "sku": product["sku"],
               "name": product["name"],
               "category": product["category"],
               "offers": {
                   "@type": "Offer",
                   "price": product["price"],
                   "priceCurrency": "USD",
               },
               "aggregateRating": {
                   "@type": "AggregateRating",
                   "ratingValue": product["rating"],
               },
           },
           indent=2,
       )
       write_file(
           SITE_DIR / "products" / f"product-{safe_slug(product['sku'])}.html",
           layout(
               f"{product['name']} | Product Detail",
               f"""
               <article class="product"
                        data-sku="{product['sku']}"
                        data-category="{product['category']}"
                        data-rating="{product['rating']}"
                        data-stock="{product['stock']}">
                 <h2 class="product-title">{product['name']}</h2>
                 <p class="sku">SKU: <strong>{product['sku']}</strong></p>
                 <p class="category">Category: <strong>{product['category']}</strong></p>
                 <p class="price" data-price="{product['price']}">${product['price']:.2f}</p>
                 <p class="rating">Rating: {product['rating']} / 5</p>
                 <p class="{'stock-low' if product['stock'] < 10 else 'stock-ok'}">Stock: {product['stock']}</p>
                 <h3>Features</h3>
                 <ul class="features">{feature_list}</ul>
                 <h3>Related modules</h3>
                 <ul>{related_links}</ul>
               </article>
               <script type="application/ld+json">{json_ld}</script>
               """,
           ),
       )

We create a realistic product catalog that becomes the structured data source for our demo website. We define reusable HTML layout logic, styling, navigation, and page templates to make the local website look and behave like a small commercial and documentation portal. We then generate the homepage and product detail pages, including prices, ratings, stock levels, product features, related links, and JSON-LD metadata.
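The JSON-LD block embedded in each product page can be read back without any third-party parser. The following is a minimal sketch using only the standard library's `html.parser`; it is not the extraction code the tutorial's crawlers use, and the `JsonLdExtractor` class and the trimmed sample HTML are assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            raw = "".join(self._buffer).strip()
            if raw:
                self.items.append(json.loads(raw))


# Trimmed stand-in for a generated product detail page.
SAMPLE_HTML = """
<article class="product" data-sku="CRW-101" data-stock="18">
  <h2 class="product-title">Crawler Reliability Kit</h2>
  <p class="price" data-price="149.0">$149.00</p>
</article>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "sku": "CRW-101",
 "name": "Crawler Reliability Kit",
 "offers": {"@type": "Offer", "price": 149.0, "priceCurrency": "USD"}}
</script>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
product = extractor.items[0]

print(product["sku"], product["offers"]["price"], product["offers"]["priceCurrency"])
```

Because the same values also appear in `data-*` attributes and visible text, the JSON-LD record gives a convenient cross-check for whatever a crawler extracts from the markup.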

Adding Docs, Blog, Dynamic, and Admin Pages

   write_file(
       SITE_DIR / "docs" / "getting-started.html",
       layout(
           "Getting Started with Reliable Crawlers",
           """
           <article class="doc" data-doc-id="getting-started">
             <h2>HTTP-first crawling strategy</h2>
             <p>
               We start with HTTP crawlers because they are lightweight and efficient.
               Browser crawling is reserved for pages that need JavaScript rendering.
             </p>
             <h2>Core extraction fields</h2>
             <p>
               Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata.
             </p>
             <pre><code>crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)</code></pre>
             <p><a href="/docs/advanced-routing.html">Next: advanced routing</a></p>
           </article>
           """,
       ),
   )
   write_file(
       SITE_DIR / "docs" / "advanced-routing.html",
       layout(
           "Advanced Routing and Storage",
           """
           <article class="doc" data-doc-id="advanced-routing">
             <h2>Queue filtering</h2>
             <p>
               We filter links to keep the crawl focused on the same local domain and skip admin pages.
             </p>
             <h2>Storage design</h2>
             <p>
               Structured rows go to datasets. Binary screenshots and snapshots go to a key-value store.
             </p>
             <pre><code>await context.enqueue_links(include=[Glob("https://example.com/**")])</code></pre>
             <p><a href="/blog/crawling-at-scale.html">Read the scaling article</a></p>
           </article>
           """,
       ),
   )
   write_file(
       SITE_DIR / "blog" / "crawling-at-scale.html",
       layout(
           "Crawling at Scale",
           """
           <article class="blog-post" data-author="demo-team" data-reading-time="7">
             <h2>Scaling crawler jobs without losing reliability</h2>
             <p>
               Production crawlers need controlled concurrency, retry behavior, stable request queues,
               structured exports, and monitoring-ready output.
             </p>
             <p>
               For AI data workflows, we also normalize text, preserve source URLs, create chunks,
               and record extraction provenance.
             </p>
             <span class="tag">queues</span>
             <span class="tag">datasets</span>
             <span class="tag">rag</span>
             <span class="tag">playwright</span>
           </article>
           """,
       ),
   )
   dynamic_items = json.dumps(
       [
           {
               "sku": "JS-900",
               "name": "Dynamic Inventory Scanner",
               "price": 329.0,
               "stock": 4,
               "desc": "Rendered only after JavaScript executes.",
           },
           {
               "sku": "JS-901",
               "name": "Client-Side Review Miner",
               "price": 279.0,
               "stock": 11,
               "desc": "Created by browser-side DOM manipulation.",
           },
           {
               "sku": "JS-902",
               "name": "Async Catalog Watcher",
               "price": 389.0,
               "stock": 7,
               "desc": "Useful for testing PlaywrightCrawler extraction.",
           },
       ],
       indent=2,
   )
   dynamic_script = f"""
   <script>
     const dynamicItems = {dynamic_items};
     function renderItems() {{
       const root = document.querySelector("#dynamic-products");
       root.innerHTML = "";
       for (const item of dynamicItems) {{
         const card = document.createElement("div");
         card.className = "card js-card";
         card.dataset.sku = item.sku;
         card.dataset.price = item.price;
         card.dataset.stock = item.stock;
         card.innerHTML = `
           <h3>${{item.name}}</h3>
           <p class="desc">${{item.desc}}</p>
           <p class="price">$${{item.price.toFixed(2)}}</p>
           <p class="${{item.stock < 8 ? "stock-low" : "stock-ok"}}">Stock: ${{item.stock}}</p>
         `;
         root.appendChild(card);
       }}
       document.querySelector("#render-status").textContent =
         "Rendered " + dynamicItems.length + " JavaScript items.";
     }}
     setTimeout(renderItems, 600);
   </script>
   """
   write_file(
       SITE_DIR / "dynamic.html",
       layout(
           "JavaScript Rendered Catalog",
           """
           <section class="panel">
             <h2>Dynamic content test</h2>
             <p>
               A plain HTTP crawler can download this page, but it will not see the cards below until JavaScript runs.
               PlaywrightCrawler opens a real browser and extracts the rendered DOM.
             </p>
             <p id="render-status">Waiting for JavaScript rendering...</p>
           </section>
           <section id="dynamic-products" class="grid"></section>
           """,
           extra_script=dynamic_script,
       ),
   )
   write_file(
       SITE_DIR / "admin" / "hidden.html",
       layout(
           "Hidden Admin Page",
           """
           <article class="panel">
             <h2>This page should be skipped</h2>
             <p>
               The crawler excludes this admin path to demonstrate control over the crawl scope.
             </p>
           </article>
           """,
       ),
   )
build_demo_site()
print(f"Demo site generated at: {SITE_DIR}")
class QuietHandler(SimpleHTTPRequestHandler):
   def log_message(self, format, *args):
       pass
def start_local_server(directory):
   handler = partial(QuietHandler, directory=str(directory))
   # Bind to port 0 so the OS assigns a free port directly; this avoids the
   # small race between probing for a port and rebinding to it.
   httpd = ThreadingHTTPServer(("127.0.0.1", 0), handler)
   port = httpd.server_address[1]
   thread = threading.Thread(target=httpd.serve_forever, daemon=True)
   thread.start()
   base_url = f"http://127.0.0.1:{port}"
   time.sleep(0.5)
   return httpd, base_url
def extract_json_ld(soup):
   blocks = []
   for script in soup.select('script[type="application/ld+json"]'):
       raw = script.string or script.get_text()
       if not raw:
           continue
       try:
           blocks.append(json.loads(raw))
       except Exception:
           blocks.append({"raw": raw})
   return blocks
def write_json(path, rows):
   path = Path(path)
   path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
def write_csv(path, rows):
   path = Path(path)
   if not rows:
       path.write_text("", encoding="utf-8")
       return
   flattened = []
   for row in rows:
       flat = {}
       for key, value in row.items():
           if isinstance(value, (list, dict)):
               flat[key] = json.dumps(value, ensure_ascii=False)
           else:
               flat[key] = value
       flattened.append(flat)
   fieldnames = sorted({key for row in flattened for key in row.keys()})
   with path.open("w", newline="", encoding="utf-8") as f:
       writer = csv.DictWriter(f, fieldnames=fieldnames)
       writer.writeheader()
       writer.writerows(flattened)

We expand the demo website by adding documentation pages, a blog article, a JavaScript-rendered catalog page, and an admin page intended to be excluded from crawling. We use these pages to test different crawling scenarios, including static HTML extraction, documentation parsing, blog metadata extraction, dynamic browser rendering, and crawl filtering. We also start a local HTTP server and define utilities to extract JSON-LD content and export crawl results to JSON and CSV.
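The CSV export step can be illustrated on its own. The following is a minimal sketch, assuming two hypothetical sample rows and an in-memory buffer instead of a file; `flatten_row` is an illustrative helper name, not part of the tutorial code. It shows how nested lists and dicts are serialized as JSON strings so that each record fits on one CSV row and can be decoded again after reading.

```python
import csv
import io
import json


def flatten_row(row):
    """Serialize list/dict values as JSON strings so each row fits one CSV line."""
    return {
        key: json.dumps(value, ensure_ascii=False) if isinstance(value, (list, dict)) else value
        for key, value in row.items()
    }


# Hypothetical sample rows shaped like the crawl records in this tutorial.
rows = [
    {
        "url": "http://127.0.0.1:8000/index.html",
        "title": "Home",
        "out_links": [{"href": "docs/intro.html", "label": "Docs"}],
    },
    {
        "url": "http://127.0.0.1:8000/blog/post.html",
        "title": "Post",
        "blog": {"tags": ["queues", "rag"]},
    },
]

flattened = [flatten_row(row) for row in rows]
# Rows may have different keys, so the header is the sorted union of all keys.
fieldnames = sorted({key for row in flattened for key in row})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(flattened)  # missing keys are written as empty strings

# Read the CSV back and decode a nested column to confirm the round trip.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
first_links = json.loads(restored[0]["out_links"])
print(fieldnames)
print(first_links)
```

Taking the union of keys for the header keeps rows with different optional fields (such as `blog` or `out_links`) in a single file, at the cost of empty cells where a field is absent.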

Static Crawling with BeautifulSoupCrawler and ParselCrawler

async def run_beautifulsoup_crawl(base_url):
   print("\n=== 1) BeautifulSoupCrawler: fast recursive HTTP crawl ===")
   rows = []
   crawler = BeautifulSoupCrawler(
       parser="html.parser",
       max_requests_per_crawl=30,
       max_request_retries=1,
       respect_robots_txt_file=True,
       concurrency_settings=ConcurrencySettings(
           desired_concurrency=4,
           max_concurrency=6,
       ),
   )
   @crawler.router.default_handler
   async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
       soup = context.soup
       url = context.request.url
       title = normalize_text(soup.title.get_text(" ", strip=True) if soup.title else "")
       meta_description = ""
       meta_tag = soup.find("meta", attrs={"name": "description"})
       if meta_tag:
           meta_description = normalize_text(meta_tag.get("content", ""))
       out_links = []
       for a in soup.select("a[href]"):
           href = a.get("href")
           label = normalize_text(a.get_text(" ", strip=True), 120)
           out_links.append({"href": href, "label": label})
       page_text = normalize_text(soup.get_text(" ", strip=True), 1000)
       if "/products/" in url:
           page_type = "product"
       elif "/docs/" in url:
           page_type = "documentation"
       elif "/blog/" in url:
           page_type = "blog"
       elif "/dynamic" in url:
           page_type = "dynamic-shell"
       else:
           page_type = "index"
       row = {
           "source": "beautifulsoup-http",
           "url": url,
           "title": title,
           "page_type": page_type,
           "meta_description": meta_description,
           "text_preview": page_text,
           "out_links": out_links,
           "json_ld": extract_json_ld(soup),
           "extracted_at_unix": time.time(),
       }
       if page_type == "product":
           article = soup.select_one("article.product")
           if article:
               price_node = soup.select_one(".price")
               row["product"] = {
                   "sku": article.get("data-sku"),
                   "category": article.get("data-category"),
                   "name": normalize_text(
                       soup.select_one(".product-title").get_text(" ", strip=True)
                       if soup.select_one(".product-title")
                       else ""
                   ),
                   "price": money_to_float(price_node.get("data-price") if price_node else None),
                   "rating": float(article.get("data-rating")) if article.get("data-rating") else None,
                   "stock": int(article.get("data-stock")) if article.get("data-stock") else None,
                   "features": [
                       normalize_text(li.get_text(" ", strip=True))
                       for li in soup.select(".features li")
                   ],
               }
       if page_type == "documentation":
           row["doc"] = {
               "headings": [
                   normalize_text(h.get_text(" ", strip=True))
                   for h in soup.select("h2, h3")
               ],
               "code_blocks": [
                   normalize_text(code.get_text(" ", strip=True))
                   for code in soup.select("pre code")
               ],
           }
       if page_type == "blog":
           row["blog"] = {
               "author": soup.select_one(".blog-post").get("data-author") if soup.select_one(".blog-post") else None,
               "reading_time": soup.select_one(".blog-post").get("data-reading-time") if soup.select_one(".blog-post") else None,
               "tags": [
                   normalize_text(tag.get_text(" ", strip=True))
                   for tag in soup.select(".tag")
               ],
           }
       rows.append(row)
       await context.push_data(row)
       await context.enqueue_links(
           include=[Glob(f"{base_url}/**")],
           exclude=[
               Glob(f"{base_url}/admin/**"),
               Glob(f"{base_url}/dynamic.html"),
           ],
       )
   await crawler.run([f"{base_url}/index.html"])
   write_json(OUTPUT_DIR / "beautifulsoup_crawl.json", rows)
   write_csv(OUTPUT_DIR / "beautifulsoup_crawl.csv", rows)
   print(f"BeautifulSoup rows extracted: {len(rows)}")
   return rows
async def run_parsel_precision_crawl(base_url):
   print("\n=== 2) ParselCrawler: precise CSS/XPath extraction from product pages ===")
   rows = []
   product_urls = [
       f"{base_url}/products/product-{safe_slug(product['sku'])}.html"
       for product in PRODUCTS
   ]
   crawler = ParselCrawler(
       max_requests_per_crawl=len(product_urls),
       max_request_retries=1,
       concurrency_settings=ConcurrencySettings(
           desired_concurrency=5,
           max_concurrency=8,
       ),
   )
   @crawler.router.default_handler
   async def request_handler(context: ParselCrawlingContext) -> None:
       selector = context.selector
       title = selector.css("title::text").get()
       sku = selector.css("article.product::attr(data-sku)").get()
       category = selector.css("article.product::attr(data-category)").get()
       rating = selector.css("article.product::attr(data-rating)").get()
       stock = selector.css("article.product::attr(data-stock)").get()
       name = selector.css(".product-title::text").get()
       price = selector.css(".price::attr(data-price)").get()
       features = [
           normalize_text(feature)
           for feature in selector.css(".features li::text").getall()
       ]
       row = {
           "source": "parsel-precision",
           "url": context.request.url,
           "title": normalize_text(title),
           "sku": sku,
           "name": normalize_text(name),
           "category": category,
           "price": money_to_float(price),
           "rating": float(rating) if rating else None,
           "stock": int(stock) if stock else None,
           "features": features,
           "xpath_title": normalize_text(selector.xpath("//title/text()").get()),
       }
       rows.append(row)
       await context.push_data(row)
   await crawler.run(product_urls)
   write_json(OUTPUT_DIR / "parsel_products.json", rows)
   write_csv(OUTPUT_DIR / "parsel_products.csv", rows)
   print(f"Parsel product rows extracted: {len(rows)}")
   return rows

We implement the static crawling part of the workflow using BeautifulSoupCrawler and ParselCrawler. With BeautifulSoupCrawler, we recursively crawl the local website and extract page titles, metadata, text previews, outgoing links, product details, documentation headings, code blocks, and blog tags. With ParselCrawler, we perform more targeted CSS and XPath extraction from product pages to collect clean product-level fields, including SKU, category, price, rating, stock, and features.
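The include and exclude patterns passed to `enqueue_links` decide which discovered links are followed. The sketch below approximates that scoping decision using the standard library's `fnmatch` as a stand-in; it is not Crawlee's `Glob` implementation, and its wildcard semantics may differ in edge cases. `BASE_URL` and `in_scope` are hypothetical names used only for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical base URL; the tutorial's server picks its port at runtime.
BASE_URL = "http://127.0.0.1:8000"

INCLUDE = [f"{BASE_URL}/**"]
EXCLUDE = [f"{BASE_URL}/admin/**", f"{BASE_URL}/dynamic.html"]


def in_scope(url):
    """Return True when a URL matches an include pattern and no exclude pattern."""
    included = any(fnmatchcase(url, pattern) for pattern in INCLUDE)
    excluded = any(fnmatchcase(url, pattern) for pattern in EXCLUDE)
    return included and not excluded


candidates = [
    f"{BASE_URL}/products/product-a.html",
    f"{BASE_URL}/admin/hidden.html",
    f"{BASE_URL}/dynamic.html",
    "http://example.com/index.html",
]
for url in candidates:
    print(url, "->", "enqueue" if in_scope(url) else "skip")
```

Excluding `dynamic.html` from the HTTP crawl leaves that page to the browser-based crawler, which is the one that can see the JavaScript-rendered cards.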

Dynamic Rendering with PlaywrightCrawler and Link Graphs

async def run_playwright_dynamic_crawl(base_url):
   print("\n=== 3) PlaywrightCrawler: browser-rendered JavaScript crawl ===")
   rows = []
   crawler = PlaywrightCrawler(
       max_requests_per_crawl=2,
       max_request_retries=1,
       headless=True,
       browser_type="chromium",
       browser_launch_options={
           "args": ["--no-sandbox", "--disable-dev-shm-usage"],
       },
       goto_options={
           "wait_until": "domcontentloaded",
       },
       concurrency_settings=ConcurrencySettings(
           desired_concurrency=1,
           max_concurrency=2,
       ),
   )
   @crawler.router.default_handler
   async def request_handler(context: PlaywrightCrawlingContext) -> None:
       await context.page.wait_for_selector(".js-card", timeout=10000)
       cards = await context.page.locator(".js-card").evaluate_all(
           """
           (cards) => cards.map((card) => {
             const h3 = card.querySelector("h3");
             const desc = card.querySelector(".desc");
             const price = card.querySelector(".price");
             return {
               sku: card.dataset.sku,
               name: h3 ? h3.textContent.trim() : null,
               description: desc ? desc.textContent.trim() : null,
               price_text: price ? price.textContent.trim() : null,
               price: Number(card.dataset.price),
               stock: Number(card.dataset.stock),
               rendered_text: card.innerText.trim()
             };
           })
           """
       )
       screenshot_bytes = await context.page.screenshot(full_page=True)
       screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
       screenshot_path.write_bytes(screenshot_bytes)
       try:
           kvs = await context.get_key_value_store()
           await kvs.set_value(
               key="dynamic-catalog-full-page",
               value=screenshot_bytes,
               content_type="image/png",
           )
       except Exception as exc:
           print("Key-value store screenshot save skipped:", repr(exc))
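       # Collect this page's rows separately so push_data only sends new records
       # instead of re-pushing everything accumulated in the shared list.
       page_rows = []
       for card in cards:
           row = {
               **card,
               "source": "playwright-rendered-js",
               "url": context.request.url,
               "screenshot_path": str(screenshot_path),
               "extracted_at_unix": time.time(),
           }
           page_rows.append(row)
       rows.extend(page_rows)
       await context.push_data(page_rows)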
   try:
       await crawler.run([f"{base_url}/dynamic.html"])
   except Exception as exc:
       print("Playwright section failed gracefully.")
       print("Reason:", repr(exc))
   write_json(OUTPUT_DIR / "playwright_dynamic.json", rows)
   write_csv(OUTPUT_DIR / "playwright_dynamic.csv", rows)
   print(f"Playwright dynamic rows extracted: {len(rows)}")
   return rows
def flatten_products(rows):
   products = []
   for row in rows:
       if row.get("page_type") == "product" and isinstance(row.get("product"), dict):
           product = row["product"]
           products.append(
               {
                   "source": row.get("source"),
                   "url": row.get("url"),
                   "sku": product.get("sku"),
                   "name": product.get("name"),
                   "category": product.get("category"),
                   "price": product.get("price"),
                   "rating": product.get("rating"),
                   "stock": product.get("stock"),
                   "features": "; ".join(product.get("features", [])),
               }
           )
       elif row.get("source") == "parsel-precision":
           products.append(
               {
                   "source": row.get("source"),
                   "url": row.get("url"),
                   "sku": row.get("sku"),
                   "name": row.get("name"),
                   "category": row.get("category"),
                   "price": row.get("price"),
                   "rating": row.get("rating"),
                   "stock": row.get("stock"),
                   "features": "; ".join(row.get("features", [])),
               }
           )
       elif row.get("source") == "playwright-rendered-js":
           products.append(
               {
                   "source": row.get("source"),
                   "url": row.get("url"),
                   "sku": row.get("sku"),
                   "name": row.get("name"),
                   "category": "dynamic-js",
                   "price": row.get("price") or money_to_float(row.get("price_text")),
                   "rating": None,
                   "stock": row.get("stock"),
                   "features": row.get("description"),
               }
           )
   return products
def absolute_url(base_url, href):
   if not href:
       return None
   if href.startswith("http://") or href.startswith("https://"):
       return href
   if href.startswith("/"):
       return base_url + href
   return base_url + "/" + href
def build_link_graph(base_url, rows):
   graph = nx.DiGraph()
   for row in rows:
       src = row.get("url")
       if not src:
           continue
       graph.add_node(
           src,
           title=row.get("title", ""),
           page_type=row.get("page_type", ""),
       )
       for link in row.get("out_links", []) or []:
           dst = absolute_url(base_url, link.get("href"))
           if not dst:
               continue
           if "/admin/" in dst:
               continue
           graph.add_node(dst)
           graph.add_edge(src, dst, label=link.get("label", ""))
   return graph

We handle dynamic content using PlaywrightCrawler, which opens the JavaScript-rendered page in a headless Chromium browser. We wait for client-side product cards to appear, extract their rendered fields, capture a full-page screenshot, and save the browser-based results for later analysis. We then define helper functions to normalize product records and build a directed link graph from the internal links discovered during crawling.
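The link graph idea can also be sketched without NetworkX. The example below is an alternative illustration, not the tutorial's implementation: it resolves each `href` against the URL of the page it was found on using `urllib.parse.urljoin`, which is one way to handle relative links such as `../index.html` from nested pages. The sample rows and the `build_adjacency` helper are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def build_adjacency(rows, skip_prefix="/admin/"):
    """Build a directed adjacency map {source_url: {target_url, ...}} from crawl rows."""
    graph = {}
    for row in rows:
        src = row["url"]
        graph.setdefault(src, set())
        for link in row.get("out_links", []):
            # Resolve the href relative to the page it appeared on.
            dst = urljoin(src, link["href"])
            if urlparse(dst).path.startswith(skip_prefix):
                continue  # keep excluded paths out of the graph
            graph[src].add(dst)
            graph.setdefault(dst, set())
    return graph


# Hypothetical rows shaped like the BeautifulSoup crawl output.
base = "http://127.0.0.1:8000"
rows = [
    {
        "url": f"{base}/index.html",
        "out_links": [
            {"href": "products/product-a.html"},
            {"href": "admin/hidden.html"},
            {"href": "docs/intro.html"},
        ],
    },
    {
        "url": f"{base}/products/product-a.html",
        "out_links": [{"href": "../index.html"}, {"href": "/docs/intro.html"}],
    },
]

graph = build_adjacency(rows)

# Count incoming edges per node, similar in spirit to graph.in_degree().
in_degree = {node: 0 for node in graph}
for targets in graph.values():
    for dst in targets:
        in_degree[dst] += 1

for node, degree in sorted(in_degree.items()):
    print(degree, node)
```

Storing targets in a set collapses duplicate links between the same two pages into a single edge, which matches how a simple directed graph treats repeated edges.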

Building AI-Ready Outputs and Running the Pipeline

def make_rag_chunks(rows, max_chars=700):
   chunks = []
   for row in rows:
       text = (
           row.get("text_preview")
           or row.get("rendered_text")
           or row.get("description")
           or ""
       )
       text = normalize_text(text)
       if not text:
           continue
       sentences = re.split(r"(?<=[.!?])\s+", text)
       current = ""
       for sentence in sentences:
           if len(current) + len(sentence) + 1 <= max_chars:
               current = (current + " " + sentence).strip()
           else:
               if current:
                   chunks.append(
                       {
                           "chunk_id": hashlib.sha1(
                               (row.get("url", "") + current).encode()
                           ).hexdigest()[:12],
                           "url": row.get("url"),
                           "source": row.get("source"),
                           "page_type": row.get("page_type"),
                           "title": row.get("title") or row.get("name"),
                           "text": current,
                       }
                   )
               current = sentence
       if current:
           chunks.append(
               {
                   "chunk_id": hashlib.sha1(
                       (row.get("url", "") + current).encode()
                   ).hexdigest()[:12],
                   "url": row.get("url"),
                   "source": row.get("source"),
                   "page_type": row.get("page_type"),
                   "title": row.get("title") or row.get("name"),
                   "text": current,
               }
           )
   return chunks
def analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows):
   all_rows = bs4_rows + parsel_rows + playwright_rows
   products = flatten_products(all_rows)
   crawl_df = pd.DataFrame(all_rows)
   product_df = pd.DataFrame(products)
   if not product_df.empty:
       product_df["price"] = pd.to_numeric(product_df["price"], errors="coerce")
       product_df["stock"] = pd.to_numeric(product_df["stock"], errors="coerce")
       product_df["rating"] = pd.to_numeric(product_df["rating"], errors="coerce")
       product_df["inventory_value"] = product_df["price"] * product_df["stock"]
   graph = build_link_graph(base_url, bs4_rows)
   graph_path = OUTPUT_DIR / "site_link_graph.graphml"
   if graph.number_of_nodes() > 0:
       nx.write_graphml(graph, graph_path)
   chunks = make_rag_chunks(all_rows)
   rag_path = OUTPUT_DIR / "rag_chunks.jsonl"
   with rag_path.open("w", encoding="utf-8") as f:
       for chunk in chunks:
           f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
   crawl_json_path = OUTPUT_DIR / "combined_crawl_results.json"
   crawl_json_path.write_text(
       json.dumps(all_rows, ensure_ascii=False, indent=2),
       encoding="utf-8",
   )
   product_csv_path = OUTPUT_DIR / "normalized_product_catalog.csv"
   if not product_df.empty:
       product_df.to_csv(product_csv_path, index=False)
   price_plot_path = OUTPUT_DIR / "product_price_chart.png"
   if not product_df.empty and product_df["price"].notna().any():
       plot_df = product_df.dropna(subset=["price"]).copy()
       plot_df["label"] = plot_df["sku"].fillna("unknown") + "\n" + plot_df["source"].fillna("")
       ax = plot_df.plot(
           kind="bar",
           x="label",
           y="price",
           legend=False,
           figsize=(11, 5),
           title="Extracted Product Prices by Source",
       )
       ax.set_xlabel("Product / extraction source")
       ax.set_ylabel("Price")
       plt.xticks(rotation=35, ha="right")
       plt.tight_layout()
       plt.savefig(price_plot_path, dpi=160)
       plt.show()
   graph_stats = {
       "nodes": graph.number_of_nodes(),
       "edges": graph.number_of_edges(),
       "weakly_connected_components": (
           nx.number_weakly_connected_components(graph)
           if graph.number_of_nodes()
           else 0
       ),
   }
   if graph.number_of_nodes() > 0:
       in_degrees = dict(graph.in_degree())
       out_degrees = dict(graph.out_degree())
       graph_stats["top_in_degree"] = sorted(
           in_degrees.items(),
           key=lambda x: x[1],
           reverse=True,
       )[:5]
       graph_stats["top_out_degree"] = sorted(
           out_degrees.items(),
           key=lambda x: x[1],
           reverse=True,
       )[:5]
   summary = {
       "base_url": base_url,
       "rows_total": len(all_rows),
       "beautifulsoup_rows": len(bs4_rows),
       "parsel_rows": len(parsel_rows),
       "playwright_rows": len(playwright_rows),
       "products_total": len(product_df),
       "rag_chunks_total": len(chunks),
       "graph": graph_stats,
       "outputs": {
           "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
           "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
           "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
           "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
           "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
           "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
           "combined_json": str(crawl_json_path),
           "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
           "rag_jsonl": str(rag_path),
           "graphml": str(graph_path) if graph_path.exists() else None,
           "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
           "screenshots_dir": str(SCREENSHOT_DIR),
       },
   }
   summary_path = OUTPUT_DIR / "run_summary.md"
   summary_path.write_text(
       "# Crawlee Python Advanced Tutorial Run Summary\n\n"
       f"- Local demo site: `{base_url}`\n"
       f"- Total extracted rows: `{summary['rows_total']}`\n"
       f"- BeautifulSoup rows: `{summary['beautifulsoup_rows']}`\n"
       f"- Parsel rows: `{summary['parsel_rows']}`\n"
       f"- Playwright rows: `{summary['playwright_rows']}`\n"
       f"- Normalized products: `{summary['products_total']}`\n"
       f"- RAG chunks: `{summary['rag_chunks_total']}`\n"
       f"- Link graph nodes: `{graph_stats['nodes']}`\n"
       f"- Link graph edges: `{graph_stats['edges']}`\n\n"
       "## Output files\n\n"
       + "\n".join(f"- `{k}`: `{v}`" for k, v in summary["outputs"].items())
       + "\n",
       encoding="utf-8",
   )
   print("\n=== 4) Analysis summary ===")
   print(json.dumps(summary, indent=2, ensure_ascii=False))
   try:
       from IPython.display import display, Markdown, Image as IPImage
       display(Markdown("## Crawlee crawl preview"))
       if not crawl_df.empty:
           preview_cols = [
               col for col in ["source", "page_type", "title", "url"]
               if col in crawl_df.columns
           ]
           display(crawl_df[preview_cols].head(12))
       display(Markdown("## Normalized product catalog"))
       if not product_df.empty:
           display(product_df.head(20))
       if price_plot_path.exists():
           display(Markdown("## Product price chart"))
           display(IPImage(filename=str(price_plot_path)))
       screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
       if screenshot_path.exists():
           display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
           display(IPImage(filename=str(screenshot_path)))
       display(Markdown(f"## Output directory\n`{OUTPUT_DIR}`"))
   except Exception as exc:
       print("Notebook display skipped:", repr(exc))
   return summary
async def main():
   httpd, base_url = start_local_server(SITE_DIR)
   print(f"\nLocal demo website is running at: {base_url}/index.html")
   try:
       bs4_rows = await run_beautifulsoup_crawl(base_url)
       parsel_rows = await run_parsel_precision_crawl(base_url)
       playwright_rows = await run_playwright_dynamic_crawl(base_url)
       summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
       return summary
   finally:
       httpd.shutdown()
       print("\nLocal demo server shut down.")
loop = asyncio.get_event_loop()
summary = loop.run_until_complete(main())
print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
   if file_path.is_file():
       print(" -", file_path)

We process the extracted crawl data into analysis-ready and AI-ready outputs. We create RAG-style JSONL chunks, combine all crawl results, build a normalized product catalog, generate a GraphML link graph, and visualize product prices with Matplotlib. Finally, we run the full pipeline end-to-end, display previews in the notebook, save all generated artifacts, and print the final output file paths.
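The chunking step follows a simple greedy rule: keep appending whole sentences to the current chunk until the next one would exceed the character budget. Here is a minimal standalone sketch of that rule, assuming a hypothetical sample text and URL; `chunk_text` is an illustrative name, and the chunk ID is a truncated SHA-1 of the URL plus the chunk text, as in the pipeline above.

```python
import hashlib
import re


def chunk_text(text, url, max_chars=700):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars becomes its own oversized chunk.
            current = sentence
    if current:
        chunks.append(current)
    return [
        {
            # Same URL and text always produce the same ID, so reruns are stable.
            "chunk_id": hashlib.sha1((url + chunk).encode()).hexdigest()[:12],
            "url": url,
            "text": chunk,
        }
        for chunk in chunks
    ]


# Hypothetical sample input.
sample = (
    "Crawlers fetch pages. Parsers extract fields. "
    "Exporters write JSON and CSV. Chunkers prepare text for retrieval."
)
chunks = chunk_text(sample, "http://127.0.0.1:8000/docs/intro.html", max_chars=50)
for chunk in chunks:
    print(chunk["chunk_id"], len(chunk["text"]), chunk["text"])
```

Because the ID depends only on the URL and the chunk text, re-running the crawl on unchanged pages yields the same IDs, which can help with deduplication in a downstream index.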

Conclusion

We now have a complete Crawlee-based crawling and data engineering pipeline that converts a small website into structured, reusable datasets. We used crawl scoping, robots.txt handling, concurrency settings, link enqueuing, browser rendering, key-value storage, and dataset exports to simulate patterns used in production web crawling systems. We normalized the extracted product data, saved the crawl outputs as JSON and CSV, created a GraphML link graph with NetworkX, generated JSONL chunks for retrieval-augmented generation workflows, and visualized the extracted product prices with Matplotlib.



Sana Hassan

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.