Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export

MarkTechPost

Sana Hassan · 2026-06-21 · via MarkTechPost

In this tutorial, we build a full Crawlee-for-Python workflow that covers environment setup, local website generation, static crawling, dynamic crawling, structured extraction, and downstream data processing. We begin by configuring a compatible Crawlee runtime with pinned Pydantic support, Playwright browser installation, persistent storage directories, and Colab-safe execution handling. We then generate a realistic local demo website containing product pages, documentation pages, blog content, internal links, robots.txt rules, JSON-LD metadata, and JavaScript-rendered catalog items. Using BeautifulSoupCrawler, we perform fast recursive HTML crawling and extract page titles, metadata, text previews, outgoing links, product attributes, documentation headings, code blocks, and blog tags. With ParselCrawler, we run precise CSS- and XPath-based extraction on product detail pages. With PlaywrightCrawler, we render JavaScript content in a headless Chromium browser, wait for dynamic DOM elements to appear, extract client-side data, and capture full-page screenshots.

Setting Up the Crawlee Python Runtime and Helpers

import os
import sys
import re
import csv
import json
import time
import math
import shutil
import socket
import hashlib
import asyncio
import textwrap
import subprocess
import threading
from pathlib import Path
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler
from importlib.metadata import version, PackageNotFoundError
SETUP_SENTINEL = "/content/.crawlee_python_tutorial_setup_done_v2"
def sh(command, check=True, quiet=False):
   print(f"\n$ {command}")
   result = subprocess.run(
       command,
       shell=True,
       text=True,
       stdout=subprocess.PIPE,
       stderr=subprocess.STDOUT,
   )
   if not quiet and result.stdout:
       print(result.stdout[-5000:])
   if check and result.returncode != 0:
       raise RuntimeError(f"Command failed with exit code {result.returncode}: {command}")
   return result.returncode == 0
def package_version(package_name):
   try:
       return version(package_name)
   except PackageNotFoundError:
       return None
def is_good_pydantic_version(v):
   if not v:
       return False
   m = re.match(r"^(\d+)\.(\d+)", v)
   if not m:
       return False
   major, minor = int(m.group(1)), int(m.group(2))
   return major == 2 and minor == 11
current_crawlee = package_version("crawlee")
current_pydantic = package_version("pydantic")
needs_setup = (
   not os.path.exists(SETUP_SENTINEL)
   or current_crawlee is None
   or not is_good_pydantic_version(current_pydantic)
)
if needs_setup:
   print("PHASE 1: Installing compatible Crawlee + Pydantic + Playwright dependencies.")
   print("After this finishes, Colab will restart automatically. Then run this same cell again.")
   sh(f'{sys.executable} -m pip uninstall -y crawlee pydantic pydantic-core', check=False)
   sh(
       f'{sys.executable} -m pip install -q -U '
       f'"pydantic>=2.11,<2.12" '
       f'"crawlee[all]" '
       f'pandas matplotlib networkx nest_asyncio beautifulsoup4 parsel'
   )
   sh(f'{sys.executable} -m playwright install --with-deps chromium', check=False)
   Path(SETUP_SENTINEL).write_text("done", encoding="utf-8")
   print("\nInstalled versions:")
   sh(f'{sys.executable} -m pip show crawlee pydantic pydantic-core', check=False)
   try:
       import google.colab
       print("\nRestarting Colab runtime now. After it reconnects, run this same cell again.")
       os.kill(os.getpid(), 9)
   except Exception:
       raise SystemExit("Setup complete. Restart the runtime/kernel manually, then run this cell again.")
print("PHASE 2: Dependencies are ready. Running the Crawlee tutorial.")
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import nest_asyncio
nest_asyncio.apply()
TUTORIAL_ROOT = Path("/content/crawlee_python_advanced_tutorial")
SITE_DIR = TUTORIAL_ROOT / "demo_site"
OUTPUT_DIR = TUTORIAL_ROOT / "outputs"
STORAGE_DIR = TUTORIAL_ROOT / "crawlee_storage"
SCREENSHOT_DIR = OUTPUT_DIR / "screenshots"
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR]:
   if path.exists():
       shutil.rmtree(path)
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR, SCREENSHOT_DIR]:
   path.mkdir(parents=True, exist_ok=True)
os.environ["CRAWLEE_STORAGE_DIR"] = str(STORAGE_DIR)
os.environ["CRAWLEE_LOG_LEVEL"] = "INFO"
os.environ["CRAWLEE_PURGE_ON_START"] = "true"
from crawlee import Glob, ConcurrencySettings
from crawlee.crawlers import (
   BeautifulSoupCrawler,
   BeautifulSoupCrawlingContext,
   ParselCrawler,
   ParselCrawlingContext,
   PlaywrightCrawler,
   PlaywrightCrawlingContext,
)
try:
   import crawlee
   print("Crawlee version:", crawlee.__version__)
except Exception:
   print("Crawlee imported successfully.")
print("Pydantic version:", package_version("pydantic"))
def safe_slug(value):
   value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
   return value or "item"
def money_to_float(value):
   if value is None:
       return None
   cleaned = re.sub(r"[^0-9.]", "", str(value))
   return float(cleaned) if cleaned else None
def normalize_text(value, max_len=None):
   value = re.sub(r"\s+", " ", value or "").strip()
   return value[:max_len] if max_len else value
def write_file(path, content):
   path = Path(path)
   path.parent.mkdir(parents=True, exist_ok=True)
   path.write_text(textwrap.dedent(content).strip() + "\n", encoding="utf-8")

We begin by preparing the Colab runtime for the Crawlee tutorial. The setup cell installs Crawlee with all extras, pins Pydantic to the 2.11 series, installs Playwright's Chromium build and the analysis libraries, and then restarts the runtime so the new packages load cleanly. On the second run, we configure storage folders, environment variables, and crawler imports, and define small helpers for slugs, price parsing, text normalization, and file writing.
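To make the helper behavior concrete, here is a small self-contained sketch that re-declares the three text helpers from the setup cell and runs them on a few inputs. The sample strings are our own illustrations, not values produced by the crawl.

```python
import re

def safe_slug(value):
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
    return value or "item"

def money_to_float(value):
    # Keep only digits and dots, so currency symbols and commas are dropped.
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.]", "", str(value))
    return float(cleaned) if cleaned else None

def normalize_text(value, max_len=None):
    # Squash whitespace runs into single spaces, then optionally truncate.
    value = re.sub(r"\s+", " ", value or "").strip()
    return value[:max_len] if max_len else value

# Illustrative inputs (assumptions for this sketch).
slug = safe_slug("CRW-101")
empty_slug = safe_slug("###")
price = money_to_float("$1,249.50")
no_price = money_to_float("call us")
preview = normalize_text("  Crawlee \n for   Python  ", max_len=11)

print(slug, empty_slug, price, no_price, repr(preview))
```

One caveat worth keeping in mind: because money_to_float keeps every dot, an input containing more than one dot would make float() raise a ValueError, so it suits clean price strings like the ones this demo site emits.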

Generating the Demo Website and Product Catalog

PRODUCTS = [
   {
       "sku": "CRW-101",
       "name": "Crawler Reliability Kit",
       "category": "automation",
       "price": 149.0,
       "rating": 4.8,
       "stock": 18,
       "features": ["retry policy", "queue replay", "structured logs"],
       "related": ["CRW-202", "CRW-303"],
   },
   {
       "sku": "CRW-202",
       "name": "Playwright Rendering Pack",
       "category": "browser",
       "price": 249.0,
       "rating": 4.7,
       "stock": 9,
       "features": ["headless chromium", "screenshots", "dynamic DOM extraction"],
       "related": ["CRW-101", "CRW-404"],
   },
   {
       "sku": "CRW-303",
       "name": "RAG Extraction Bundle",
       "category": "ai-data",
       "price": 199.0,
       "rating": 4.9,
       "stock": 13,
       "features": ["clean text chunks", "metadata capture", "JSONL export"],
       "related": ["CRW-101", "CRW-505"],
   },
   {
       "sku": "CRW-404",
       "name": "Anti-Fragile Session Toolkit",
       "category": "resilience",
       "price": 299.0,
       "rating": 4.6,
       "stock": 5,
       "features": ["session rotation", "state recovery", "graceful failures"],
       "related": ["CRW-202", "CRW-505"],
   },
   {
       "sku": "CRW-505",
       "name": "Data Export Control Plane",
       "category": "storage",
       "price": 179.0,
       "rating": 4.5,
       "stock": 21,
       "features": ["datasets", "key-value store", "CSV and JSON export"],
       "related": ["CRW-303", "CRW-404"],
   },
]
def layout(title, body, extra_head="", extra_script=""):
   css = """
   <style>
     body {
       font-family: Inter, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
       margin: 0;
       background: #f7f7fb;
       color: #1f2430;
     }
     header {
       background: #202638;
       color: white;
       padding: 28px 40px;
     }
     nav a {
       color: #dbe7ff;
       margin-right: 18px;
       text-decoration: none;
       font-weight: 600;
     }
     main {
       max-width: 1050px;
       margin: 0 auto;
       padding: 32px;
     }
     .grid {
       display: grid;
       grid-template-columns: repeat(auto-fit, minmax(230px, 1fr));
       gap: 18px;
     }
     .card, article, .panel {
       background: white;
       border: 1px solid #e5e7ef;
       border-radius: 16px;
       padding: 20px;
       box-shadow: 0 8px 25px rgba(20, 30, 60, 0.05);
     }
     .price {
       font-size: 1.3rem;
       font-weight: 800;
     }
     .tag {
       display: inline-block;
       background: #edf2ff;
       border: 1px solid #d6e0ff;
       border-radius: 999px;
       padding: 4px 10px;
       margin: 3px;
       font-size: 0.82rem;
     }
     .stock-low {
       color: #b42318;
       font-weight: 700;
     }
     .stock-ok {
       color: #067647;
       font-weight: 700;
     }
     code, pre {
       background: #111827;
       color: #d1fae5;
       border-radius: 10px;
     }
     pre {
       padding: 16px;
       overflow-x: auto;
     }
     footer {
       padding: 30px 40px;
       color: #606779;
     }
   </style>
   """
   return f"""
   <!doctype html>
   <html lang="en">
     <head>
       <meta charset="utf-8">
       <meta name="viewport" content="width=device-width, initial-scale=1">
       <meta name="description" content="{title} page for a Crawlee Python tutorial demo website.">
       <title>{title}</title>
       {css}
       {extra_head}
     </head>
     <body>
       <header>
         <h1>{title}</h1>
         <nav>
           <a href="/index.html">Home</a>
           <a href="/products/product-crw-101.html">Products</a>
           <a href="/docs/getting-started.html">Docs</a>
           <a href="/blog/crawling-at-scale.html">Blog</a>
           <a href="/dynamic.html">Dynamic JS Page</a>
           <a href="/admin/hidden.html">Admin</a>
         </nav>
       </header>
       <main>{body}</main>
       <footer>Local demo website generated for Crawlee Python advanced tutorial.</footer>
       {extra_script}
     </body>
   </html>
   """
def build_demo_site():
   write_file(
       SITE_DIR / "robots.txt",
       """
       User-agent: *
       Disallow: /admin/
       Allow: /
       """,
   )
   product_cards = []
   for product in PRODUCTS:
       product_cards.append(
           f"""
           <div class="card product-teaser" data-sku="{product['sku']}" data-category="{product['category']}">
             <h2><a href="/products/product-{safe_slug(product['sku'])}.html">{product['name']}</a></h2>
             <p>{product['category']} crawler module with rating {product['rating']}.</p>
             <p class="price" data-price="{product['price']}">${product['price']:.2f}</p>
             <p class="{'stock-low' if product['stock'] < 10 else 'stock-ok'}">Stock: {product['stock']}</p>
           </div>
           """
       )
   write_file(
       SITE_DIR / "index.html",
       layout(
           "Crawlee Demo Commerce + Docs Hub",
           f"""
           <section class="panel">
             <h2>Why this site exists</h2>
             <p>
               This local website gives us predictable pages for testing Crawlee without scraping a third-party website.
               We include static HTML pages, documentation pages, product detail pages, a blog article, robots.txt,
               and a JavaScript-rendered page.
             </p>
           </section>
           <h2>Featured crawler modules</h2>
           <section class="grid">
             {''.join(product_cards)}
           </section>
           <section class="panel">
             <h2>Internal links for recursive crawling</h2>
             <ul>
               <li><a href="/docs/getting-started.html">Getting started guide</a></li>
               <li><a href="/docs/advanced-routing.html">Advanced routing guide</a></li>
               <li><a href="/blog/crawling-at-scale.html">Crawling at scale article</a></li>
               <li><a href="/dynamic.html">JavaScript-rendered catalog</a></li>
               <li><a href="/admin/hidden.html">Admin page blocked by robots and crawler filters</a></li>
             </ul>
           </section>
           """,
       ),
   )
   for product in PRODUCTS:
       related_links = "\n".join(
           f'<li><a class="related-link" href="/products/product-{safe_slug(sku)}.html">{sku}</a></li>'
           for sku in product["related"]
       )
       feature_list = "\n".join(f"<li>{feature}</li>" for feature in product["features"])
       json_ld = json.dumps(
           {
               "@context": "https://schema.org",
               "@type": "Product",
               "sku": product["sku"],
               "name": product["name"],
               "category": product["category"],
               "offers": {
                   "@type": "Offer",
                   "price": product["price"],
                   "priceCurrency": "USD",
               },
               "aggregateRating": {
                   "@type": "AggregateRating",
                   "ratingValue": product["rating"],
               },
           },
           indent=2,
       )
       write_file(
           SITE_DIR / "products" / f"product-{safe_slug(product['sku'])}.html",
           layout(
               f"{product['name']} | Product Detail",
               f"""
               <article class="product"
                        data-sku="{product['sku']}"
                        data-category="{product['category']}"
                        data-rating="{product['rating']}"
                        data-stock="{product['stock']}">
                 <h2 class="product-title">{product['name']}</h2>
                 <p class="sku">SKU: <strong>{product['sku']}</strong></p>
                 <p class="category">Category: <strong>{product['category']}</strong></p>
                 <p class="price" data-price="{product['price']}">${product['price']:.2f}</p>
                 <p class="rating">Rating: {product['rating']} / 5</p>
                 <p class="{'stock-low' if product['stock'] < 10 else 'stock-ok'}">Stock: {product['stock']}</p>
                 <h3>Features</h3>
                 <ul class="features">{feature_list}</ul>
                 <h3>Related modules</h3>
                 <ul>{related_links}</ul>
               </article>
               <script type="application/ld+json">{json_ld}</script>
               """,
           ),
       )

We create a realistic product catalog that becomes the structured data source for our demo website. We define reusable HTML layout logic, styling, navigation, and page templates to make the local website look and behave like a small commercial and documentation portal. We then generate the homepage and product detail pages, including prices, ratings, stock levels, product features, related links, and JSON-LD metadata.
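As a quick illustration of how one catalog entry turns into a page path and a JSON-LD block, the sketch below builds both for a single product and then parses the JSON-LD back out. The product_json_ld helper is a hypothetical name introduced for this example; the tutorial builds the same dictionary inline.

```python
import json
import re

def safe_slug(value):
    value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
    return value or "item"

# One entry from the tutorial's PRODUCTS list, trimmed to the fields used here.
product = {
    "sku": "CRW-101",
    "name": "Crawler Reliability Kit",
    "category": "automation",
    "price": 149.0,
    "rating": 4.8,
}

def product_json_ld(item):
    # Hypothetical helper: mirrors the schema.org Product shape used in the tutorial.
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "sku": item["sku"],
        "name": item["name"],
        "category": item["category"],
        "offers": {"@type": "Offer", "price": item["price"], "priceCurrency": "USD"},
        "aggregateRating": {"@type": "AggregateRating", "ratingValue": item["rating"]},
    }

# The file path follows the same naming rule as the generated product pages.
page_path = f"products/product-{safe_slug(product['sku'])}.html"

script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(product_json_ld(product))
    + "</script>"
)

# Recover the payload by taking the text between the opening and closing tags.
payload = script_tag[script_tag.index(">") + 1 : script_tag.rindex("<")]
round_trip = json.loads(payload)

print(page_path)
print(round_trip["offers"])
```

Embedding the metadata as JSON-LD gives the crawlers a second, machine-readable copy of each product, which is useful for cross-checking values scraped from the visible HTML.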

Adding Docs, Blog, Dynamic, and Admin Pages

   write_file(
       SITE_DIR / "docs" / "getting-started.html",
       layout(
           "Getting Started with Reliable Crawlers",
           """
           <article class="doc" data-doc-id="getting-started">
             <h2>HTTP-first crawling strategy</h2>
             <p>
               We start with HTTP crawlers because they are lightweight and efficient.
               Browser crawling is reserved for pages that need JavaScript rendering.
             </p>
             <h2>Core extraction fields</h2>
             <p>
               Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata.
             </p>
             <pre><code>crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)</code></pre>
             <p><a href="/docs/advanced-routing.html">Next: advanced routing</a></p>
           </article>
           """,
       ),
   )
   write_file(
       SITE_DIR / "docs" / "advanced-routing.html",
       layout(
           "Advanced Routing and Storage",
           """
           <article class="doc" data-doc-id="advanced-routing">
             <h2>Queue filtering</h2>
             <p>
               We filter links to keep the crawl focused on the same local domain and skip admin pages.
             </p>
             <h2>Storage design</h2>
             <p>
               Structured rows go to datasets. Binary screenshots and snapshots go to a key-value store.
             </p>
             <pre><code>await context.enqueue_links(include=[Glob("https://example.com/**")])</code></pre>
             <p><a href="/blog/crawling-at-scale.html">Read the scaling article</a></p>
           </article>
           """,
       ),
   )
   write_file(
       SITE_DIR / "blog" / "crawling-at-scale.html",
       layout(
           "Crawling at Scale",
           """
           <article class="blog-post" data-author="demo-team" data-reading-time="7">
             <h2>Scaling crawler jobs without losing reliability</h2>
             <p>
               Production crawlers need controlled concurrency, retry behavior, stable request queues,
               structured exports, and monitoring-ready output.
             </p>
             <p>
               For AI data workflows, we also normalize text, preserve source URLs, create chunks,
               and record extraction provenance.
             </p>
             <span class="tag">queues</span>
             <span class="tag">datasets</span>
             <span class="tag">rag</span>
             <span class="tag">playwright</span>
           </article>
           """,
       ),
   )
   dynamic_items = json.dumps(
       [
           {
               "sku": "JS-900",
               "name": "Dynamic Inventory Scanner",
               "price": 329.0,
               "stock": 4,
               "desc": "Rendered only after JavaScript executes.",
           },
           {
               "sku": "JS-901",
               "name": "Client-Side Review Miner",
               "price": 279.0,
               "stock": 11,
               "desc": "Created by browser-side DOM manipulation.",
           },
           {
               "sku": "JS-902",
               "name": "Async Catalog Watcher",
               "price": 389.0,
               "stock": 7,
               "desc": "Useful for testing PlaywrightCrawler extraction.",
           },
       ],
       indent=2,
   )
   dynamic_script = f"""
   <script>
     const dynamicItems = {dynamic_items};
     function renderItems() {{
       const root = document.querySelector("#dynamic-products");
       root.innerHTML = "";
       for (const item of dynamicItems) {{
         const card = document.createElement("div");
         card.className = "card js-card";
         card.dataset.sku = item.sku;
         card.dataset.price = item.price;
         card.dataset.stock = item.stock;
         card.innerHTML = `
           <h3>${{item.name}}</h3>
           <p class="desc">${{item.desc}}</p>
           <p class="price">$${{item.price.toFixed(2)}}</p>
           <p class="${{item.stock < 8 ? "stock-low" : "stock-ok"}}">Stock: ${{item.stock}}</p>
         `;
         root.appendChild(card);
       }}
       document.querySelector("#render-status").textContent =
         "Rendered " + dynamicItems.length + " JavaScript items.";
     }}
     setTimeout(renderItems, 600);
   </script>
   """
   write_file(
       SITE_DIR / "dynamic.html",
       layout(
           "JavaScript Rendered Catalog",
           """
           <section class="panel">
             <h2>Dynamic content test</h2>
             <p>
               A plain HTTP crawler can download this page, but it will not see the cards below until JavaScript runs.
               PlaywrightCrawler opens a real browser and extracts the rendered DOM.
             </p>
             <p id="render-status">Waiting for JavaScript rendering...</p>
           </section>
           <section id="dynamic-products" class="grid"></section>
           """,
           extra_script=dynamic_script,
       ),
   )
   write_file(
       SITE_DIR / "admin" / "hidden.html",
       layout(
           "Hidden Admin Page",
           """
           <article class="panel">
             <h2>This page should be skipped</h2>
             <p>
               The crawler excludes this admin path to demonstrate control over the crawl scope.
             </p>
           </article>
           """,
       ),
   )
build_demo_site()
print(f"Demo site generated at: {SITE_DIR}")
class QuietHandler(SimpleHTTPRequestHandler):
   def log_message(self, format, *args):
       pass
def start_local_server(directory):
   # Bind to port 0 so the OS assigns a free port atomically. Probing a port
   # with a separate socket and rebinding later can race with other processes.
   handler = partial(QuietHandler, directory=str(directory))
   httpd = ThreadingHTTPServer(("127.0.0.1", 0), handler)
   port = httpd.server_address[1]
   thread = threading.Thread(target=httpd.serve_forever, daemon=True)
   thread.start()
   base_url = f"http://127.0.0.1:{port}"
   time.sleep(0.5)
   return httpd, base_url
def extract_json_ld(soup):
   blocks = []
   for script in soup.select('script[type="application/ld+json"]'):
       raw = script.string or script.get_text()
       if not raw:
           continue
       try:
           blocks.append(json.loads(raw))
       except Exception:
           blocks.append({"raw": raw})
   return blocks
def write_json(path, rows):
   path = Path(path)
   path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
def write_csv(path, rows):
   path = Path(path)
   if not rows:
       path.write_text("", encoding="utf-8")
       return
   flattened = []
   for row in rows:
       flat = {}
       for key, value in row.items():
           if isinstance(value, (list, dict)):
               flat[key] = json.dumps(value, ensure_ascii=False)
           else:
               flat[key] = value
       flattened.append(flat)
   fieldnames = sorted({key for row in flattened for key in row.keys()})
   with path.open("w", newline="", encoding="utf-8") as f:
       writer = csv.DictWriter(f, fieldnames=fieldnames)
       writer.writeheader()
       writer.writerows(flattened)

We expand the demo website by adding documentation pages, a blog article, a JavaScript-rendered catalog page, and an admin page intended to be excluded from crawling. We use these pages to test different crawling scenarios, including static HTML extraction, documentation parsing, blog metadata extraction, dynamic browser rendering, and crawl filtering. We also start a local HTTP server and define utilities to extract JSON-LD content and export crawl results to JSON and CSV.
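The sketch below shows, under simplified assumptions, how the local server and the robots.txt rules fit together. It serves a temporary directory containing only a robots.txt file with the same rules as the demo site, then uses the standard library's RobotFileParser to check which paths are allowed. This is an independent check with stdlib tools, not Crawlee's own robots handling.

```python
import tempfile
import threading
import urllib.request
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler
from pathlib import Path
from urllib.robotparser import RobotFileParser

class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # silence per-request logging

# Temporary site with the same robots rules as the demo site.
site = Path(tempfile.mkdtemp())
(site / "robots.txt").write_text(
    "User-agent: *\nDisallow: /admin/\nAllow: /\n", encoding="utf-8"
)

# Port 0 lets the OS choose a free port.
httpd = ThreadingHTTPServer(("127.0.0.1", 0), partial(QuietHandler, directory=str(site)))
threading.Thread(target=httpd.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{httpd.server_address[1]}"

# An empty ProxyHandler avoids routing the loopback request through any proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(f"{base_url}/robots.txt", timeout=10) as resp:
    robots_lines = resp.read().decode("utf-8").splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)
admin_allowed = parser.can_fetch("*", f"{base_url}/admin/hidden.html")
index_allowed = parser.can_fetch("*", f"{base_url}/index.html")

httpd.shutdown()
httpd.server_close()

print("admin allowed:", admin_allowed)
print("index allowed:", index_allowed)
```

Running a check like this before the crawl is a cheap way to confirm that the generated robots.txt says what we intend, independent of any crawler configuration.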

Static Crawling with BeautifulSoupCrawler and ParselCrawler

async def run_beautifulsoup_crawl(base_url):
   print("\n=== 1) BeautifulSoupCrawler: fast recursive HTTP crawl ===")
   rows = []
   crawler = BeautifulSoupCrawler(
       parser="html.parser",
       max_requests_per_crawl=30,
       max_request_retries=1,
       respect_robots_txt_file=True,
       concurrency_settings=ConcurrencySettings(
           desired_concurrency=4,
           max_concurrency=6,
       ),
   )
   @crawler.router.default_handler
   async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
       soup = context.soup
       url = context.request.url
       title = normalize_text(soup.title.get_text(" ", strip=True) if soup.title else "")
       meta_description = ""
       meta_tag = soup.find("meta", attrs={"name": "description"})
       if meta_tag:
           meta_description = normalize_text(meta_tag.get("content", ""))
       out_links = []
       for a in soup.select("a[href]"):
           href = a.get("href")
           label = normalize_text(a.get_text(" ", strip=True), 120)
           out_links.append({"href": href, "label": label})
       page_text = normalize_text(soup.get_text(" ", strip=True), 1000)
       if "/products/" in url:
           page_type = "product"
       elif "/docs/" in url:
           page_type = "documentation"
       elif "/blog/" in url:
           page_type = "blog"
       elif "/dynamic" in url:
           page_type = "dynamic-shell"
       else:
           page_type = "index"
       row = {
           "source": "beautifulsoup-http",
           "url": url,
           "title": title,
           "page_type": page_type,
           "meta_description": meta_description,
           "text_preview": page_text,
           "out_links": out_links,
           "json_ld": extract_json_ld(soup),
           "extracted_at_unix": time.time(),
       }
       if page_type == "product":
           article = soup.select_one("article.product")
           if article:
               price_node = soup.select_one(".price")
               row["product"] = {
                   "sku": article.get("data-sku"),
                   "category": article.get("data-category"),
                   "name": normalize_text(
                       soup.select_one(".product-title").get_text(" ", strip=True)
                       if soup.select_one(".product-title")
                       else ""
                   ),
                   "price": money_to_float(price_node.get("data-price") if price_node else None),
                   "rating": float(article.get("data-rating")) if article.get("data-rating") else None,
                   "stock": int(article.get("data-stock")) if article.get("data-stock") else None,
                   "features": [
                       normalize_text(li.get_text(" ", strip=True))
                       for li in soup.select(".features li")
                   ],
               }
       if page_type == "documentation":
           row["doc"] = {
               "headings": [
                   normalize_text(h.get_text(" ", strip=True))
                   for h in soup.select("h2, h3")
               ],
               "code_blocks": [
                   normalize_text(code.get_text(" ", strip=True))
                   for code in soup.select("pre code")
               ],
           }
       if page_type == "blog":
           # Look up the blog article node once and reuse it for both attributes.
           post = soup.select_one(".blog-post")
           row["blog"] = {
               "author": post.get("data-author") if post else None,
               "reading_time": post.get("data-reading-time") if post else None,
               "tags": [
                   normalize_text(tag.get_text(" ", strip=True))
                   for tag in soup.select(".tag")
               ],
           }
       rows.append(row)
       await context.push_data(row)
       await context.enqueue_links(
           include=[Glob(f"{base_url}/**")],
           exclude=[
               Glob(f"{base_url}/admin/**"),
               Glob(f"{base_url}/dynamic.html"),
           ],
       )
   await crawler.run([f"{base_url}/index.html"])
   write_json(OUTPUT_DIR / "beautifulsoup_crawl.json", rows)
   write_csv(OUTPUT_DIR / "beautifulsoup_crawl.csv", rows)
   print(f"BeautifulSoup rows extracted: {len(rows)}")
   return rows
async def run_parsel_precision_crawl(base_url):
   print("\n=== 2) ParselCrawler: precise CSS/XPath extraction from product pages ===")
   rows = []
   product_urls = [
       f"{base_url}/products/product-{safe_slug(product['sku'])}.html"
       for product in PRODUCTS
   ]
   crawler = ParselCrawler(
       max_requests_per_crawl=len(product_urls),
       max_request_retries=1,
       concurrency_settings=ConcurrencySettings(
           desired_concurrency=5,
           max_concurrency=8,
       ),
   )
   @crawler.router.default_handler
   async def request_handler(context: ParselCrawlingContext) -> None:
       selector = context.selector
       title = selector.css("title::text").get()
       sku = selector.css("article.product::attr(data-sku)").get()
       category = selector.css("article.product::attr(data-category)").get()
       rating = selector.css("article.product::attr(data-rating)").get()
       stock = selector.css("article.product::attr(data-stock)").get()
       name = selector.css(".product-title::text").get()
       price = selector.css(".price::attr(data-price)").get()
       features = [
           normalize_text(feature)
           for feature in selector.css(".features li::text").getall()
       ]
       row = {
           "source": "parsel-precision",
           "url": context.request.url,
           "title": normalize_text(title),
           "sku": sku,
           "name": normalize_text(name),
           "category": category,
           "price": money_to_float(price),
           "rating": float(rating) if rating else None,
           "stock": int(stock) if stock else None,
           "features": features,
           "xpath_title": normalize_text(selector.xpath("//title/text()").get()),
       }
       rows.append(row)
       await context.push_data(row)
   await crawler.run(product_urls)
   write_json(OUTPUT_DIR / "parsel_products.json", rows)
   write_csv(OUTPUT_DIR / "parsel_products.csv", rows)
   print(f"Parsel product rows extracted: {len(rows)}")
   return rows

We implement the static crawling part of the workflow using BeautifulSoupCrawler and ParselCrawler. With BeautifulSoupCrawler, we recursively crawl the local website and extract page titles, metadata, text previews, outgoing links, product details, documentation headings, code blocks, and blog tags. With ParselCrawler, we perform more targeted CSS and XPath extraction from product pages to collect clean product-level fields, including SKU, category, price, rating, stock, and features.
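To illustrate the scoping logic behind the recursive crawl, here is a standard-library approximation of the include/exclude decision and the URL-based page typing. It uses fnmatchcase as a stand-in for pattern matching, which is an assumption of this sketch and not how Crawlee's Glob is implemented; BASE_URL is also a placeholder, since the tutorial picks a free port at runtime.

```python
from fnmatch import fnmatchcase

BASE_URL = "http://127.0.0.1:8000"  # assumed placeholder for the local server

INCLUDE = [f"{BASE_URL}/**"]
EXCLUDE = [f"{BASE_URL}/admin/**", f"{BASE_URL}/dynamic.html"]

def in_scope(url):
    # A URL is crawled only if it matches an include pattern and no exclude pattern.
    included = any(fnmatchcase(url, pattern) for pattern in INCLUDE)
    excluded = any(fnmatchcase(url, pattern) for pattern in EXCLUDE)
    return included and not excluded

def classify(url):
    # Same substring rules the BeautifulSoup handler uses to label pages.
    if "/products/" in url:
        return "product"
    if "/docs/" in url:
        return "documentation"
    if "/blog/" in url:
        return "blog"
    if "/dynamic" in url:
        return "dynamic-shell"
    return "index"

candidates = [
    f"{BASE_URL}/index.html",
    f"{BASE_URL}/products/product-crw-101.html",
    f"{BASE_URL}/docs/getting-started.html",
    f"{BASE_URL}/admin/hidden.html",
    f"{BASE_URL}/dynamic.html",
    "https://example.com/",
]

decisions = {url: (in_scope(url), classify(url)) for url in candidates}
for url, (allowed, page_type) in decisions.items():
    print(f"{'crawl' if allowed else 'skip ':5} {page_type:14} {url}")
```

In the tutorial the admin path is guarded twice, once by robots.txt and once by the exclude pattern, so either mechanism alone would keep it out of the results.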

Dynamic Rendering with PlaywrightCrawler and Link Graphs

async def run_playwright_dynamic_crawl(base_url):
   print("\n=== 3) PlaywrightCrawler: browser-rendered JavaScript crawl ===")
   rows = []
   crawler = PlaywrightCrawler(
       max_requests_per_crawl=2,
       max_request_retries=1,
       headless=True,
       browser_type="chromium",
       browser_launch_options={
           "args": ["--no-sandbox", "--disable-dev-shm-usage"],
       },
       goto_options={
           "wait_until": "domcontentloaded",
       },
       concurrency_settings=ConcurrencySettings(
           desired_concurrency=1,
           max_concurrency=2,
       ),
   )
   @crawler.router.default_handler
   async def request_handler(context: PlaywrightCrawlingContext) -> None:
       await context.page.wait_for_selector(".js-card", timeout=10000)
       cards = await context.page.locator(".js-card").evaluate_all(
           """
           (cards) => cards.map((card) => {
             const h3 = card.querySelector("h3");
             const desc = card.querySelector(".desc");
             const price = card.querySelector(".price");
             return {
               sku: card.dataset.sku,
               name: h3 ? h3.textContent.trim() : null,
               description: desc ? desc.textContent.trim() : null,
               price_text: price ? price.textContent.trim() : null,
               price: Number(card.dataset.price),
               stock: Number(card.dataset.stock),
               rendered_text: card.innerText.trim()
             };
           })
           """
       )
       screenshot_bytes = await context.page.screenshot(full_page=True)
       screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
       screenshot_path.write_bytes(screenshot_bytes)
       try:
           kvs = await context.get_key_value_store()
           await kvs.set_value(
               key="dynamic-catalog-full-page",
               value=screenshot_bytes,
               content_type="image/png",
           )
       except Exception as exc:
           print("Key-value store screenshot save skipped:", repr(exc))
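       # Collect this page's rows separately so push_data receives only the new rows,
       # not the cumulative list, which would duplicate data if more pages were crawled.
       page_rows = []
       for card in cards:
           row = {
               **card,
               "source": "playwright-rendered-js",
               "url": context.request.url,
               "screenshot_path": str(screenshot_path),
               "extracted_at_unix": time.time(),
           }
           page_rows.append(row)
       rows.extend(page_rows)
       await context.push_data(page_rows)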
   try:
       await crawler.run([f"{base_url}/dynamic.html"])
   except Exception as exc:
       print("Playwright section failed gracefully.")
       print("Reason:", repr(exc))
   write_json(OUTPUT_DIR / "playwright_dynamic.json", rows)
   write_csv(OUTPUT_DIR / "playwright_dynamic.csv", rows)
   print(f"Playwright dynamic rows extracted: {len(rows)}")
   return rows
def flatten_products(rows):
   products = []
   for row in rows:
       if row.get("page_type") == "product" and isinstance(row.get("product"), dict):
           product = row["product"]
           products.append(
               {
                   "source": row.get("source"),
                   "url": row.get("url"),
                   "sku": product.get("sku"),
                   "name": product.get("name"),
                   "category": product.get("category"),
                   "price": product.get("price"),
                   "rating": product.get("rating"),
                   "stock": product.get("stock"),
                   "features": "; ".join(product.get("features") or []),
               }
           )
       elif row.get("source") == "parsel-precision":
           products.append(
               {
                   "source": row.get("source"),
                   "url": row.get("url"),
                   "sku": row.get("sku"),
                   "name": row.get("name"),
                   "category": row.get("category"),
                   "price": row.get("price"),
                   "rating": row.get("rating"),
                   "stock": row.get("stock"),
                   "features": "; ".join(row.get("features") or []),
               }
           )
       elif row.get("source") == "playwright-rendered-js":
           products.append(
               {
                   "source": row.get("source"),
                   "url": row.get("url"),
                   "sku": row.get("sku"),
                   "name": row.get("name"),
                   "category": "dynamic-js",
                   "price": row.get("price") or money_to_float(row.get("price_text")),
                   "rating": None,
                   "stock": row.get("stock"),
                   "features": row.get("description"),
               }
           )
   return products
def absolute_url(base_url, href):
   if not href:
       return None
   if href.startswith("http://") or href.startswith("https://"):
       return href
   if href.startswith("/"):
       return base_url + href
   return base_url + "/" + href
def build_link_graph(base_url, rows):
   graph = nx.DiGraph()
   for row in rows:
       src = row.get("url")
       if not src:
           continue
       graph.add_node(
           src,
           title=row.get("title", ""),
           page_type=row.get("page_type", ""),
       )
       for link in row.get("out_links", []) or []:
           dst = absolute_url(base_url, link.get("href"))
           if not dst:
               continue
           if "/admin/" in dst:
               continue
           graph.add_node(dst)
           graph.add_edge(src, dst, label=link.get("label", ""))
   return graph

We handle dynamic content with PlaywrightCrawler, which opens the JavaScript-rendered page in a headless Chromium browser. We wait up to 10 seconds for the client-side product cards to appear, extract their rendered fields, capture a full-page screenshot, and save the browser-based results for later analysis. We then define helper functions that normalize product records from all three crawlers into one schema and build a directed link graph from the crawled links, skipping any that point to admin pages.
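As a rough illustration of the link-graph step, the sketch below resolves each href against the page it was found on with the standard library's urljoin, drops URL fragments, and stores edges in a plain dictionary rather than a NetworkX DiGraph. The function name build_adjacency, the sample rows, and the example.test host are assumptions made for this example; they are not part of the tutorial's code.

```python
from urllib.parse import urljoin, urldefrag


def build_adjacency(rows, skip_substring="/admin/"):
    """Return {source_url: sorted list of resolved target URLs}."""
    adjacency = {}
    for row in rows:
        src = row.get("url")
        if not src:
            continue
        targets = adjacency.setdefault(src, set())
        for link in row.get("out_links") or []:
            href = link.get("href")
            if not href:
                continue
            # Resolve relative to the page the link was found on, then drop any #fragment.
            dst, _fragment = urldefrag(urljoin(src, href))
            if skip_substring in dst:
                continue
            targets.add(dst)
    return {src: sorted(dsts) for src, dsts in adjacency.items()}


# Hypothetical rows shaped like the crawler output (a url plus out_links).
sample_rows = [
    {
        "url": "http://example.test/docs/intro.html",
        "out_links": [
            {"href": "../index.html"},
            {"href": "setup.html#install"},
            {"href": "/admin/secret.html"},
        ],
    }
]
graph = build_adjacency(sample_rows)
```

Resolving against the source page rather than the site root keeps links such as ../index.html from producing duplicate nodes for the same page, which matters once degree statistics are computed on the graph.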

Building AI-Ready Outputs and Running the Pipeline

def make_rag_chunks(rows, max_chars=700):
   chunks = []
   for row in rows:
       text = (
           row.get("text_preview")
           or row.get("rendered_text")
           or row.get("description")
           or ""
       )
       text = normalize_text(text)
       if not text:
           continue
       sentences = re.split(r"(?<=[.!?])\s+", text)
       current = ""
       for sentence in sentences:
           if len(current) + len(sentence) + 1 <= max_chars:
               current = (current + " " + sentence).strip()
           else:
               if current:
                   chunks.append(
                       {
                           "chunk_id": hashlib.sha1(
                               (row.get("url", "") + current).encode()
                           ).hexdigest()[:12],
                           "url": row.get("url"),
                           "source": row.get("source"),
                           "page_type": row.get("page_type"),
                           "title": row.get("title") or row.get("name"),
                           "text": current,
                       }
                   )
               current = sentence
       if current:
           chunks.append(
               {
                   "chunk_id": hashlib.sha1(
                       (row.get("url", "") + current).encode()
                   ).hexdigest()[:12],
                   "url": row.get("url"),
                   "source": row.get("source"),
                   "page_type": row.get("page_type"),
                   "title": row.get("title") or row.get("name"),
                   "text": current,
               }
           )
   return chunks
def analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows):
   all_rows = bs4_rows + parsel_rows + playwright_rows
   products = flatten_products(all_rows)
   crawl_df = pd.DataFrame(all_rows)
   product_df = pd.DataFrame(products)
   if not product_df.empty:
       product_df["price"] = pd.to_numeric(product_df["price"], errors="coerce")
       product_df["stock"] = pd.to_numeric(product_df["stock"], errors="coerce")
       product_df["rating"] = pd.to_numeric(product_df["rating"], errors="coerce")
       product_df["inventory_value"] = product_df["price"] * product_df["stock"]
   graph = build_link_graph(base_url, bs4_rows)
   graph_path = OUTPUT_DIR / "site_link_graph.graphml"
   if graph.number_of_nodes() > 0:
       nx.write_graphml(graph, graph_path)
   chunks = make_rag_chunks(all_rows)
   rag_path = OUTPUT_DIR / "rag_chunks.jsonl"
   with rag_path.open("w", encoding="utf-8") as f:
       for chunk in chunks:
           f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
   crawl_json_path = OUTPUT_DIR / "combined_crawl_results.json"
   crawl_json_path.write_text(
       json.dumps(all_rows, ensure_ascii=False, indent=2),
       encoding="utf-8",
   )
   product_csv_path = OUTPUT_DIR / "normalized_product_catalog.csv"
   if not product_df.empty:
       product_df.to_csv(product_csv_path, index=False)
   price_plot_path = OUTPUT_DIR / "product_price_chart.png"
   if not product_df.empty and product_df["price"].notna().any():
       plot_df = product_df.dropna(subset=["price"]).copy()
       plot_df["label"] = plot_df["sku"].fillna("unknown") + "\n" + plot_df["source"].fillna("")
       ax = plot_df.plot(
           kind="bar",
           x="label",
           y="price",
           legend=False,
           figsize=(11, 5),
           title="Extracted Product Prices by Source",
       )
       ax.set_xlabel("Product / extraction source")
       ax.set_ylabel("Price")
       plt.xticks(rotation=35, ha="right")
       plt.tight_layout()
       plt.savefig(price_plot_path, dpi=160)
       plt.show()
   graph_stats = {
       "nodes": graph.number_of_nodes(),
       "edges": graph.number_of_edges(),
       "weakly_connected_components": (
           nx.number_weakly_connected_components(graph)
           if graph.number_of_nodes()
           else 0
       ),
   }
   if graph.number_of_nodes() > 0:
       in_degrees = dict(graph.in_degree())
       out_degrees = dict(graph.out_degree())
       graph_stats["top_in_degree"] = sorted(
           in_degrees.items(),
           key=lambda x: x[1],
           reverse=True,
       )[:5]
       graph_stats["top_out_degree"] = sorted(
           out_degrees.items(),
           key=lambda x: x[1],
           reverse=True,
       )[:5]
   summary = {
       "base_url": base_url,
       "rows_total": len(all_rows),
       "beautifulsoup_rows": len(bs4_rows),
       "parsel_rows": len(parsel_rows),
       "playwright_rows": len(playwright_rows),
       "products_total": len(product_df),
       "rag_chunks_total": len(chunks),
       "graph": graph_stats,
       "outputs": {
           "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
           "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
           "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
           "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
           "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
           "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
           "combined_json": str(crawl_json_path),
           "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
           "rag_jsonl": str(rag_path),
           "graphml": str(graph_path) if graph_path.exists() else None,
           "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
           "screenshots_dir": str(SCREENSHOT_DIR),
       },
   }
   summary_path = OUTPUT_DIR / "run_summary.md"
   summary_path.write_text(
       "# Crawlee Python Advanced Tutorial Run Summary\n\n"
       f"- Local demo site: `{base_url}`\n"
       f"- Total extracted rows: `{summary['rows_total']}`\n"
       f"- BeautifulSoup rows: `{summary['beautifulsoup_rows']}`\n"
       f"- Parsel rows: `{summary['parsel_rows']}`\n"
       f"- Playwright rows: `{summary['playwright_rows']}`\n"
       f"- Normalized products: `{summary['products_total']}`\n"
       f"- RAG chunks: `{summary['rag_chunks_total']}`\n"
       f"- Link graph nodes: `{graph_stats['nodes']}`\n"
       f"- Link graph edges: `{graph_stats['edges']}`\n\n"
       "## Output files\n\n"
       + "\n".join(f"- `{k}`: `{v}`" for k, v in summary["outputs"].items())
       + "\n",
       encoding="utf-8",
   )
   print("\n=== 4) Analysis summary ===")
   print(json.dumps(summary, indent=2, ensure_ascii=False))
   try:
       from IPython.display import display, Markdown, Image as IPImage
       display(Markdown("## Crawlee crawl preview"))
       if not crawl_df.empty:
           preview_cols = [
               col for col in ["source", "page_type", "title", "url"]
               if col in crawl_df.columns
           ]
           display(crawl_df[preview_cols].head(12))
       display(Markdown("## Normalized product catalog"))
       if not product_df.empty:
           display(product_df.head(20))
       if price_plot_path.exists():
           display(Markdown("## Product price chart"))
           display(IPImage(filename=str(price_plot_path)))
       screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
       if screenshot_path.exists():
           display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
           display(IPImage(filename=str(screenshot_path)))
       display(Markdown(f"## Output directory\n`{OUTPUT_DIR}`"))
   except Exception as exc:
       print("Notebook display skipped:", repr(exc))
   return summary
async def main():
   httpd, base_url = start_local_server(SITE_DIR)
   print(f"\nLocal demo website is running at: {base_url}/index.html")
   try:
       bs4_rows = await run_beautifulsoup_crawl(base_url)
       parsel_rows = await run_parsel_precision_crawl(base_url)
       playwright_rows = await run_playwright_dynamic_crawl(base_url)
       summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
       return summary
   finally:
       httpd.shutdown()
       print("\nLocal demo server shut down.")
loop = asyncio.get_event_loop()
summary = loop.run_until_complete(main())
print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
   if file_path.is_file():
       print(" -", file_path)

We process the extracted crawl data into analysis-ready and AI-ready outputs. We split page text on sentence boundaries into RAG-style JSONL chunks of up to about 700 characters, combine all crawl results, build a normalized product catalog, generate a GraphML link graph, and visualize product prices with Matplotlib. Finally, we run the full pipeline end-to-end, display previews in the notebook, save all generated artifacts, and print the final output file paths.
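The sentence-packing idea behind the RAG chunks can be sketched in isolation. The variant below is an assumption-laden illustration, not the tutorial's make_rag_chunks: it adds a hard split for any single sentence longer than the limit, so no chunk exceeds max_chars. The names chunk_text and chunk_id are hypothetical.

```python
import hashlib
import re


def chunk_text(text, max_chars=700):
    """Greedily pack sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def chunk_id(url, text):
    """Short, deterministic ID derived from the URL and the chunk text."""
    return hashlib.sha1((url + text).encode("utf-8")).hexdigest()[:12]


# Tiny limit so the packing and the hard split are both visible.
demo = chunk_text("One. Two. " + "x" * 25, max_chars=10)
```

Because the ID is a hash of the URL and the chunk text, re-running the pipeline on unchanged pages yields the same IDs, which makes it easier to deduplicate or upsert chunks in a downstream vector store.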

Conclusion

In conclusion, we have built a complete Crawlee-based pipeline for crawling and data engineering that converts a small website into structured, reusable datasets. We used crawl scoping, robots.txt handling, concurrency settings, link enqueuing, browser rendering, key-value storage, and dataset exports to simulate patterns used in production web crawling systems. We normalized the extracted product data, saved the crawl outputs as JSON and CSV, created a GraphML link graph with NetworkX, generated JSONL chunks for retrieval-augmented generation workflows, and visualized the extracted product prices with Matplotlib.


Sana Hassan

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
