Building a Letterboxd Film & Review data pipeline: from raw scrape to first insight

When you need Letterboxd film and review data as a recurring feed, the gap between "got a few rows out" and "have a clean nightly dataset in the warehouse" is wider than it looks. Here is the pipeline I sketched out, with the decisions I made at each step.

Source survey

The scraper's listing describes it as covering films, ratings, cast & crew, genres, and user reviews from Letterboxd, a social film-discovery platform. For pipeline purposes, the relevant questions are: how stable is the source markup, what is the natural pagination unit, and how aggressively does it rate-limit. For this source the answer is "stable enough, list-based pagination, moderate rate-limiting" -- which makes it a good candidate for a daily incremental job rather than a streaming one.

Output schema

The actor I used emits records with these fields:

type -- type
filmSlug -- film slug
title -- title
year -- year
director -- director
cast -- cast
genres -- genres
runtime -- runtime
averageRating -- average rating
ratingsCount -- ratings count
language -- language
country -- country
synopsis -- synopsis
posterUrl -- poster url
filmUrl -- film url
embeddedReviewCount -- embedded review count
scrapedAt -- scraped at
reviews -- reviews

For warehouse ingestion I would keep this almost as-is. Promote filmSlug, the obvious identifier, to a primary key, cast scrapedAt to a native timestamp type, and stash deeply nested or free-text fields such as reviews and synopsis in a TEXT column rather than trying to normalise them.
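A minimal sketch of that staging step in Python, under stated assumptions: scrapedAt is taken to be an ISO 8601 string (the sample below does not show it), and the helper name to_staging_row and the NESTED_FIELDS tuple are hypothetical, not part of the actor.

```python
import json
from datetime import datetime

# Assumption: these are the nested fields worth storing as JSON text.
NESTED_FIELDS = ("director", "cast", "genres", "reviews")


def to_staging_row(record: dict) -> dict:
    """Shape one raw record for a staging table keyed on filmSlug."""
    if not record.get("filmSlug"):
        raise ValueError("record has no filmSlug, cannot assign a primary key")
    row = dict(record)  # shallow copy so the raw record stays untouched
    # Assumption: scrapedAt is ISO 8601. A trailing "Z" is rewritten because
    # datetime.fromisoformat only accepts it from Python 3.11 onwards.
    scraped_at = record.get("scrapedAt")
    if isinstance(scraped_at, str):
        row["scrapedAt"] = datetime.fromisoformat(scraped_at.replace("Z", "+00:00"))
    # Keep nested values as JSON text instead of normalising them.
    for field in NESTED_FIELDS:
        if row.get(field) is not None:
            row[field] = json.dumps(row[field], ensure_ascii=False)
    return row
```

Keeping the nested values as JSON text means they can still be unpacked later in SQL if a real need for normalisation shows up.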

Sample records

A peek at one raw row from a sample run (truncated):

{
  "type": "film",
  "filmSlug": "the-godfather",
  "title": "The Godfather",
  "year": "1972",
  "director": [
    "Francis Ford Coppola"
  ],
  "cast": [
    "Marlon Brando",
    "Al Pacino",
    "... (8 more)"
  ],
  "genres": [
    "Crime",
    "Drama"
  ],
  "runtime": "175 mins",
  "averageRating": 4.52,
  "ratingsCount": 2666451
}

The structure is mostly flat, with a handful of array fields, which is forgiving. You can drop this straight into a staging table with CREATE TABLE ... AS SELECT * FROM read_json_auto(...) in DuckDB, or pd.json_normalize(rows) in Python, and the downstream model layer needs little work.
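One wrinkle visible in the sample: year arrives as the string "1972" and runtime as "175 mins". A small coercion pass, sketched here with the standard library only, could run before or after loading. The function name coerce_film_types and the runtimeMinutes column are hypothetical, and the "N mins" format is assumed from this one record.

```python
import re


def coerce_film_types(record: dict) -> dict:
    """Return a copy with year as int and runtime parsed into minutes."""
    out = dict(record)
    year = out.get("year")
    if isinstance(year, str) and year.strip().isdigit():
        out["year"] = int(year)
    runtime = out.get("runtime")
    if isinstance(runtime, str):
        # Assumption: runtime looks like "175 mins"; take the first number.
        match = re.search(r"\d+", runtime)
        out["runtimeMinutes"] = int(match.group()) if match else None
    return out


film = coerce_film_types({"year": "1972", "runtime": "175 mins"})
print(film["year"], film["runtimeMinutes"])  # prints: 1972 175
```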

Pipeline stages

For community managers, trend researchers and brand-monitoring teams this is the rough shape I would build:

Extract: schedule the scraper to run every N hours, write the raw JSON to object storage partitioned by date.
Land: load the raw JSON into a staging table with minimal type coercion -- you want to be able to replay history without re-scraping.
Transform: dedupe on the natural key, enrich with reference data, surface a curated view for social listening, sentiment tracking, brand monitoring and content research.
Serve: expose a thin API or dashboard on the curated view. This is the layer your stakeholders actually touch.
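The Extract and Transform stages above can be sketched as two small helpers. Both are illustrative assumptions rather than anything the actor provides: the key layout under raw/letterboxd is made up, the natural key is taken to be (type, filmSlug), and scrapedAt values are assumed to be ISO 8601 strings in one consistent format so they compare correctly as text.

```python
from datetime import datetime, timezone


def raw_object_key(run_time: datetime, prefix: str = "raw/letterboxd") -> str:
    """Build a date-partitioned object key for one scrape run.

    Pass a timezone-aware datetime; it is converted to UTC first.
    """
    utc = run_time.astimezone(timezone.utc)
    return f"{prefix}/dt={utc:%Y-%m-%d}/run-{utc:%H%M%S}.json"


def dedupe_latest(records: list[dict]) -> list[dict]:
    """Keep the most recently scraped record per (type, filmSlug)."""
    latest: dict[tuple, dict] = {}
    for record in records:
        key = (record.get("type"), record["filmSlug"])
        kept = latest.get(key)
        # Assumption: same-format ISO 8601 strings sort chronologically.
        if kept is None or record["scrapedAt"] >= kept["scrapedAt"]:
            latest[key] = record
    return list(latest.values())
```

Because the raw files are partitioned by date and never rewritten, the dedup step can be re-run over any window of history without touching the source again.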

Operational considerations

Three things bite people on these pipelines: schema drift in the upstream source, duplicate records from overlapping scrape windows, and quietly failing runs. Wire up record-count assertions early -- a sudden 50% drop is almost always a sign that the site changed and your selectors need a refresh, not a real change in the underlying data.
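A minimal sketch of such checks, assuming you keep the previous run's record count somewhere the job can read it. The function names and the 0.5 default threshold are illustrative; tune the threshold to how noisy your runs are.

```python
def check_record_count(current: int, previous: int, max_drop: float = 0.5) -> None:
    """Raise if the record count fell by max_drop or more versus the last run."""
    if previous <= 0:
        return  # nothing to compare against on the first run
    drop = (previous - current) / previous
    if drop >= max_drop:
        raise RuntimeError(
            f"record count fell {drop:.0%} ({previous} -> {current}); "
            "check the scraper's selectors before trusting this run"
        )


def missing_fields(record: dict, expected: set[str]) -> set[str]:
    """Return expected field names absent from a record (a crude drift check)."""
    return expected - record.keys()
```

Failing the run loudly here is the point: a raised exception surfaces in the orchestrator, whereas a half-empty load would otherwise pass silently.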

Tooling choices

A few opinionated picks I would default to for this kind of pipeline: object storage (S3, GCS, R2) for the raw landing zone because it is cheap and replayable; a columnar warehouse (BigQuery, Snowflake, DuckDB if you are small) for the staging and curated layers because the analytical queries you will run over this dataset are pretty much exclusively column-scans; a tiny dbt or SQLMesh project for the transformations because version-controlled, tested SQL is much nicer to maintain than ad-hoc queries; and a workflow orchestrator (Airflow, Prefect, GitHub Actions on a cron) for scheduling. None of those are exotic choices, which is the point -- the boring stack is the right stack for a feed like this.

Verdict

For a single-source feed like Letterboxd Film & Review, the work is mostly in the staging and dedup logic. The extraction itself is a solved problem if you do not insist on rolling your own crawler. Once the data is landing reliably, the analytical layer is where you spend your time -- and that is the layer where the dataset actually pays for itself.

For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/letterboxd-film-review-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.
