A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics

MarkTechPost

Sana Hassan · 2026-06-15 · via MarkTechPost

In this tutorial, we explore the FineWeb dataset through an advanced hands-on workflow. We stream a manageable sample of the dataset without downloading the full multi-terabyte corpus, inspect its schema and metadata, and analyze key fields such as URL, language, language score, and token count. We also reproduce simplified versions of FineWeb’s quality-filtering pipeline, apply MinHash-based near-duplicate detection, verify token counts with the GPT-2 tokenizer, and generate useful analytics on domains, language scores, document lengths, and tokenizer efficiency.

import subprocess, sys
def pip(*pkgs):
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm")
import re, math, random, collections
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from datasets import load_dataset
random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 90)

We begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.

N_DOCS = 3000
print(f"Streaming {N_DOCS} docs from FineWeb sample-10BT ...")
stream = load_dataset(
   "HuggingFaceFW/fineweb",
   name="sample-10BT",
   split="train",
   streaming=True,
)
docs = []
for i, doc in enumerate(tqdm(stream, total=N_DOCS)):
   docs.append(doc)
   if i + 1 >= N_DOCS:
       break
df = pd.DataFrame(docs)
print("\nColumns:", list(df.columns))
print(df[["url", "language", "language_score", "token_count"]].head(5))
ex = docs[0]
print("\n--- Example record (fields) ---")
for k, v in ex.items():
   preview = (v[:120] + "…") if isinstance(v, str) and len(v) > 120 else v
   print(f"{k:>16}: {preview}")

We stream a fixed number of documents from the FineWeb sample-10BT subset without downloading the full dataset. We convert the streamed records into a DataFrame and inspect key metadata fields, including URL, language, language score, and token count. We also print every field of one example record, truncating long strings to 120 characters, to better understand the dataset’s structure.
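The bounded read above can also be written with `itertools.islice`, which stops pulling from any iterable after a fixed number of items. The sketch below is a minimal illustration, not the tutorial's own code: `fake_stream` and `take` are hypothetical names, and the generator stands in for a streaming dataset so the example runs offline.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields records forever."""
    i = 0
    while True:
        yield {"id": i, "text": f"document number {i}"}
        i += 1

def take(iterable, n):
    """Return the first n items without consuming the rest of the iterable."""
    return list(islice(iterable, n))

# Only five records are ever produced, even though the stream is unbounded.
sample = take(fake_stream(), 5)
print(len(sample), sample[-1]["id"])
```

Because `islice` is lazy, the same helper should work on a real streaming dataset object, where only the requested records are fetched. Streaming datasets in the `datasets` library also expose a `take(n)` method that serves a similar purpose.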

WORD = re.compile(r"\b\w+\b")
def gopher_quality(text):
   words = WORD.findall(text)
   n = len(words)
   if n < 50 or n > 100_000:
       return False, "word_count_out_of_range"
   mean_len = sum(len(w) for w in words) / n
   if mean_len < 3 or mean_len > 10:
       return False, "bad_mean_word_length"
   if (text.count("#") + text.count("...")) / n > 0.1:
       return False, "too_many_symbols"
   lines = text.split("\n")
   if lines and sum(l.lstrip().startswith(("•", "-", "*")) for l in lines) / len(lines) > 0.9:
       return False, "mostly_bullets"
   stops = {"the", "be", "to", "of", "and", "that", "have", "with"}
   if len(stops & {w.lower() for w in words}) < 2:
       return False, "too_few_stopwords"
   return True, "ok"
def c4_quality(text):
   lines = [l for l in text.split("\n") if l.strip()]
   if not lines:
       return False, "empty"
   low = text.lower()
   for bad in ("lorem ipsum", "javascript is disabled"):
       if bad in low:
           return False, f"boilerplate:{bad}"
   if text.count("{") > 0 and text.count("{") / max(len(lines), 1) > 0.5:
       return False, "too_many_braces"
   return True, "ok"
def fineweb_custom(text):
   lines = [l.strip() for l in text.split("\n") if l.strip()]
   if not lines:
       return False, "empty"
   dup_frac = 1 - len(set(lines)) / len(lines)
   if dup_frac > 0.3:
       return False, "duplicated_lines"
   short_frac = sum(len(l) < 30 for l in lines) / len(lines)
   if short_frac > 0.67 and len(lines) > 5:
       return False, "list_like"
   return True, "ok"
results = []
for d in docs:
   t = d["text"]
   g_ok, g_r = gopher_quality(t)
   c_ok, c_r = c4_quality(t)
   f_ok, f_r = fineweb_custom(t)
   reason = "kept" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)
   results.append(reason)
filter_summary = pd.Series(results).value_counts()
print("\n--- Quality-filter outcomes on already-clean FineWeb data ---")
print("(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)")
print(filter_summary)

We recreate simplified versions of FineWeb’s quality filters using Gopher-style, C4-style, and custom text-cleaning heuristics. We check each document for issues such as abnormal word counts, poor word statistics, boilerplate text, repeated lines, and list-like structure. We summarize how many documents pass or fail these filters to understand the quality of the already-cleaned FineWeb sample.
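To make the line-level heuristics concrete, the following sketch computes the same two ratios used in `fineweb_custom` on a toy document. The helper name `line_stats` and the sample texts are hypothetical, chosen only to show how a navigation-menu-like page differs from ordinary prose.

```python
def line_stats(text):
    """Return (duplicated-line fraction, short-line fraction) for a document."""
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return 0.0, 0.0
    # Share of lines that repeat an earlier line.
    dup_frac = 1 - len(set(lines)) / len(lines)
    # Share of lines shorter than 30 characters.
    short_frac = sum(len(l) < 30 for l in lines) / len(lines)
    return dup_frac, short_frac

# Toy "menu" page: eight short lines, three of them repeats.
menu = "\n".join(["Home", "About", "Contact", "Home", "About", "Contact", "Login", "Sign up"])
dup_frac, short_frac = line_stats(menu)
print(f"menu : dup_frac={dup_frac:.3f}, short_frac={short_frac:.3f}")

# Toy prose: two distinct lines, both longer than 30 characters.
prose = (
    "This is a reasonably long sentence that easily exceeds thirty characters.\n"
    "Here is another distinct line that is also well over the short-line limit."
)
p_dup, p_short = line_stats(prose)
print(f"prose: dup_frac={p_dup:.3f}, short_frac={p_short:.3f}")
```

The menu has 5 unique lines out of 8, so its duplicated-line fraction is 0.375 and its short-line fraction is 1.0. With the thresholds used in this tutorial (0.3 and 0.67), it would be rejected, while the prose sample scores 0.0 on both ratios and passes.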

from datasketch import MinHash, MinHashLSH
def shingles(text, k=5):
   toks = WORD.findall(text.lower())
   return {" ".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}
NUM_PERM = 128
THRESHOLD = 0.7
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
minhashes = {}
for idx, d in enumerate(tqdm(docs, desc="MinHashing")):
   m = MinHash(num_perm=NUM_PERM)
   for s in shingles(d["text"]):
       m.update(s.encode("utf8"))
   minhashes[idx] = m
   lsh.insert(str(idx), m)
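dup_pairs = set()
for idx, m in minhashes.items():
   for cand in lsh.query(m):
       c = int(cand)
       # LSH only proposes candidates; confirm each with the estimated Jaccard.
       if c != idx and m.jaccard(minhashes[c]) >= THRESHOLD:
           dup_pairs.add(tuple(sorted((idx, c))))
print(f"\nFound {len(dup_pairs)} near-duplicate pairs (estimated Jaccard ≥ {THRESHOLD}).")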
if dup_pairs:
   a, b = next(iter(dup_pairs))
   j = minhashes[a].jaccard(minhashes[b])
   print(f"Example pair (estimated Jaccard ≈ {j:.2f}):")
   print("  DOC A:", docs[a]["text"][:160].replace("\n", " "), "…")
   print("  DOC B:", docs[b]["text"][:160].replace("\n", " "), "…")
else:
   print("No near-dupes in this slice — expected, since FineWeb is dedup'd per crawl.")

We implement MinHash-based near-duplicate detection to approximate how large web corpora identify repeated or highly similar documents. We convert each document into word shingles, generate MinHash signatures, and index them with Locality Sensitive Hashing. We then search for near-duplicate document pairs and inspect an example if any similar texts are found.
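Candidates returned by an LSH index are probabilistic, and the standard banding analysis shows why. If a signature is split into b bands of r rows each, two documents with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula under an assumed split of 16 bands by 8 rows for 128 permutations. This split is purely illustrative: datasketch chooses its own banding internally, and the function name is hypothetical.

```python
def candidate_probability(s, bands, rows):
    """Chance that two documents with Jaccard similarity s share at least one LSH band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Hypothetical banding for 128 permutations: 16 bands x 8 rows.
# datasketch picks its own split internally; these values are for illustration only.
BANDS, ROWS = 16, 8

# Common approximation of the similarity where the S-curve turns upward.
approx_threshold = (1.0 / BANDS) ** (1.0 / ROWS)
print(f"approximate threshold: {approx_threshold:.3f}")

for s in (0.3, 0.5, 0.7, 0.9):
    p = candidate_probability(s, BANDS, ROWS)
    print(f"Jaccard {s:.1f} -> candidate probability {p:.4f}")
```

Under these assumed parameters the approximate threshold is about 0.707, close to the 0.7 used in the tutorial. The curve is steep but not a hard step, so some pairs below the threshold can still surface as candidates and some pairs above it can be missed.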

import tiktoken
enc = tiktoken.get_encoding("gpt2")
check = docs[:200]
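# encode_ordinary treats special-token strings such as "<|endoftext|>" as plain text instead of raising.
recomputed = [len(enc.encode_ordinary(d["text"])) for d in tqdm(check, desc="Tokenizing")]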
stored = [d["token_count"] for d in check]
diffs = np.array(recomputed) - np.array(stored)
print(f"\n--- Verifying token_count field (gpt2) on 200 docs ---")
print(f"Mean abs diff vs stored token_count: {np.abs(diffs).mean():.2f} tokens")
print(f"Exact matches: {(diffs == 0).mean()*100:.0f}%   (small drift = tokenizer version)")
df["chars_per_token"] = df["text"].str.len() / df["token_count"].clip(lower=1)
print(f"Avg characters per token: {df['chars_per_token'].mean():.2f}")

We verify the dataset’s token_count field by recomputing GPT-2 token counts with the tiktoken library’s gpt2 encoding. We compare the recomputed token counts with the stored values and measure the average difference between them. We also calculate characters per token to understand tokenizer efficiency across the sampled documents.
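One detail worth keeping in mind is that the average of per-document characters-per-token ratios is not the same as the corpus-level ratio, because long documents carry more weight in the latter. The sketch below illustrates the difference with made-up (characters, tokens) pairs. The numbers are hypothetical and not drawn from FineWeb.

```python
# Hypothetical (characters, tokens) pairs for three documents.
toy_docs = [(400, 100), (90, 30), (5000, 1000)]

# Per-document ratio, then averaged: every document counts equally.
per_doc = [chars / max(tokens, 1) for chars, tokens in toy_docs]
mean_of_ratios = sum(per_doc) / len(per_doc)

# Corpus-level ratio: total characters over total tokens, weighted by length.
total_chars = sum(chars for chars, _ in toy_docs)
total_tokens = sum(tokens for _, tokens in toy_docs)
corpus_ratio = total_chars / total_tokens

print(f"mean of per-document ratios: {mean_of_ratios:.2f}")
print(f"corpus-level ratio         : {corpus_ratio:.2f}")
```

Here the per-document mean is 4.0, while the corpus-level ratio is about 4.86 because the longest document dominates the totals. The tutorial reports the per-document mean. The corpus-level figure is arguably the better guide when estimating how many tokens a given volume of text will produce.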

# Strip only a leading "www." so hosts that merely contain that substring are left intact.
df["domain"] = df["url"].apply(lambda u: re.sub(r"^www\.", "", urlparse(u).netloc) if isinstance(u, str) else "?")
top_domains = df["domain"].value_counts().head(15)
print("\n--- Top 15 domains in sample ---")
print(top_domains)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes[0, 0].hist(df["token_count"].clip(upper=4000), bins=50, color="#7b2d26")
axes[0, 0].set_title("Token count per document (gpt2)")
axes[0, 0].set_xlabel("tokens"); axes[0, 0].set_ylabel("docs")
axes[0, 1].hist(df["language_score"], bins=40, color="#2d5d7b")
axes[0, 1].axvline(0.65, color="red", ls="--", label="FineWeb cutoff 0.65")
axes[0, 1].set_title("fastText English language score")
axes[0, 1].set_xlabel("score"); axes[0, 1].legend()
axes[1, 0].hist(df["chars_per_token"].clip(upper=8), bins=40, color="#3f7b2d")
axes[1, 0].set_title("Characters per token (compression)")
axes[1, 0].set_xlabel("chars / token")
top_domains.iloc[::-1].plot(kind="barh", ax=axes[1, 1], color="#7b5d2d")
axes[1, 1].set_title("Top domains")
plt.tight_layout()
plt.show()
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
print(f"Docs streamed          : {len(df):,}")
print(f"Total gpt2 tokens       : {df['token_count'].sum():,}")
print(f"Median tokens/doc       : {int(df['token_count'].median())}")
print(f"Unique domains          : {df['domain'].nunique():,}")
print(f"Mean language_score     : {df['language_score'].mean():.3f}")
print(f"Near-duplicate pairs    : {len(dup_pairs)}")
print(f"Docs flagged by filters : {(pd.Series(results) != 'kept').sum()} / {len(results)}")
print("\nNext steps:")
print("  • Swap name='sample-10BT' for a real crawl, e.g. name='CC-MAIN-2024-10'")
print("  • Raise N_DOCS for stronger statistics")
print("  • Use the full datatrove pipeline to reproduce FineWeb end-to-end")

We extract domain names from URLs and identify the most frequent domains present in the FineWeb sample. We create visualizations for token count distribution, language score distribution, characters per token, and top domains. We finish by printing a compact summary of streamed documents, total tokens, median length, unique domains, language quality, duplicate count, and filter results.
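Domain counts are sensitive to how hosts are normalized. As a minimal sketch, the helper below uses `urlparse(...).hostname`, which lowercases the host and drops any port, and removes only a leading `www.` prefix. The function name `normalize_domain` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def normalize_domain(url):
    """Return a lowercase host without port or leading 'www.', or '?' if unavailable."""
    if not isinstance(url, str):
        return "?"
    host = urlparse(url).hostname  # lowercased, port removed, None if missing
    if not host:
        return "?"
    if host.startswith("www."):
        host = host[len("www."):]
    return host

# Hypothetical URLs covering mixed case, a port, a look-alike prefix, and a missing value.
urls = [
    "https://www.Example.com:8080/a",
    "http://example.com/b",
    "https://awww.example.org/",
    None,
]
domains = [normalize_domain(u) for u in urls]
top = Counter(domains).most_common(1)
print(domains)
print(top)
```

With this normalization the first two URLs collapse to the same domain, while `awww.example.org` keeps its full host because the prefix is only stripped when it appears at the start.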

In conclusion, we developed a practical understanding of how large-scale web datasets such as FineWeb are explored, filtered, deduplicated, and analyzed for language model training. We worked efficiently with streaming data, tested quality heuristics on real documents, searched for near-duplicate text, and validated token-level metadata using a production-style tokenizer. The same workflow can be scaled to larger FineWeb crawls, extended for deeper corpus analysis, and used as a starting point for designing high-quality preprocessing pipelines for LLM dataset preparation.


Sana Hassan

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
