Product Experimentation for Collaborative AI Features: Cluster Randomization for LLM-Based Tools in Python
Rudrendu Paul · 2026-05-23 · via freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

Every product experimentation team running causal inference on LLM-based collaborative features eventually hits the same wall: your users aren't independent. Your team ships an AI meeting summarizer to half the enterprise accounts on your platform. The rollout's clean, half on and half off, and you wait for the control group's task completion to stay flat while the treated group's creeps up. Two weeks in, the control group's numbers are moving too. Not as much, but visibly. The feature's confirmed off for those accounts, and you've checked the rollout config twice. Something's still contaminating your control.

You know what it is before you dig into the logs. The AI meeting summaries land in shared Slack channels, the AI-drafted docs show up in shared Google Drive folders, and the AI code review suggestions appear in pull requests that both treated and control engineers read. Behavior changes for the treated users, and a slice of that behavior bleeds back into your control group through the collaboration graph.

This is the collaborator contamination trap. It shows up in every generative AI product that touches shared artifacts: AI meeting notes that teammates read, AI-drafted documents that coworkers edit, AI code suggestions that reviewers evaluate, AI-generated email threads that the whole team replies to. User-level randomization assumes one user's treatment assignment leaves every other user's outcome alone. In a collaborative workspace, that assumption is wrong by design, and the product experiment folds the feature's real effect together with the spillover it creates inside the control group.

Running a collaborative AI feature behind a user-level A/B test is a product experiment that violates the Stable Unit Treatment Value Assumption (SUTVA). The fix is cluster randomization: flip the coin at the workspace level, so entire teams are in or out together, then model the cross-workspace spillover directly.

This tutorial walks through the full pipeline (cluster assignment, a biased, naive user-level OLS, cluster-weighted least squares for honest standard errors, a two-exposure decomposition that identifies direct and spillover effects separately, and cluster-bootstrap confidence intervals) on a 50,000-user synthetic SaaS dataset in which the ground-truth causal effects are known. You'll estimate them, quantify uncertainty, and see where the approach silently breaks.

Companion code: every code block runs end-to-end in the companion notebook at github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/05_cluster_randomization. The notebook (cluster_randomization_demo.ipynb) has all outputs pre-executed, so you can read along on GitHub before running anything locally.


Why User-Level A/B Randomization Breaks Under Collaboration

The math of an A/B test is elegant because one user's treatment assignment has no bearing on another user's outcome. Flip a coin; half your users get the AI feature, and the coin flip breaks every possible confound by construction. Collaboration breaks that guarantee in three ways.

Shared artifacts travel. The AI summary lands in a channel every teammate reads, the AI-drafted doc goes into a folder every teammate edits, and the AI code review suggestion sits on a pull request every reviewer evaluates. Control users consume those artifacts, whether or not the feature is switched on for them, and the behavioral effects of reading AI-assisted content leak into their outcomes.

Shared workflows create interference. A treated user who relies on the AI summarizer writes shorter follow-up notes, assuming teammates have read the summary. A control user on the same team receives those shorter notes and spends less time reading them, which changes their session length. That means the treated user's assignment has shifted the control user's outcome, which is exactly what SUTVA forbids.

Network adoption follows collaboration. Power users on treated teams experiment with the feature first, then nudge teammates in other workspaces through cross-team channels. If your treated group produces AI-assisted content that your control group reads and copies, the control group is partially treated without ever flipping a switch.

All three mechanisms produce the same symptom: the raw user-level comparison understates the feature's direct effect because the control group is no longer a pure counterfactual. On the synthetic dataset in this tutorial, the ground-truth direct effect is +0.80 min of session time for treated users, and the ground-truth spillover effect is +0.20 min for control users who collaborate across workspaces. A naive user-level OLS recovers +0.6723, a 16 percent underestimate of the direct effect, and reports a standard error that is roughly 19 times too small because it treats 50,000 users as independent, even though the treatment was randomized only across 50 clusters. That's not a small error. It's the kind that ships a broken feature launch decision.

What Cluster Randomization Actually Does

Cluster randomization flips the assignment coin at the workspace level so entire teams land in the same arm, confining most interference to where it belongs and making the residual cross-workspace leakage something you can model directly.


Figure 1 (image above): Schematic of the SUTVA violation that cluster randomization targets. Every user in a treated workspace (top row, red) sees the AI feature. Every user in a control workspace (bottom row) should see nothing, but collaborators (orange) read AI artifacts that travel through shared Slack, documents, and code reviews. Those spillover-exposed users are partially treated. Cluster randomization doesn't make interference disappear; it confines it to within workspace boundaries, leaving the remaining cross-workspace leakage as an identifiable component that a two-exposure model can estimate directly.

If a workspace is treated, every user inside it gets the feature. If it's a control workspace, nobody inside it does. Interference within a workspace is fine because all teammates share the same assignment, and the workspace-level mean captures the full treatment package. The design aims to control interference across workspaces.
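The workspace-level coin flip described above can be sketched in a few lines. This is a minimal illustration, not the companion repo's code: the function name `assign_clusters`, the seed, and the toy user-to-workspace map are all hypothetical. The tutorial's own Step 1 simply treats workspace IDs below 25, which is fine for a simulation where IDs are arbitrary labels; in production you'd want a logged random draw along these lines.

```python
import numpy as np

def assign_clusters(cluster_ids, n_treated, seed):
    """Randomly pick n_treated clusters for the treated arm (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    treated = rng.choice(cluster_ids, size=n_treated, replace=False)
    return set(treated.tolist())

# Assumed setup: 50 workspaces, 25 of them treated.
workspace_ids = np.arange(50)
treated_ws = assign_clusters(workspace_ids, n_treated=25, seed=2026)

# Every user inherits the assignment of their workspace, so teammates
# always land in the same arm.
user_workspace = np.array([0, 0, 17, 17, 42, 42])  # toy user -> workspace map
user_treated = np.isin(user_workspace, list(treated_ws)).astype(int)
print(user_treated)
```

Storing the seed and the resulting set of treated workspaces alongside the experiment config makes the assignment auditable later.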

The estimator works under a stack of assumptions, and each one has a name worth knowing because the failure modes at the end of this tutorial map directly to specific violations.

  • Cluster-level random assignment. Treatment is assigned at the cluster level by a genuinely random mechanism. Which workspaces land in the treated arm is independent of workspace-level potential outcomes.

  • Partial interference. Interference happens inside clusters but not across them (Hudgens et al.). A treated user in workspace A can affect her teammate in workspace A, but can't affect a user in workspace B. This is the assumption cluster randomization is built around.

  • Cluster-level SUTVA. A workspace's treatment is a single, well-defined package. There's one version of the feature, and within-cluster heterogeneity in exposure is absorbed into the cluster-level effect.

  • Exchangeability of clusters. Before the coin flip, the treated and control workspaces are exchangeable. Randomization achieves this by construction.

  • Sufficient cluster count. Cluster-robust inference relies on a central limit theorem across clusters. Practitioners often use K ≥ 30 as a working floor, though the appropriate threshold depends on cluster-size heterogeneity and the choice of test statistic. Fewer clusters demand a different inference tool, such as randomization inference or a cluster wild bootstrap.

Partial interference is the load-bearing assumption here. The whole point of cluster randomization is that cross-cluster spillover is smaller and slower than within-cluster spillover, so treating an entire team contains most of the interference where it's supposed to be (Ugander et al.). When cross-cluster spillover is meaningful, a two-exposure model directly identifies and estimates that leakage.
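The cluster-count assumption above mentions randomization inference as the fallback when K is small. A minimal sketch of that idea follows, assuming an unweighted difference in cluster means as the test statistic; the function name, the 12-cluster toy data, and the permutation count are illustrative choices, not part of the tutorial's pipeline.

```python
import numpy as np

def randomization_p_value(cluster_means, treated, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in cluster means."""
    rng = np.random.default_rng(seed)
    y = np.asarray(cluster_means, dtype=float)
    z = np.asarray(treated, dtype=bool)
    observed = y[z].mean() - y[~z].mean()
    hits = 0
    for _ in range(n_perm):
        z_perm = rng.permutation(z)  # re-draw the assignment, keeping arm sizes fixed
        diff = y[z_perm].mean() - y[~z_perm].mean()
        hits += abs(diff) >= abs(observed)
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 12 clusters, with a true shift of 1.0 on the treated half.
rng = np.random.default_rng(1)
treated = np.array([1] * 6 + [0] * 6)
means = rng.normal(5.0, 0.3, size=12) + 1.0 * treated
obs, p = randomization_p_value(means, treated)
print(f"observed contrast = {obs:+.3f}, permutation p = {p:.4f}")
```

Because the reference distribution comes from re-running the assignment mechanism itself, the p-value doesn't lean on a central limit theorem across clusters.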

Prerequisites

You'll need Python 3.11 or newer, comfort with pandas and linear regression, and rough familiarity with ordinary least squares.

Install the packages for this tutorial:

pip install numpy pandas statsmodels scipy matplotlib

Here's what's happening: five packages cover the full pipeline. Pandas loads the data and builds the cluster assignment. NumPy handles array arithmetic and bootstrap draws. Statsmodels fits every regression: naive OLS, cluster-weighted least squares, and the two-exposure model with cluster-robust standard errors. Scipy supports the kernel density diagnostic plot, and matplotlib renders it.

Clone the companion repo to get the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here's what's happening: the clone pulls the companion repo, and generate_data.py produces the shared 50,000-user dataset used across the series. Seed 42 keeps the data reproducible. The 50,000-user scale puts about 1,000 users in each workspace, so every workspace mean is estimated precisely and the cluster-level inference rests on stable cluster means. The output CSV lands at data/synthetic_llm_logs.csv.

Setting up the Working Example

The synthetic dataset simulates a SaaS product with 50,000 users spread across 50 workspaces. The collaborative AI feature ships at full coverage to 25 randomly selected workspaces and stays off for the other 25.

A control user is spillover-exposed when they collaborate across workspaces. In this tutorial, opt_in_agent_mode == 1 serves as a behavioral proxy for that cross-workspace activity: users who actively opt into AI tooling are the ones reading teammate-authored documents, Slack threads, and pull requests where treated-workspace AI output surfaces. In a production deployment, you'd replace this proxy with an observed collaboration graph such as shared-channel membership, doc co-authorship, or reviewer overlap. Because opt_in_agent_mode reflects a voluntary behavioral choice with no random component, the spillover coefficient in a real experiment would absorb selection differences between opting-in and non-opting-in control users. A production spillover flag should be grounded in the observed collaboration graph; behavioral proxies introduce selection bias that the two-exposure model can't correct.
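To make the collaboration-graph alternative concrete, here's a minimal sketch of deriving the spillover flag from an observed edge list in place of opt_in_agent_mode. The `edges` table and its `src_user` / `dst_user` columns are hypothetical; a real deployment would build them from shared-channel membership, doc co-authorship, or reviewer overlap.

```python
import pandas as pd

# Hypothetical inputs: one row per user, one row per observed collaboration edge.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "workspace_id": [0, 0, 30, 30, 31],
    "treated_workspace": [1, 1, 0, 0, 0],
})
edges = pd.DataFrame({  # e.g. shared channel, co-authored doc, PR review
    "src_user": [1, 2, 4],
    "dst_user": [3, 2, 5],
})

treated_ids = set(users.loc[users["treated_workspace"] == 1, "user_id"])

# Treat edges as undirected: a user is exposed if any neighbor is treated.
both_ways = pd.concat([
    edges.rename(columns={"src_user": "user_id", "dst_user": "neighbor"}),
    edges.rename(columns={"dst_user": "user_id", "src_user": "neighbor"}),
], ignore_index=True)
exposed_ids = set(both_ways.loc[both_ways["neighbor"].isin(treated_ids), "user_id"])

# Only control users can be spillover-exposed; treated users are directly exposed.
users["spillover_exposed"] = (
    (users["treated_workspace"] == 0) & users["user_id"].isin(exposed_ids)
).astype(int)
print(users[["user_id", "treated_workspace", "spillover_exposed"]])
```

In this toy graph only user 3 is flagged: they sit in a control workspace and share an edge with treated user 1, while users 4 and 5 only collaborate with each other.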

This tutorial constructs session_minutes_obs from scratch by layering known ground-truth effects onto workspace-level baselines. The CSV's session_minutes column is intentionally set aside. That separation lets you verify that every estimator recovers the effects baked in.

The ground-truth effects baked into the scenario are a +0.80-minute direct effect on treated users and a +0.20-minute spillover effect on spillover-exposed control users. Knowing both values is what lets you verify that your estimator recovers them.

Step 1: Build the Cluster Assignment and Spillover Exposure

The first code block loads the data, assigns workspaces to treatment at the cluster level, flags spillover-exposed users, and constructs an observed outcome where the ground truth is known. The outcome starts from a workspace-level baseline so within-workspace correlation is genuine. It then adds the direct effect for treated users, the spillover effect for exposed control users, and Gaussian noise.

import numpy as np
import pandas as pd

DIRECT_EFFECT = 0.80
SPILLOVER_EFFECT = 0.20
DATA_SEED = 42
OUTCOME_NOISE_SD = 0.30

df = pd.read_csv("data/synthetic_llm_logs.csv")
rng = np.random.default_rng(DATA_SEED)

df["treated_workspace"] = (df["workspace_id"] < 25).astype(int)
df["treated_user"] = df["treated_workspace"]
df["spillover_exposed"] = (
    (df["treated_workspace"] == 0) & (df["opt_in_agent_mode"] == 1)
).astype(int)

ws_baseline = pd.DataFrame({
    "workspace_id": np.arange(50),
    "ws_baseline": rng.normal(5.0, 0.30, size=50),
})
df = df.merge(ws_baseline, on="workspace_id")
noise = rng.normal(0, OUTCOME_NOISE_SD, size=len(df))
df["session_minutes_obs"] = (
    df["ws_baseline"]
    + DIRECT_EFFECT * df["treated_user"]
    + SPILLOVER_EFFECT * df["spillover_exposed"]
    + noise
)
df["exposure"] = np.select(
    [df["treated_user"] == 1, df["spillover_exposed"] == 1],
    ["direct", "spillover"],
    default="pure_control",
)

print(f"Total users:             {len(df):,}")
print(f"Treated workspaces:      {df[df.treated_workspace == 1].workspace_id.nunique()}")
print(f"Control workspaces:      {df[df.treated_workspace == 0].workspace_id.nunique()}")
print(f"Treated users:           {df.treated_user.sum():,}")
print(f"Pure-control users:      {(df.exposure == 'pure_control').sum():,}")
print(f"Spillover-exposed users: {(df.exposure == 'spillover').sum():,}")
ws_sizes = df.groupby("workspace_id").size()
print(f"Workspace size: min={ws_sizes.min()} median={int(ws_sizes.median())} max={ws_sizes.max()}")

Expected output:

Total users:             50,000
Treated workspaces:      25
Control workspaces:      25
Treated users:           24,937
Pure-control users:      18,319
Spillover-exposed users: 6,744
Workspace size: min=923 median=1002 max=1052

Here's what's happening: workspace IDs 0 through 24 form the treated arm and 25 through 49 form the control arm, giving you 24,937 treated users and 25,063 control users. Among the controls, 6,744 are flagged as spillover-exposed because they opted into agent mode and sit in a control workspace where they'd plausibly read treated-workspace output through cross-team channels. The remaining 18,319 are pure-control users, untouched by the feature. Workspace sizes range from 923 to 1,052 users, which is close enough to balanced that cluster-weighted and unweighted estimators will behave similarly. The observed outcome session_minutes_obs encodes the known ground truth: a treated user adds 0.80 min to their workspace baseline, a spillover-exposed user adds 0.20 min, and every user is subject to Gaussian noise with standard deviation 0.30 min.


Figure 2 (image above): The three exposure groups on the 50,000-user dataset. The top panel shows the observed-outcome distribution for each group, with dashed vertical lines at the group means (5.06 min pure control, 5.27 min spillover-exposed, 5.79 min treated). The spillover distribution sits between the pure-control and treated distributions, which is the contamination a naive user-level estimator would fold into the control baseline. The bottom panel translates the same groups into raw counts: 18,319 pure-control users, 6,744 spillover-exposed control users, and 24,937 treated users. Where Figure 1 schematically showed the SUTVA violation, this figure shows it at the data scale, and the three-group structure is exactly what Step 4's two-exposure model will identify.

Step 2: Naive User-Level OLS (Biased and Overconfident)

The naive analysis ignores clustering entirely and regresses the observed outcome on each user's treatment assignment, reporting a standard error as if every user were an independent draw. Two things go wrong at once.

import statsmodels.formula.api as smf

naive = smf.ols("session_minutes_obs ~ treated_user", data=df).fit()
print(f"Naive estimate:  {naive.params['treated_user']:+.4f} min")
print(f"Naive SE:        {naive.bse['treated_user']:.4f}  (under-reported)")
ci = naive.conf_int().loc["treated_user"].tolist()
print(f"Naive 95% CI:    [{ci[0]:+.4f}, {ci[1]:+.4f}]")
print(f"Ground truth:    +0.80")
print(f"Bias:            {naive.params['treated_user'] - 0.80:+.4f} min")

Expected output:

Naive estimate:  +0.6723 min
Naive SE:        0.0034  (under-reported)
Naive 95% CI:    [+0.6656, +0.6790]
Ground truth:    +0.80
Bias:            -0.1277 min

Here's what's happening: the point estimate lands at +0.6723, 16 percent below the ground-truth direct effect of +0.80. The bias has two components. First, spillover contamination: 6,744 control users who read treated-workspace output lie above the pure-control baseline, raising the control mean and compressing the naive treated-minus-control gap. Second, workspace baseline imbalance: with only 50 clusters, random assignment doesn't guarantee that treated and control workspace pools draw equal mean baselines. This dataset's specific seed produces a treated-pool baseline slightly below the control-pool baseline, adding additional downward pressure on the estimate. The lesson generalizes: at small K, balance checks on observable workspace characteristics before the experiment are the only defense against pre-existing between-arm differences that no standard-error correction can fix.
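A pre-experiment balance check of the kind mentioned above could look like the following sketch. It computes a standardized mean difference on a workspace-level covariate; the pre-period means are simulated here, and the function name is hypothetical. How large a difference is too large is a judgment call that depends on K.

```python
import numpy as np

def standardized_mean_difference(x, treated):
    """SMD of a cluster-level covariate between arms, on the pooled-SD scale."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(treated, dtype=bool)
    pooled_sd = np.sqrt((x[z].var(ddof=1) + x[~z].var(ddof=1)) / 2)
    return (x[z].mean() - x[~z].mean()) / pooled_sd

# Assumed pre-period workspace means (minutes) for 50 workspaces.
rng = np.random.default_rng(0)
pre_period_mean = rng.normal(5.0, 0.30, size=50)
treated = (np.arange(50) < 25).astype(int)

smd = standardized_mean_difference(pre_period_mean, treated)
print(f"Pre-period SMD across arms: {smd:+.3f}")
```

If the check flags an imbalance before launch, re-drawing the assignment or stratifying the draw on the pre-period covariate are the usual remedies.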

The standard error is the more alarming number. At 0.0034, it reflects variation across 50,000 users treated as independent observations, and the resulting 95% confidence interval [+0.6656, +0.6790] excludes the ground truth entirely, at roughly one-twentieth the width the design actually supports. An SE 19 times too small inflates the t-statistic by the same factor, making the naive regression's p-value appear orders of magnitude more significant than the design justifies. A stakeholder reading this report would walk away confident that the direct effect is somewhere near 0.67 min. Wrong number, wrong precision.
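A back-of-envelope way to see where an inflation factor of this size comes from is the design effect for equal-sized clusters, 1 + (m - 1) x ICC, where m is the cluster size and ICC is the intraclass correlation. The sketch below plugs in the nominal simulation parameters from Step 1 and an assumed cluster size of 1,000; it's an approximation, not a reproduction of the notebook's numbers.

```python
import math

def design_effect(icc, cluster_size):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc

between_sd = 0.30  # workspace baseline SD used in Step 1
within_sd = 0.30   # OUTCOME_NOISE_SD used in Step 1
icc = between_sd**2 / (between_sd**2 + within_sd**2)
deff = design_effect(icc, cluster_size=1000)
se_inflation = math.sqrt(deff)
print(f"ICC = {icc:.2f}, design effect = {deff:.1f}, SE inflation = {se_inflation:.1f}x")
```

The nominal parameters imply an inflation of about 22x. The ratio observed in this dataset (about 19) depends on the realized baseline draws for the 50 workspaces, so the two figures agree in order of magnitude without matching exactly.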

Step 3: Cluster-Weighted Least Squares (Honest Standard Error)

The fix for the standard error is to aggregate to 50 workspace means, then regress those means on the workspace-level treatment indicator weighted by workspace size. Inference is now based on K = 50 observations.

import statsmodels.api as sm

ws = (
    df.groupby("workspace_id")
    .agg(ws_mean=("session_minutes_obs", "mean"),
         ws_size=("user_id", "count"),
         treated=("treated_workspace", "max"))
    .reset_index()
)
X_ws = sm.add_constant(ws["treated"])
wls = sm.WLS(ws["ws_mean"], X_ws, weights=ws["ws_size"]).fit()
wls_ci = wls.conf_int().loc["treated"].tolist()
print(f"WLS cluster-mean contrast: {wls.params['treated']:+.4f} min")
print(f"WLS SE:                    {wls.bse['treated']:.4f}  (based on K=50 clusters)")
print(f"WLS 95% CI:                [{wls_ci[0]:+.4f}, {wls_ci[1]:+.4f}]")

Expected output:

WLS cluster-mean contrast: +0.6723 min
WLS SE:                    0.0652  (based on K=50 clusters)
WLS 95% CI:                [+0.5412, +0.8035]

Here's what's happening: the cluster-mean contrast is identical to the naive estimate at +0.6723, because a size-weighted regression of workspace means on a binary treatment indicator reproduces the user-level difference in means. What changed is the standard error. At 0.0652, it's roughly 19 times larger than the naive 0.0034 and reflects genuine variation across 50 cluster means (statsmodels WLS uses t(48) critical values in place of z=1.96, which is why the CI bounds differ slightly from a hand calculation with z). The 95% confidence interval expands to [+0.5412, +0.8035], which barely covers the ground truth. WLS has fixed the inference problem, so the standard error now reflects the actual design, but it hasn't fixed the identification problem. Control workspace means still include spillover-exposed users, so this estimate is a contaminated contrast you can't interpret as a clean ATE. The next step separates the two.
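The equality between the size-weighted cluster-mean contrast and the user-level difference in means can be checked on toy data with NumPy alone. The cluster sizes, effect size, and seed below are arbitrary assumptions; the weighted fit is solved by scaling rows with the square root of the weights, which is algebraically the same weighted least squares problem.

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = rng.integers(20, 60, size=10)   # deliberately unequal cluster sizes
treated_c = np.array([1, 0] * 5)        # cluster-level assignment
cluster = np.repeat(np.arange(10), sizes)
z = treated_c[cluster]                  # user-level treatment indicator
y = (rng.normal(5.0, 0.3, size=10)[cluster]
     + 0.8 * z
     + rng.normal(0, 0.3, size=len(z)))

# User-level difference in means (what OLS on one binary regressor returns).
user_diff = y[z == 1].mean() - y[z == 0].mean()

# Size-weighted least squares on cluster means, via sqrt-weight row scaling.
c_mean = np.array([y[cluster == c].mean() for c in range(10)])
X = np.column_stack([np.ones(10), treated_c])
w = np.sqrt(sizes)
beta, *_ = np.linalg.lstsq(X * w[:, None], c_mean * w, rcond=None)

print(f"user-level contrast:   {user_diff:+.6f}")
print(f"weighted cluster fit:  {beta[1]:+.6f}")
```

The two numbers agree to floating-point precision, which is why aggregation changes the standard error here but not the point estimate.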

Step 4: Two-Exposure Decomposition (Unbiased Direct and Spillover)

The two-exposure model treats each user's exposure as a three-category variable (direct, spillover, or pure control) and regresses the outcome on the two non-baseline categories (Aronow et al.). Pure control is the omitted reference, so both coefficients are directly interpretable: one is the direct effect of the feature, the other is the spillover effect on control users who collaborate across workspaces.
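Written out, the regression described above takes the following form. The symbols are notation chosen here for illustration, not taken from the original: \alpha is the pure-control mean, \tau_D the direct effect, and \tau_S the spillover effect.

```latex
Y_i = \alpha + \tau_D \, D_i + \tau_S \, S_i + \varepsilon_i,
\qquad
D_i = \mathbf{1}[\text{exposure}_i = \text{direct}],
\quad
S_i = \mathbf{1}[\text{exposure}_i = \text{spillover}]
```

Because the three exposure categories are mutually exclusive, the fitted \tau_D and \tau_S are simply the differences between each group's mean outcome and the pure-control mean.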

df["is_direct"] = (df["exposure"] == "direct").astype(int)
df["is_spillover"] = (df["exposure"] == "spillover").astype(int)
two_exp = smf.ols(
    "session_minutes_obs ~ is_direct + is_spillover",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["workspace_id"]})
direct = two_exp.params["is_direct"]
spillover = two_exp.params["is_spillover"]
direct_ci = two_exp.conf_int().loc["is_direct"].tolist()
spillover_ci = two_exp.conf_int().loc["is_spillover"].tolist()
print(f"Direct effect:     {direct:+.4f} min  (ground truth = +0.80)")
print(f"  SE:              {two_exp.bse['is_direct']:.4f}")
print(f"  95% CI:          [{direct_ci[0]:+.4f}, {direct_ci[1]:+.4f}]")
print(f"Spillover effect:  {spillover:+.4f} min  (ground truth = +0.20)")
print(f"  SE:              {two_exp.bse['is_spillover']:.4f}")
print(f"  95% CI:          [{spillover_ci[0]:+.4f}, {spillover_ci[1]:+.4f}]")
spillover_share = (df["exposure"] == "spillover").mean()
projected = direct + spillover_share * spillover
print(f"Spillover share of all users: {spillover_share:.4f}")
print(f"Projected total under full rollout: {projected:+.4f} min")

Expected output:

Direct effect:     +0.7284 min  (ground truth = +0.80)
  SE:              0.0647
  95% CI:          [+0.6016, +0.8552]
Spillover effect:  +0.2083 min  (ground truth = +0.20)
  SE:              0.0038
  95% CI:          [+0.2008, +0.2158]
Spillover share of all users: 0.1349
Projected total under full rollout: +0.7565 min

Here's what's happening: fitting on the three-category exposure with cluster-robust standard errors keyed to workspace_id yields two clean coefficients. The direct effect is +0.7284, with a 95% CI of [+0.6016, +0.8552], which includes the ground-truth value of +0.80. The point estimate still sits about 0.07 min below +0.80: the decomposition removes the spillover contamination, but the workspace baseline imbalance from Step 2 remains, because exposure indicators can't correct a chance difference between arms. The spillover effect is +0.2083, with a 95% CI of [+0.2008, +0.2158], which tightly covers the ground-truth +0.20. The spillover SE (0.0038) looks small for cluster-robust inference because the simulated spillover effect is uniform across all 25 control clusters; in real data with heterogeneous spillover intensity, you'll see the cluster-robust SE grow meaningfully larger. The projected total of +0.7565 min adds the spillover effect to the direct effect, scaled by the fraction of users expected to be spillover-exposed at a given deployment scale (0.1349 in this dataset). In a production deployment, you'd replace that fraction with whatever share your collaboration graph predicts will be spillover-exposed under your rollout plan. The projection is a design parameter in your rollout, so state the assumed share explicitly when you report the number.
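The Step 4 coefficients also reconcile with the naive estimate from Step 2. Because the control mean is a mix of pure-control and spillover-exposed users, the naive contrast equals the direct coefficient minus the spillover coefficient scaled by the spillover share of control users. The sketch below does that arithmetic with figures copied from the expected outputs above, so it's accurate only up to their four-decimal rounding.

```python
# Figures copied from the expected outputs of Steps 1, 2, and 4.
direct, spillover = 0.7284, 0.2083
n_spillover, n_pure_control = 6_744, 18_319

share_of_controls = n_spillover / (n_spillover + n_pure_control)
contamination = share_of_controls * spillover  # pull-down from spillover in controls
implied_naive = direct - contamination
residual_gap = 0.80 - direct                   # left over after the decomposition

print(f"Spillover share of control users: {share_of_controls:.4f}")
print(f"Bias from contamination:          {contamination:.4f}")
print(f"Implied naive contrast:           {implied_naive:+.4f}")
print(f"Residual gap to ground truth:     {residual_gap:.4f}")
```

The implied contrast lands within rounding of the naive +0.6723, and the two pieces (roughly 0.056 from contamination and 0.072 from the residual gap) add up to the 0.1277 bias reported in Step 2.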

Step 5: Cluster-Bootstrap Confidence Intervals

The cluster bootstrap resamples entire workspaces to test whether Step 4's analytic confidence intervals hold without assuming the central limit theorem has fully kicked in at K = 50. Analytic standard errors for a cluster design work well when K is large and workspaces are roughly equal in size; the bootstrap checks whether that holds for your actual data. Resampling individual users would undercount variance because users in the same workspace share the cluster assignment and the workspace-level baseline; the cluster bootstrap preserves that correlation structure.

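# Imports repeated so this block runs on its own; df comes from the earlier steps.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def naive_point(d):
    # User-level OLS that ignores both clustering and spillover.
    return smf.ols(
        "session_minutes_obs ~ treated_user", data=d
    ).fit().params["treated_user"]

def wls_point(d):
    # Collapse to one row per workspace, then weight each workspace mean by its size.
    w = (d.groupby("workspace_id").agg(
            ws_mean=("session_minutes_obs", "mean"),
            ws_size=("user_id", "count"),
            treated=("treated_workspace", "max")).reset_index())
    X = sm.add_constant(w["treated"])
    return sm.WLS(w["ws_mean"], X, weights=w["ws_size"]).fit().params["treated"]

def two_exp_point(d):
    # Two-exposure model: separate coefficients for direct and spillover exposure.
    # Only the point estimates are returned; the covariance type does not change them.
    fit = smf.ols(
        "session_minutes_obs ~ is_direct + is_spillover", data=d
    ).fit(cov_type="cluster", cov_kwds={"groups": d["workspace_id"]})
    return fit.params["is_direct"], fit.params["is_spillover"]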

rng_boot = np.random.default_rng(7)
ws_ids = df["workspace_id"].unique()
k = len(ws_ids)
reps = {"naive": [], "cluster_wls": [], "direct": [], "spillover": []}
for _ in range(500):
    # Draw k workspaces with replacement; a workspace drawn twice contributes its users twice.
    draw = rng_boot.choice(ws_ids, size=k, replace=True)
    sample = pd.concat(
        [df[df["workspace_id"] == wid] for wid in draw],
        ignore_index=True,
    )
    # Refit every estimator on the resampled data.
    reps["naive"].append(naive_point(sample))
    reps["cluster_wls"].append(wls_point(sample))
    d_b, s_b = two_exp_point(sample)
    reps["direct"].append(d_b)
    reps["spillover"].append(s_b)

for key, truth in [("naive", 0.80), ("cluster_wls", 0.80),
                   ("direct", 0.80), ("spillover", 0.20)]:
    arr = np.array(reps[key])
    lo, hi = np.percentile(arr, [2.5, 97.5])
    covers = "covers" if lo <= truth <= hi else "misses"
    print(f"{key:<13} 95% CI: [{lo:+.4f}, {hi:+.4f}]   ({covers} {truth:+.2f})")

Expected output:

naive         95% CI: [+0.5386, +0.7966]   (misses +0.80)
cluster_wls   95% CI: [+0.5386, +0.7966]   (misses +0.80)
direct        95% CI: [+0.5931, +0.8519]   (covers +0.80)
spillover     95% CI: [+0.2008, +0.2164]   (covers +0.20)

Here's what's happening: drawing 50 workspaces with replacement and refitting each estimator 500 times gives you a bootstrap distribution for every point estimate. The naive OLS and cluster WLS estimators produce identical bootstrap intervals because, when every user inherits their workspace's assignment, a size-weighted regression on workspace means is algebraically the same as the user-level difference in means, so the two return the same point estimate on every resample. Both intervals exclude the ground-truth +0.80 because both are biased by the two sources identified in Step 2 (spillover contamination and the workspace baseline imbalance). The direct-effect interval from the two-exposure model is [+0.5931, +0.8519], which includes +0.80. The spillover interval is [+0.2008, +0.2164], which tightly covers +0.20. The cluster bootstrap confirms what the analytic cluster-robust standard errors in Step 4 already showed: inference holds up without relying on asymptotic approximations at K = 50. Running this takes about one minute on a laptop.
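The matching naive and cluster-WLS intervals are not a coincidence. The sketch below uses four made-up workspaces (not the tutorial's data) and assumes every user carries their workspace's treatment flag. It shows that the size-weighted difference of workspace means equals the user-level difference in means, while equal weighting gives a different number.

```python
# Hypothetical workspaces: (treated flag, list of user outcomes).
workspaces = [
    (1, [10.0, 12.0, 11.0]),
    (1, [14.0]),
    (0, [9.0, 8.0]),
    (0, [10.0, 11.0, 9.0, 10.0]),
]

def user_level_diff(ws):
    # Difference in means over individual users (what the naive OLS coefficient equals).
    t = [y for flag, ys in ws if flag == 1 for y in ys]
    c = [y for flag, ys in ws if flag == 0 for y in ys]
    return sum(t) / len(t) - sum(c) / len(c)

def workspace_diff(ws, size_weighted):
    # Difference in workspace means, weighted by size or equally.
    def arm(a):
        rows = [(sum(ys) / len(ys), len(ys) if size_weighted else 1)
                for flag, ys in ws if flag == a]
        return sum(m * w for m, w in rows) / sum(w for _, w in rows)
    return arm(1) - arm(0)

print(user_level_diff(workspaces))                       # 2.25
print(workspace_diff(workspaces, size_weighted=True))    # 2.25
print(workspace_diff(workspaces, size_weighted=False))   # 3.25
```

Because the first two numbers agree on any dataset of this shape, they agree on every bootstrap resample, which is why the two intervals coincide.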

When Cluster Randomization Fails

Cluster randomization solves the SUTVA problem when its assumptions hold, and it produces biased estimates that look clean when they don't. Of the four failure modes below, three each map to a named identification assumption; the remaining one concerns estimator efficiency when cluster sizes are unequal.

Too few clusters (violates sufficient cluster count). Cluster-robust standard errors rely on a central limit theorem across clusters, and practitioners often use K ≥ 30 as a working floor, though the appropriate threshold depends on heterogeneity in cluster sizes and the choice of test statistic (MacKinnon & Webb, 2017). A collaborative AI feature rolled out to four customer accounts doesn't clear that bar. Cluster-robust standard errors with K = 4 are anticonservative, and the resulting confidence intervals are too narrow. When K is small, randomization inference or a cluster wild bootstrap gives you valid p-values.
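As a rough illustration of randomization inference with few clusters, the sketch below enumerates every way of assigning 4 of 8 clusters to treatment and compares the observed difference in cluster means against that exact null distribution. The cluster means and the observed assignment are hypothetical values invented for this example.

```python
from itertools import combinations

# Hypothetical cluster-level mean outcomes; the first four clusters were treated.
cluster_means = [12.1, 11.4, 12.9, 11.8, 10.9, 11.2, 10.7, 11.5]
treated_idx = (0, 1, 2, 3)

def diff_in_cluster_means(treated):
    t = [cluster_means[i] for i in treated]
    c = [m for i, m in enumerate(cluster_means) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

observed = diff_in_cluster_means(treated_idx)

# Under the sharp null of no effect, every assignment of 4 clusters is equally likely.
assignments = list(combinations(range(len(cluster_means)), len(treated_idx)))
null_diffs = [diff_in_cluster_means(a) for a in assignments]

# Two-sided exact p-value: share of assignments at least as extreme as the observed one.
p_value = sum(abs(d) >= abs(observed) - 1e-12 for d in null_diffs) / len(null_diffs)
print(f"observed diff = {observed:.3f}, assignments = {len(assignments)}, p = {p_value:.4f}")
```

The same enumeration shows why K = 4 is so limiting: with two of four clusters treated there are only six possible assignments, so the two-sided p-value can never fall below 2/6.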

Cluster boundary does not contain the interference graph (violates partial interference). Cluster randomization assumes interference is confined within workspaces. If your users collaborate heavily across workspaces through Slack Connect channels, external shared documents, or customer community forums, partial interference is a fiction, and spillover bleeds across every cluster boundary. The two-exposure model can absorb modest cross-cluster leakage because the spillover coefficient captures whatever spillover your exposure flag measures. When leakage is structural, you need the observed collaboration graph and a graph-cluster randomization design that builds clusters from the collaboration structure itself (Ugander et al.).
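One simple way to audit partial interference before committing to a design is to measure how many collaboration edges cross workspace boundaries. The sketch below assumes you can export a user-to-workspace mapping and a list of collaboration pairs; the names user_workspace, edges, and cross_workspace_share are hypothetical, and the data is a toy example.

```python
# Hypothetical mapping of users to workspaces.
user_workspace = {
    "u1": "A", "u2": "A", "u3": "A",
    "u4": "B", "u5": "B", "u6": "B",
}

# Hypothetical collaboration edges (shared documents, channels, and so on).
edges = [
    ("u1", "u2"), ("u1", "u3"), ("u2", "u3"),   # inside workspace A
    ("u4", "u5"), ("u4", "u6"), ("u5", "u6"),   # inside workspace B
    ("u3", "u4"), ("u1", "u6"),                 # cross-workspace collaboration
]

def cross_workspace_share(edges, user_workspace):
    # Fraction of edges whose two endpoints sit in different workspaces.
    crossing = sum(user_workspace[a] != user_workspace[b] for a, b in edges)
    return crossing / len(edges)

share = cross_workspace_share(edges, user_workspace)
print(f"cross-workspace edge share: {share:.2f}")   # 0.25
```

How large a share is tolerable is a judgment call this sketch does not settle; a high share is a signal that workspace boundaries do not contain the interference graph.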

Heterogeneous cluster sizes that make the aggregation inefficient (estimator efficiency). Equal-weighted cluster means treat a 50-user workspace the same as a 5,000-user workspace, which is a poor efficiency trade when the variance of a workspace's mean depends on the number of users in it. The fix is weighted least squares by workspace size, or a mixed-effects model with workspace random intercepts. This is an efficiency concern with no bearing on identification, and that distinction matters: the point estimate stays consistent under either weighting choice.

Post-hoc cluster construction (violates exchangeability). Building cluster assignments after observing outcomes is the cleanest way to turn a valid design into p-hacking. Define your clusters and commit to them before the randomization, ideally in a pre-registered analysis plan. Any post-hoc adjustment to cluster boundaries (dropping a workspace with extreme outcomes, merging small workspaces into a composite, redefining spillover exposure after inspecting the data) reintroduces selection bias that no standard-error correction can fix.

Two additional threats deserve attention in real deployments.

Cluster-level SUTVA fails under partial feature adoption. The cluster-level SUTVA assumption requires that a workspace's treatment is a single, well-defined package. That breaks down when a feature rolls out at different adoption rates within a single workspace, or when multiple feature versions coexist (advanced for power users, basic for casual users). In that case, the cluster-level "treatment" conflates multiple effects, and the estimand is no longer interpretable.

Workspace-level confounders when randomization isn't mechanical. In enterprise deployments, workspace selection into the treated arm is often not fully random. Beta programs attract tech-forward accounts; customer success teams influence which clients get early access. When exchangeability is violated before the coin flip, cluster-robust standard errors cannot correct for pre-existing systematic differences between the treated and control workspace pools. A balance check on observable workspace characteristics (size, industry, baseline engagement) and regression adjustment at the cluster level are the standard remedies.
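A balance check can be as small as a standardized mean difference per workspace characteristic. The sketch below uses hypothetical workspace sizes for each arm. The 0.1 cutoff is a commonly used rule of thumb, treated here as an assumption and not a threshold taken from this tutorial.

```python
from statistics import mean, variance

def standardized_mean_diff(treated, control):
    # Difference in means scaled by the pooled standard deviation of the two arms.
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical workspace sizes (number of users) in each arm.
treated_sizes = [120, 340, 95, 410, 220]
control_sizes = [80, 150, 60, 200, 110]

smd = standardized_mean_diff(treated_sizes, control_sizes)
flag = "imbalanced" if abs(smd) > 0.1 else "balanced"   # 0.1 is an assumed rule of thumb
print(f"workspace size SMD: {smd:+.2f} ({flag})")
```

Running the same function over industry dummies and baseline engagement gives a compact balance table to attach to the analysis plan.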

These failure modes stay invisible in your regression coefficients. They surface later, in the gap between the offline estimate and the production rollout. Cluster counts, collaboration graph audits, and a written pre-registration are your only real defenses.

What To Do Next

Cluster randomization is the right tool when collaboration within a workspace creates spillover effects that break user-level SUTVA, and when your clusters are natural and observable (workspaces, teams, accounts, physical stores). If the interference you care about spans geographic markets, or plays out over time inside a two-sided marketplace where drivers and riders are matched from one shared pool, switchback experiments that randomize time slots fit better. If your treatment is assigned at the individual level but you suspect unobserved cross-user confounders, an instrumental variable analysis with a design-based instrument provides a cleaner identification strategy. When interference is known and complex, graph-cluster randomization with Horvitz-Thompson weighted exposure estimators gives you unbiased effect estimates without forcing every cluster boundary to contain every interference path.

The companion notebook for this tutorial lives at github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/05_cluster_randomization. Clone the repo, generate the synthetic dataset, and run cluster_randomization_demo.ipynb (or cluster_randomization_demo.py) to reproduce every code block, every number, and every figure from this tutorial.

When a collaborative AI feature ships to teams who share their work, the user-level A/B estimate is almost always wrong. Cluster randomization plus a two-exposure model gives you the direct effect and the spillover effect separately, and the cluster bootstrap gives you an interval you can defend when a stakeholder asks how much of the lift comes from the feature and how much comes from teammates talking to each other.


