Clustered Standard Errors in AB Tests

Author: Murphy  |  2025-03-22

CAUSAL DATA SCIENCE

Cover, image by Author

A/B tests are the gold standard of Causal Inference because they allow us to make valid causal statements under minimal assumptions, thanks to randomization. By randomly assigning a treatment (a drug, ad, product, …), we can compare the outcome of interest (a disease, firm revenue, customer satisfaction, …) across subjects (patients, users, customers, …) and attribute the average difference in outcomes to the causal effect of the treatment.

Sometimes the unit of treatment assignment differs from the unit of observation. In other words, we do not decide whether to treat every single observation independently, but rather in groups. For example, we might treat all customers in a certain region while observing outcomes at the customer level, or treat all articles of a certain brand while observing outcomes at the article level. Usually this happens because of practical constraints. In the first example, so-called geo-experiments, it happens because cookie deprecation prevents us from tracking individual users.

When this happens, treatment effects are not independent across observations anymore. In fact, if a customer in a region is treated, also other customers in the same region will be treated. If an article of a brand is not treated, also other articles of the same brand will not be treated. When doing inference, we have to take this dependence into account: standard errors, confidence intervals, and p-values should be adjusted. In this article, we will explore how to do that using cluster-robust standard errors.


Customers, Orders, and Standard Errors

Imagine you are an online platform interested in increasing sales. You just had a great idea: showing a carousel of related articles at checkout to incentivize customers to add other articles to their basket. In order to understand whether the carousel increases sales, you decide to A/B test it. In principle, you could decide at random, for every order, whether to display the carousel or not. However, this would give customers an inconsistent experience, potentially harming the business. Therefore, you decide to randomize the treatment assignment, the carousel, at the customer level: treated customers see the carousel on every order, while control customers never see it.

I import the data-generating process (DGP) from src.dgp_collection and the plotting theme from src.theme.

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import numpy as np
from numpy.linalg import inv
import pandas as pd
​
from src.theme import *
from src.dgp_collection import dgp_carousel

We generate simulated data for 3,000 orders from 100 customers. For each order, we observe the order_id, which is also the dataset index; the customer_id of the customer who placed the order; whether the carousel was displayed at checkout, the treatment; and the order revenue, the outcome of interest.

n = 3000
dgp = dgp_carousel(n=n, w="carousel", y=["revenue"])
df = dgp.generate_data()
df.head()
Data snapshot, image by Author
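The dgp_carousel class lives in the article's companion repository. For readers who want to follow along without it, here is a minimal, self-contained sketch of a comparable data-generating process. The function name, noise levels, and the true effect of zero are my assumptions, chosen to mirror the article's setup: treatment assigned at the customer level, outcomes observed at the order level.

```python
import numpy as np
import pandas as pd

def simulate_carousel_data(n=3000, n_customers=100, seed=0):
    """Sketch of a customer-clustered DGP: treatment is assigned at the
    customer level, outcomes at the order level, true effect is zero."""
    rng = np.random.default_rng(seed)
    customer_id = rng.integers(0, n_customers, size=n)     # customer placing each order
    treated = rng.binomial(1, 0.5, size=n_customers)       # 50/50 customer-level assignment
    customer_effect = rng.normal(0, 10, size=n_customers)  # source of within-customer correlation
    revenue = 50 + customer_effect[customer_id] + rng.normal(0, 5, size=n)
    return pd.DataFrame({
        "customer_id": customer_id,
        "carousel": treated[customer_id],  # all orders of a customer share the treatment
        "revenue": revenue,
    })

df_sim = simulate_carousel_data()
```

Every order of a given customer shares the same carousel value, which is exactly the clustering structure the rest of the article deals with.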

Since the treatment was randomized, we can estimate the average treatment effect by comparing the average revenue of treated orders against the average revenue of control (untreated) orders. Randomization ensures that treated and untreated orders are comparable on average, except for the treatment itself, so we can attribute any observable difference to the causal effect of the treatment. We can estimate the difference-in-means using linear regression, by regressing revenue on the carousel dummy variable. The reported OLS coefficient is the estimated average treatment effect.

import statsmodels.formula.api as smf
smf.ols("revenue ~ carousel", data=df).fit().summary().tables[1]
Linear regression summary table, image by Author

It seems that including the carousel decreased revenue by 1.63€ per order. The treatment effect appears statistically significant at the 1% level, since the reported p-value is 0.005. However, this is not a standard A/B test: we didn't randomize each individual order, but rather each customer. Two orders placed by the same customer could never end up in different treatment groups. Because of this, treatment effects are correlated across observations, while we computed our standard errors assuming that they are independent.

Are the reported standard errors correct? How can we "check" it and what can we do about it?


Standard Errors

Which standard errors are "correct"?

The answer to this question depends on what we consider random and what we consider fixed. In the frequentist sense, standard errors measure uncertainty across "states of the world" or "parallel universes" that would have happened if we had seen a different realization of the random part of the data-generating process.

In this particular case, there is one variable that is uncontroversially random: the treatment assignment. In fact, we randomized it ourselves. Not only that, we also know how it is random: each customer had a 50% chance of seeing the carousel and a 50% chance of not seeing it.

Therefore, we would like our estimated standard errors to measure the variation in estimated treatment effects across alternative treatment assignments. This is normally an abstract concept since we cannot re-run the exact same experiment. However, we can in our case, since we are in a simulated environment.

The DGP class has an evaluate_f_redrawing_assignment function that allows us to evaluate a function of the data f(X) by re-generating the data under different random treatment assignments. The function we want to evaluate is our linear regression for treatment effect estimation.

def estimate_treatment_effect(df):
    reg = smf.ols("revenue ~ carousel", data=df).fit()
    return reg.params["carousel"], reg.bse["carousel"]

We repeat the treatment effect estimation over 1000 different random assignments.

n_draws = 1_000
ols_coeff = dgp.evaluate_f_redrawing_assignment(f=estimate_treatment_effect, n_draws=n_draws)

In the figure below, we plot the distribution of the estimated OLS coefficients and standard errors. In the plot of the estimated standard errors (on the right), we also report a vertical line with the standard deviation of the estimated coefficients across simulations (on the left).

def plot_coefficients(coeffs):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    # Estimated coefficients
    coeffs_est = list(zip(*coeffs))[0]
    sns.histplot(coeffs_est, color="C1", ax=ax1)
    ax1.set(title="Distribution of estimated coefficients")
    ax1.legend([f"Average: {np.mean(coeffs_est):.3f}\nStd. Dev: {np.std(coeffs_est):.3f}"])

    # Estimated standard errors
    coeffs_std = list(zip(*coeffs))[1]
    sns.histplot(coeffs_std, color="C0", ax=ax2)
    ax2.set(title="Distribution of estimated std. errors")
    ax2.legend([f"Average: {np.mean(coeffs_std):.3f}\nStd. Dev: {np.std(coeffs_std):.3f}"])
    ax2.axvline(np.std(coeffs_est), lw=2, ls="--", c="C1")

plot_coefficients(ols_coeff)
Distribution of regression coefficients and standard errors across simulations, image by Author

The average coefficient is very close to zero. Indeed, the true treatment effect is zero. However, the standard deviation of the estimated coefficients is 2.536, very different from the standard error of 0.576 reported in the linear regression table. A standard error of 2.536 would have implied a p-value of around 0.5, extremely far from any statistical significance threshold. On the right panel, we see that this is not by chance: standard OLS consistently underestimates the variability of the coefficient, by a factor of around 5.

Could we have detected this issue in reality, without the possibility of re-randomizing the treatment assignment? Yes, by bootstrapping the standard errors. If you have never heard of bootstrapping, I wrote an article about it, and about an interesting technique to make bootstrapping faster: the Bayesian bootstrap.

The Cluster Bootstrap

Bootstrapping essentially consists of drawing samples of the data with replacement and re-computing the target statistic on each bootstrapped sample. We can then estimate its standard deviation by simply computing the standard deviation of the statistic across samples.

In this case, it's important to sample the data consistently with the data generating process: by customer and not by order.

boot_estimates = []
customers = df.customer_id.unique()
for k in range(1000):
    np.random.seed(k)
    boot_customers = np.random.choice(customers, size=len(customers), replace=True)
    df_boot = pd.concat([df[df.customer_id == id] for id in boot_customers])
    reg = smf.ols("revenue ~ carousel", data=df_boot).fit()
    boot_estimates += [(reg.params["carousel"], reg.bse["carousel"])]

In the figure below, we plot the distribution of estimated coefficients and standard errors. The standard errors are still wrong, but the distribution of bootstrap estimates has a standard deviation of 2.465, very close to the simulated value of 2.536.

plot_coefficients(boot_estimates)
Distribution of regression coefficients and standard errors across bootstraps, image by Author

Note that to use bootstrapping to detect the issue in the linear regression standard errors, one would have needed to be aware that the assignment was at the customer level. Bootstrapping the data at the order level would have still underestimated the standard errors.


Clustered Standard Errors

What's the problem with the standard errors reported in the linear regression table?

The problem is that the usual formula to compute standard errors in linear regression (more on the math later) assumes that observations are independent and identically distributed (i.i.d.). However, in our case, the independence assumption is not satisfied. Why?

In our example, the unit of treatment assignment, a customer, is different from the unit of observation, an order. This is a problem because, under different treatment assignments, all orders of a certain customer are either treated or not: they move in blocks, and it cannot happen that two orders of the same customer are split across control and treatment groups. The consequence of orders "moving in blocks" is that there is more variability in our estimates than there would be if orders moved independently. Intuitively, there is less balance between the two groups, on average. As a consequence, standard errors computed assuming independence typically underestimate the actual variability of the estimator.
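The inflation from "moving in blocks" has a classic closed form: for equal-sized clusters of size m with within-cluster correlation ρ, the variance of the sample mean is the i.i.d. variance times the design effect 1 + (m − 1)ρ. A quick simulation (with illustrative parameters, unrelated to the article's DGP) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, m, rho = 200, 10, 0.5
sigma2 = 1.0

# Draw many datasets with a shared cluster component and compare the
# empirical variance of the sample mean with the i.i.d. formula.
means = []
for _ in range(2000):
    cluster_part = rng.normal(0, np.sqrt(rho * sigma2), size=n_clusters)
    noise = rng.normal(0, np.sqrt((1 - rho) * sigma2), size=(n_clusters, m))
    x = cluster_part[:, None] + noise          # within-cluster correlation = rho
    means.append(x.mean())

n = n_clusters * m
var_iid = sigma2 / n                           # what naive standard errors assume
design_effect = 1 + (m - 1) * rho              # variance inflation from clustering
print(np.var(means) / var_iid, design_effect)  # both close to 5.5
```

With these parameters the true variance of the mean is about 5.5 times what the i.i.d. formula predicts, the same order of underestimation we saw in the carousel simulation.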

Is there a solution? Yes!

The solution is to use the so-called cluster-robust standard errors that allow observations to be correlated within a cluster of observations, a customer in our case. Cluster-robust standard errors are usually very simple to implement and available in all statistical packages. With statsmodels we have to specify that the covariance type is cluster and that the groups are by customer.

smf.ols("revenue ~ carousel", data=df).fit(cov_type='cluster', cov_kwds={'groups': df["customer_id"]}).summary().tables[1]
Regression table summary with cluster-adjusted standard errors, image by Author

The estimated cluster-robust standard error is equal to 2.471, extremely close to the simulated standard error of 2.536. Note that the estimated coefficient has not changed (-1.6292 as before).

How do cluster-robust standard errors work? We dig deeper into the math in the next section.

Some Math

The general formula of the variance of the OLS estimator is

Variance of the OLS estimator, image by Author

where X is the matrix of regression covariates and ε is the vector of residuals. The key component of this formula is the central n × n matrix Ω = εε′, where n is the number of observations. It is key because it is the only object that we need to estimate in order to compute the variance of the OLS estimator.
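In symbols, the formula in the image is the "sandwich":

```latex
\operatorname{Var}(\hat{\beta}) \;=\; (X'X)^{-1}\, X'\,\Omega\, X\, (X'X)^{-1},
\qquad \Omega = \varepsilon\varepsilon'
```

The outer "bread" terms (X′X)⁻¹ are known from the data; the assumptions discussed below only concern how we estimate the "meat" Ω.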

At first, it might be tempting to just estimate Ω using the regression residuals e = y − ŷ, where y is the vector of outcomes and ŷ is the vector of regression predictions. However, the problem is that, by construction, the product of X and e is zero, and therefore the estimated variance would be zero.

X = np.c_[np.ones(n), df["carousel"]]  # design matrix: intercept and treatment dummy
y = df[["revenue"]].values             # outcome vector
e = y - X @ inv(X.T @ X) @ (X.T @ y)   # OLS residuals
X.T @ e
array([[5.72555336e-11],
       [6.68904931e-11]])

The simplest assumption we can make is called homoscedasticity: regression residuals are independent of each other and they all have the same variance. In practice, homoscedasticity implies that the Ω matrix is diagonal, with the same number, σ², in each diagonal cell.

Omega matrix under homoscedasticity, image by Author

where Iₙ is the identity matrix of dimension n.

Under homoscedasticity, the OLS variance simplifies to

OLS coefficient variance under homoscedasticity, image by Author

and it's estimated by plugging in the variance of the residuals

y_hat = X @ inv(X.T @ X) @ (X.T @ y)
e = y - y_hat
np.var(e)
245.20230307247473

So the estimated variance of the OLS estimator is given by

Estimated variance of the OLS coefficients under homoscedasticity, image by Author
np.var(e) * inv(X.T @ X)
array([[ 0.14595375, -0.14595375],
       [-0.14595375,  0.33171307]])

The estimated standard error is just the square root of the bottom-right value.

np.sqrt(np.var(e) * inv(X.T @ X))[1,1]
0.5759453727032665

The computed standard error is indeed the same that was reported before in the linear regression table, 0.576.

Homoscedasticity is a very restrictive assumption, and it is usually relaxed by allowing different residual variances across observations. This weaker assumption is called heteroscedasticity.

Omega matrix under heteroscedasticity, image by Author

Under heteroscedasticity, the formula of the variance of the OLS estimator does not simplify anymore.

Sigma = np.eye(len(df)) * (e @ e.T)
np.sqrt(inv(X.T @ X) @ X.T @ Sigma @ X @ inv(X.T @ X))[1,1]
0.5757989320663413

In our case, the estimated standard error under heteroscedasticity is virtually the same: 0.576.

In certain scenarios, even the heteroscedasticity assumption can be too restrictive, when observations are correlated and regression residuals are not independent. In that case, we might want to allow certain off-diagonal elements of the Ω matrix to be different from zero. But which ones?

In many settings, including our example, it is reasonable to assume that observations are correlated within certain clusters, but with the same within-cluster variance. For example, if clusters are observation pairs, the Ω matrix takes the following form.

Omega matrix under cluster correlation, image by Author

We can now compute the estimated standard errors under intra-cluster correlation.

customer_id = df[["customer_id"]].values
Sigma = (customer_id == customer_id.T) * (e @ e.T)
np.sqrt(inv(X.T @ X) @ X.T @ Sigma @ X @ inv(X.T @ X))[1,1]
2.458350214507729
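The hand-computed 2.458 differs slightly from the 2.471 in the statsmodels table because statsmodels applies a finite-sample correction by default. Scaling the variance by the Stata-style factor G/(G − 1) · (n − 1)/(n − k), with G clusters, n observations, and k regressors, reconciles the two numbers:

```python
import numpy as np

se_raw = 2.458350214507729  # hand-computed cluster-robust standard error from above
n, G, k = 3000, 100, 2      # observations, clusters, regressors

# Stata-style finite-sample correction for clustered variances
correction = (G / (G - 1)) * ((n - 1) / (n - k))
se_corrected = se_raw * np.sqrt(correction)
print(round(se_corrected, 3))  # 2.471, matching the statsmodels table
```

The correction matters more with few clusters; with G = 100 it changes the standard error by about half a percent.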

We get essentially the same number reported in the regression table; the small remaining gap is due to a finite-sample correction that statsmodels applies by default. But what's the intuition behind this formula?

Intuition

To get an intuition for the estimated standard errors, imagine that we had a regression model with a single covariate: a constant (an intercept-only model). In that case, the estimated cluster-robust variance of the OLS estimator is just the sum of all the within-cluster cross-products of residuals, scaled by 1/n².

Variance of the OLS estimator with no controls, image by Author

where g indexes clusters and n_g is the number of observations within cluster g. If observations are independent, the expected value of the product of the residuals of two different observations is zero, so only the squared terms survive and we recover the usual variance formula.
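This intercept-only special case is easy to verify numerically on simulated data (the cluster sizes and noise levels below are illustrative): the cluster-robust variance of the sample mean equals the sum of squared within-cluster residual sums, divided by n².

```python
import numpy as np

rng = np.random.default_rng(0)
n, G = 300, 30
cluster = np.repeat(np.arange(G), n // G)  # 30 clusters of 10 observations each
y = rng.normal(0, 1, size=n) + rng.normal(0, 1, size=G)[cluster]

e = y - y.mean()  # residuals of the intercept-only regression

# Matrix form: (X'X)^-1 X' Sigma X (X'X)^-1 with X a column of ones
same_cluster = cluster[:, None] == cluster[None, :]
Sigma = same_cluster * np.outer(e, e)
var_matrix = Sigma.sum() / n**2

# Intuition form: sum of squared within-cluster residual sums
var_sums = sum(e[cluster == g].sum() ** 2 for g in range(G)) / n**2

print(var_matrix, var_sums)  # identical up to floating point
```

The two computations agree because, for a column of ones, the sandwich formula collapses to summing every within-cluster residual cross-product, which factorizes as the square of each cluster's residual sum.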
