Understanding Causal Trees
How to use regression trees to estimate heterogeneous treatment effects

In Causal Inference, we are usually interested in estimating the causal effect of a treatment (a drug, ad, product, …) on an outcome of interest (a disease, firm revenue, customer satisfaction, …). However, knowing that a treatment works on average is often not sufficient and we would like to know for which subjects (patients, users, customers, …) it works better or worse, i.e. we would like to estimate heterogeneous treatment effects.
Estimating heterogeneous treatment effects allows us to use the treatment selectively and more efficiently through targeting. Knowing which customers are more likely to react to a discount allows a company to spend less money by offering fewer but better-targeted discounts. This also works for negative effects: knowing for which patients a certain drug has side effects allows a pharmaceutical company to warn them or exclude them from the treatment. There is also a more subtle advantage of estimating heterogeneous treatment effects: knowing for whom a treatment works allows us to better understand how the treatment works. Knowing that the effect of a discount depends not on the income of its recipients but rather on their buying habits tells us that maybe it is not a matter of money, but rather a matter of attention or loyalty.
In this article, we will explore the estimation of heterogeneous treatment effects using a modified version of regression trees (and forests). From a machine-learning perspective, there are two fundamental differences between causal trees and predictive trees. First of all, the target is the treatment effect, which is an inherently unobservable object. Second, we are interested in doing inference, which means quantifying the uncertainty of our estimates.
Online Discounts and Targeting
For the rest of the article, we are going to use a toy example for the sake of exposition: suppose we are an online shop interested in understanding whether offering discounts to new customers increases their expenditure. In particular, we would like to know if offering discounts is more effective for some customers than for others, since we would prefer not to give discounts to customers who would spend anyway. Moreover, it could also be that spamming customers with pop-ups deters them from buying, having the opposite effect.

To understand whether and how much the discounts are effective, we run an A/B test: whenever a new user visits our online shop, we randomly decide whether to offer them the discount or not. I import the data-generating process `dgp_online_discounts()` from [src.dgp](https://github.com/matteocourthoud/Blog-Posts/blob/main/notebooks/src/dgp.py). With respect to previous articles, I generated a new DGP parent class that handles randomization and data generation, while its children classes contain specific use cases. I also import some plotting functions and libraries from [src.utils](https://github.com/matteocourthoud/Blog-Posts/blob/main/notebooks/src/utils.py). To include not only code but also data and tables, I use Deepnote, a Jupyter-like web-based collaborative notebook environment.
We have data on 100,000 website visitors, for whom we observe the `time` of the day, the `device` they use, their `browser`, and their geographical `region`. We also see whether they were offered the `discount`, our treatment, and their `spend`, the outcome of interest.
Since the treatment was randomly assigned, we can use a simple difference-in-means estimator to estimate the treatment effect. We expect the treatment and control groups to be similar, except for the `discount`, therefore we can causally attribute any difference in `spend` to the `discount`.
The discount seems to be effective: on average, spend in the treatment group increases by $1.95. But are all customers equally affected?
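As a minimal sketch of the difference-in-means estimator, here is how it could be computed with NumPy. The data is simulated as a stand-in for the article's `dgp_online_discounts()` (which is not reproduced here), with a true average effect of 2 chosen by assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the article's data: a random 50/50 discount
# assignment and a spend outcome with a true average effect of 2
n = 100_000
discount = rng.binomial(1, 0.5, n)
spend = 10 + 2 * discount + rng.normal(0, 3, n)

# Difference-in-means estimator: mean(treated) - mean(control)
ate = spend[discount == 1].mean() - spend[discount == 0].mean()

# Standard error of the difference in means
se = np.sqrt(spend[discount == 1].var() / (discount == 1).sum()
             + spend[discount == 0].var() / (discount == 0).sum())
print(f"ATE: {ate:.2f} (s.e. {se:.2f})")
```

With 100,000 observations, the estimate lands very close to the true effect, which is why the simple comparison of means is enough under randomization.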
To answer this question, we would like to estimate heterogeneous treatment effects, possibly at the individual level.
Heterogeneous Treatment Effects
There are many possible ways to estimate heterogeneous treatment effects. The most common is to split the population into groups based on some observable characteristic, which in our case could be the `device`, the `browser`, or the geographical `region`. Once you have decided which variable to split your data on, you can simply interact the treatment variable (`discount`) with the dimension of treatment heterogeneity. Let's take `device`, for example.
How do we interpret the regression results? The effect of the `discount` on customers' `spend` is $1.22, but it increases by a further $1.44 if the customer accesses the website from a mobile `device`.
Splitting is easy for categorical variables, but for a continuous variable like `time` it is not obvious where to split. Every hour? And which dimension is more informative? It would be tempting to try all possible splits, but the more we split the data, the more likely we are to find spurious results (i.e., we overfit, in machine-learning lingo). It would be great if we could let the data speak and select the fewest, most informative splits.
In a separate post, I have shown how the so-called meta-learners take this approach to causal inference. The idea is to predict the outcome conditional on the treatment status for each observation, and then compare the predicted outcome under treatment with the predicted outcome under control. The difference is the individual treatment effect.
The problem with meta-learners is that they use all their degrees of freedom to predict the outcome, while we are interested in predicting treatment effect heterogeneity. If most of the variation in the outcome does not lie along the treatment dimension, we will get very poor estimates of the treatment effects.
Is it possible to instead concentrate directly on the prediction of individual treatment effects? Let's define Y as the outcome of interest (`spend`), D as the treatment (`discount`), and X as the other observable characteristics. The ideal loss function is
L(τ̂) = (1/n) Σᵢ (τᵢ − τ̂(Xᵢ))²

where τᵢ is the treatment effect of individual i. However, this objective function is infeasible, since we do not observe τᵢ.
But it turns out that there is a way to get an unbiased estimate of the individual treatment effect. The idea is to use an auxiliary outcome variable whose expected value for each individual is the individual treatment effect. This variable is
Yᵢ* = Yᵢ · (Dᵢ − p(Xᵢ)) / (p(Xᵢ) · (1 − p(Xᵢ)))

where p(Xᵢ) is the propensity score of observation i, i.e. its probability of being treated.
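We can check numerically that this transformed outcome, Y* = Y · (D − p(X)) / (p(X) · (1 − p(X))), has the treatment effect as its expectation. In this sketch the data is simulated with a known constant effect of 2 and a known propensity of 0.5, both by assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Simulated data with a known average treatment effect of 2
p = 0.5                                  # known propensity score
D = rng.binomial(1, p, n)
Y = 10 + 2 * D + rng.normal(0, 3, n)

# Transformed outcome: Y* = Y * (D - p) / (p * (1 - p)).
# Each Y*_i is very noisy, but its expectation is the treatment effect.
Y_star = Y * (D - p) / (p * (1 - p))
print(f"mean(Y*): {Y_star.mean():.2f}")
```

Note that individual values of Y* are extremely noisy (with p = 0.5 they are ±2Y), so only its conditional averages, not single observations, are informative about treatment effects.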
In randomized experiments, the propensity score is known since randomization is fully under the control of the experimenter. For example, in our case, the probability of treatment was 50%. In quasi-experimental studies instead, when the treatment probability is not known, it has to be estimated. Even in randomized experiments, it is always better to estimate rather than impute the propensity scores, since it guards against sampling variation in the randomization. For more details on the propensity scores and how they are used in causal inference, I have a separate post here.
Let's first generate dummy variables for our categorical variables, `device`, `browser`, and `region`.
We fit a `LogisticRegression` and use it to predict the treatment probability, i.e. to construct the propensity score.
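A minimal sketch of this step with scikit-learn, on simulated stand-in covariates (the column names here are illustrative, not the article's exact dummies):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 100_000

# Simulated covariates standing in for the article's device/browser/region
df = pd.DataFrame({
    "time": rng.uniform(0, 24, n),
    "mobile": rng.binomial(1, 0.5, n),
})
df["discount"] = rng.binomial(1, 0.5, n)  # randomized treatment

# Estimate the propensity score p(X) = Pr(discount = 1 | X)
X = df[["time", "mobile"]]
logit = LogisticRegression().fit(X, df["discount"])
df["pscore"] = logit.predict_proba(X)[:, 1]
print(df["pscore"].describe())
```

Because the treatment is randomized, the estimated scores should cluster tightly around 0.5, exactly as described below.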

As expected, most propensity scores are very close to 0.5, the probability of treatment used in randomization. Moreover, the distribution is almost identical across the treatment and control groups, further confirming that randomization worked. If it had not been the case, we would have needed to make further assumptions in order to conduct a causal analysis. The most common one is unconfoundedness, also known as ignorability or selection on observables. In short, we will assume that conditional on some observables