Why Understanding the Data-Generation Process Is More Important Than the Data Itself


Even in infancy, our brains learn to associate correlation with causation and try to find an explanation for everything happening around us. If a car behind us takes the same turns we do for a long time, we assume it is following us, which is a causal assumption. When we snap out of movie mode, though, we realize we are probably just heading to the same destination: a common cause, a confounder, that introduces a correlation between the two cars' movements. This vivid, relatable example from Pearl illustrates how the human brain works.

What about correlations for which we cannot fathom a reasonable explanation, such as two diseases that are uncorrelated in the whole population but correlated among hospitalized patients? As my last article on causal structures points out, conditioning on a collider (here, hospitalization) generates an explain-away effect that makes two uncorrelated variables spuriously correlated. In other words, the hospitalized population is not an accurate representation of the general population, and observations made from this sample cannot be generalized.

Collider Bias, image by author based on "The Book of Why" Chapter 6
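To make this concrete, here is a minimal simulation sketch of the hospital example. The prevalences and the admission rule are made up for illustration (they are not numbers from the book): two diseases that are independent in the general population become negatively correlated once we look only at hospitalized patients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent diseases in the general population (assumed prevalences).
disease_a = rng.random(n) < 0.10
disease_b = rng.random(n) < 0.10

# Collider: having either disease makes hospitalization likely.
hospitalized = (disease_a | disease_b) & (rng.random(n) < 0.8)

corr_all = np.corrcoef(disease_a, disease_b)[0, 1]
corr_hosp = np.corrcoef(disease_a[hospitalized], disease_b[hospitalized])[0, 1]

print(f"Correlation in the general population:   {corr_all:+.3f}")   # ~ 0
print(f"Correlation among hospitalized patients: {corr_hosp:+.3f}")  # negative
```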

Collider-induced correlations are not intuitive to the human brain, which is what makes these situations feel like paradoxes. In this article, I will explore more interesting paradoxes that, like magic tricks, create optical illusions in our minds but can be explained with causal diagrams. Understanding what causes these paradoxes is both meaningful and educational. This is the fourth article in the "Read with Me" series, based on Chapters 5 and 6 of "The Book of Why" by Judea Pearl. It should be a fun read, given all the examples Pearl offers in these two chapters! You can find the previous article here:



Let's Make a Deal

The American television show "Let's Make a Deal" features a famous statistical problem known as the "Monty Hall Problem," named after the show's host, Monty Hall. In this game, a participant faces three doors: behind one is a highly sought-after prize, such as a car, and behind the other two are less desirable prizes, humorously referred to as goats. The rules are:

  • Step 1: The participant chooses one of the three doors as an initial selection, say door 1;
  • Step 2: The host, who knows what is behind each door, opens one of the other two doors to reveal a goat, say door 3;
  • Step 3: The participant has the option to stick with their original choice or switch to the remaining unopened door, i.e., the participant can stay with door 1 or choose door 2 instead.

The best strategy for winning the car is always to switch after the host reveals a goat behind one of the other doors. If you want to see the detailed calculation, you can find it in Question 9 of this article using Bayesian statistics.

12 Probability Practice Questions for Data Science Interviews

Or, you can refer to the following table, which shows the three scenarios and the outcomes of switching versus not switching, given that the participant has chosen door 1. The first row shows the case where the car is behind door 1: the host can open either door 2 or door 3 to reveal a goat, and the participant loses by switching. In row 2, the car is behind door 2, so the host has to open door 3; similarly, in row 3, the host can only open door 2. In both of those cases, switching wins. Since the car is equally likely to be behind any door, comparing the "Switch" and "Not Switch" columns shows that switching wins with probability 2/3, compared with 1/3 for staying. Thus, switching doors is the preferred strategy.

Table by author based on "The Book of Why" Chapter 6
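If the table feels too abstract, a quick Monte Carlo run reproduces the same probabilities. This sketch is my own, not from the book; the participant always starts with door 1, and because the informed host always reveals a goat, switching wins exactly when door 1 was the wrong guess:

```python
import numpy as np

rng = np.random.default_rng(42)
n_games = 100_000
car = rng.integers(1, 4, size=n_games)  # car placed behind door 1, 2, or 3

stay_wins = np.mean(car == 1)    # staying wins only if door 1 was right
switch_wins = np.mean(car != 1)  # with an informed host, switching wins
                                 # whenever door 1 was wrong

print(f"P(win | stay)   = {stay_wins:.3f}")    # ~ 1/3
print(f"P(win | switch) = {switch_wins:.3f}")  # ~ 2/3
```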

This may seem counterintuitive, as if the show producer had somehow read the participant's mind about their initial selection. That switching beats staying indicates the participant's initial choice is correlated with the location of the car, while intuitively these two variables should be independent of each other. What is the explanation? We can resolve this paradox with the following causal diagram:

Image by author based on "The Book of Why" Chapter 6

This causal diagram shows collider bias. Step 2 of the rules dictates that, based on the car's location and the participant's first choice, the host will only open a door with a goat behind it. This rule actually adds information: the fact that the host opened door 3 instead of door 2 makes door 2 more likely to hide the car. Conditioning on the collider (the door the host opens), the participant's first choice, door 1, becomes less likely to be the car's location, which is why switching is the better strategy.

What if we change the rule so that, in Step 2, the host can open any door, even the one with the car behind it? The corresponding outcomes for participants who chose door 1 initially are presented in the following table, covering all possible scenarios:

Table by author based on "The Book of Why" Chapter 6

Now, if we calculate the chances of winning by switching or not switching based on the outcomes in the last two columns, we can see that there is no difference: the winning probability for both strategies is 1/3. Combining this with the causal diagram:

Image by author based on "The Book of Why" Chapter 6

Modifying the rule in Step 2 removes the causal link between the door the host opens and the car's location. Thus, the spurious correlation between the participant's initial selection and where the producer put the car disappears.

In this case, modifying the data-generation process (the rules) points to different strategies, even with the same observed data (the door the host opens).
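Here is the same sketch under the modified rule, again my own illustration of the second table: the ignorant host opens door 2 or door 3 at random and may expose the car, in which case neither strategy can win.

```python
import numpy as np

rng = np.random.default_rng(7)
n_games = 100_000
car = rng.integers(1, 4, size=n_games)     # car location
opened = rng.integers(2, 4, size=n_games)  # ignorant host opens door 2 or 3

car_exposed = opened == car                # game lost either way
stay_wins = np.mean(~car_exposed & (car == 1))
remaining = np.where(opened == 2, 3, 2)    # the door left unopened
switch_wins = np.mean(~car_exposed & (car == remaining))

print(f"P(win | stay)   = {stay_wins:.3f}")    # ~ 1/3
print(f"P(win | switch) = {switch_wins:.3f}")  # ~ 1/3
```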


Revisiting Simpson's Paradox

Simpson's Paradox is another well-known puzzle widely discussed by statisticians. When we observe different relationships between two variables in the population and in its subpopulations, which result should we trust? For example, if a medicine is found harmful for men and for women separately, but beneficial for the population as a whole, should we still approve it?


Pearl argues

the confusion over Simpson's paradox is a result of incorrect application of causal principles to statistical proportions.

By using causal diagrams to illustrate different data-generation processes, we can arrive at vastly different conclusions even when working with the same data. Let's examine an example of a drug designed to reduce the risk of heart attack.

Scenario #1: Trust the subgroups

The table below compares heart attack rates between treatment and control groups, for the two gender subpopulations and for the whole population, using synthetic data. Note that the study is not randomized.

Table by author based on "The Book of Why" Chapter 6

By comparing heart attack rates, we can see the drug appears harmful for both the female and male subgroups but helpful for the pooled population. Which result should we trust? To answer this question, we need the following causal diagram:

Image by author based on "The Book of Why" Chapter 6

The diagram shows gender as a confounder of drug and heart attack; without conditioning on it, the estimated relationship is biased. In other words, the 22% and 18% in the "Total" row are biased estimates.

To calculate the drug's impact on the overall population, we need to estimate its impact within each gender group and then take a weighted average.

Using the data above, we know that in the control group, the female and male heart attack rates are 5% and 30%. Assuming females and males are equally represented in the general population, the control group's adjusted heart attack rate is (5% + 30%) * 0.5 = 17.5%.

Similarly, the treatment group's adjusted heart attack rate is (8% + 40%) * 0.5 = 24%. This is still higher than the control group's, consistent with what we found in the subgroups. Thus, this medicine should not be approved.
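The same backdoor adjustment, written out as a short script using the article's synthetic rates and the assumed 50/50 gender split:

```python
# Backdoor adjustment over gender: average the stratum-specific rates,
# weighted by each gender's share of the target population.
control = {"female": 0.05, "male": 0.30}   # heart attack rates, no drug
treated = {"female": 0.08, "male": 0.40}   # heart attack rates, with drug
population_share = {"female": 0.5, "male": 0.5}

adj_control = sum(control[g] * population_share[g] for g in control)
adj_treated = sum(treated[g] * population_share[g] for g in treated)

print(f"Adjusted rate without drug: {adj_control:.1%}")  # 17.5%
print(f"Adjusted rate with drug:    {adj_treated:.1%}")  # 24.0%
# The drug raises the adjusted risk, so the subgroup verdict stands.
```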

Scenario #2: Trust the aggregation

Now, let's move to another scenario with the same data. The only difference is that instead of separating by gender, we separate by the patients' blood pressure levels.

Table by author based on "The Book of Why" Chapter 6

Here, the data reveal the same contradiction between the separated and aggregated groups. However, a different causal diagram changes our conclusion. The idea is that the drug reduces heart attack risk by lowering the patient's blood pressure:

Image by author based on "The Book of Why" Chapter 6

The causal diagram shows blood pressure as a mediator rather than a confounder. Recall from the last article: controlling for a mediator, or a proxy of one, blocks the causal effect completely or partially.

Thus, pooling the data is appropriate here. We can conclude that the drug is beneficial, as it reduces heart attack risk from 22% in the control group to 18% in the treatment group.
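To see why conditioning on the mediator misleads, here is a minimal simulation sketch. It assumes a pure chain, drug -> blood pressure -> heart attack, with invented effect sizes (the book's example is richer), so it illustrates the blocking effect rather than reproducing the table:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
drug = rng.random(n) < 0.5

# The drug lowers the chance of high blood pressure ...
high_bp = rng.random(n) < np.where(drug, 0.3, 0.7)
# ... and high blood pressure drives heart attacks.
attack = rng.random(n) < np.where(high_bp, 0.30, 0.10)

print(f"Pooled:   treated {attack[drug].mean():.1%} "
      f"vs control {attack[~drug].mean():.1%}")
for bp in (False, True):
    m = high_bp == bp
    print(f"BP {'high' if bp else 'low '}: treated {attack[drug & m].mean():.1%} "
          f"vs control {attack[~drug & m].mean():.1%}")
# Pooled rates show the drug's benefit; within each blood pressure stratum
# the effect vanishes, because stratifying blocks the only causal path.
```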


Simpson's Paradox in Continuous Variables

Now we have seen the importance of looking beyond the data to the data-generation process when deciding whether to aggregate or to separate by groups. Let's see an example that applies the same concept to continuous variables.

The paradox was first presented by Frederic Lord in 1967 while studying the effects of the diets offered by different dining halls, and specifically how a diet affects girls and boys differently.

Each student's weight is measured in September as the initial weight, and again the following June to calculate the weight gained. The result is shown in the following graph:

Image by author based on "The Book of Why" Chapter 6

The two ellipses represent the scatter plots of initial versus final weight for girls and for boys. The graph itself supports contradicting conclusions about the diet's impact on girls and boys:

  • If we focus on the 45-degree line where W_f = W_i, we can see that, on average, the diet has no impact on either girls or boys, because each group's weight distribution is symmetric around W_f = W_i;
  • If we argue that initial weight influences final weight and focus on a single initial weight, such as W_0, we can see that the vertical line intersects the boys' ellipse higher up than the girls'. In other words, on average, boys starting at weight W_0 reach a greater final weight than girls starting at the same initial weight.

So the graph shows that boys gain more weight than girls at every initial weight, yet it is equally obvious that, overall, neither boys nor girls gained anything at all. These two conclusions give different guidance to the school dietitians. Which should they follow? That's where the following causal diagram comes in handy:

Image by author based on "The Book of Why" Chapter 6

When we study whether sex affects weight gain, the causal diagram shows there is no backdoor path, so the aggregated data give the accurate conclusion: the diet has no impact on either girls or boys. If we control for initial weight, we are controlling a mediator, which blocks the part of the causal effect of sex on weight gain that flows through initial weight.

This also shows that the overall gain is not always the average of stratum-specific gains, the caveat behind the sure-thing principle. If girls and boys were equally represented in each stratum, i.e., at each initial weight, then what we observe at each initial weight could be generalized to the whole population. That is clearly not the case here, since girls and boys have different initial-weight distributions.
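A short simulation sketch makes the geometry concrete. All the distributions below are invented for illustration: both sexes gain nothing on average, yet at any fixed initial weight W_0 boys end up heavier than girls.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
# Boys start heavier than girls, on average (assumed means).
w_init_girls = rng.normal(120, 10, n)
w_init_boys = rng.normal(150, 10, n)
# Final weight regresses toward each group's own mean: zero gain on average.
w_final_girls = 120 + 0.5 * (w_init_girls - 120) + rng.normal(0, 8, n)
w_final_boys = 150 + 0.5 * (w_init_boys - 150) + rng.normal(0, 8, n)

print(f"Mean gain, girls: {np.mean(w_final_girls - w_init_girls):+.2f}")  # ~ 0
print(f"Mean gain, boys:  {np.mean(w_final_boys - w_init_boys):+.2f}")    # ~ 0

# Now condition on one initial weight W_0 shared by both groups:
w0 = 135
band_g = np.abs(w_init_girls - w0) < 1
band_b = np.abs(w_init_boys - w0) < 1
print(f"Mean final weight at W_0=135, girls: {w_final_girls[band_g].mean():.1f}")
print(f"Mean final weight at W_0=135, boys:  {w_final_boys[band_b].mean():.1f}")
```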

What if we attach a different narrative to the same data? Instead of observing how sex affects weight gain, suppose different diets are served in different dining halls, and we compare the weight gain of students eating in each hall. The graph now becomes:

Image by author based on "The Book of Why" Chapter 6

We can see that students who go to dining hall B have higher initial weights. The same contradictory conclusions can be drawn, since the data are the same. However, the causal diagram has changed:

Image by author based on "The Book of Why" Chapter 6

Now there is a backdoor path between diet and weight gain. Initial weight has become a confounder that will bias the estimated effect of diet on weight gain if we fail to control for it.

Thus, we should go with the conclusion that takes initial weight into account. From the graph, we can see that, on average, students going to dining hall B gained more weight than students going to dining hall A.

From all these examples, we can see that Simpson's Paradox is not just the numeric reversal we find in real-world data sets all the time. By reversal, I mean that observing A/B > a/b and C/D > c/d does not guarantee (A+C)/(B+D) > (a+c)/(b+d); that is merely a fact of arithmetic inequalities.

However, Simpson's Paradox goes beyond the arithmetic. When facing contradictory conclusions, it is necessary to analyze the causal model and comprehend the data-generating process, rather than rely solely on the observed data.


The long debate over smoking and lung cancer

Besides the real-life examples in Chapter 6 that illustrate the importance of understanding the data-generation process, Pearl devotes Chapter 5 to the causal relationship between smoking and lung cancer to prove the same point.


From the lengthy debate over whether smoking is a causal factor, we can see how the theory developed and how causality was brought to center stage over time.

Studying whether smoking causes lung cancer is challenging for at least three reasons:

  • It is unethical to run an RCT that forces participants to smoke and exposes them to potential health harm;
  • Observational data are likely to contain confounders, since the treatment group can be fundamentally different from the control group;
  • Tobacco companies can put up obstacles and make deliberate efforts to deceive the public about the health risks.

Over the following decades, statisticians made multiple attempts to find the right methodology despite these obstacles.

A case-control study

As Pearl summarizes in his book, it began with a case-control study that compared individuals with lung cancer to a control group of healthy volunteers. The study asked whether the "cases" included more smokers than the control group, adjusting for confounders such as age, sex, and exposure to environmental pollutants.

However, this study faced several drawbacks:

  • A backward logic: instead of answering the probability that a smoker gets cancer, it estimates the probability that a cancer patient is a smoker;
  • Recall bias: even though the researchers made sure patients did not learn their diagnosis before the survey, patients could probably infer it from their own health condition;
  • Selection bias: the study focused only on hospitalized cancer patients, who are not a good representation of the whole population, or even of the smoking population.

Cornfield's inequality

The debate continued as questions arose over whether smokers are "self-selecting": are they genetically different from nonsmokers? Among the concerns Pearl mentions in his book:

They are more risk-taking or more likely to develop addictive behaviors that actually cause adverse health effects.

In other words, the debate comes down to which of the following causal diagrams is correct. Is a "smoking gene" the only reason we observe a relationship between smoking and lung cancer?

Image by author based on "The Book of Why" Chapter 5

Cornfield challenged the left causal diagram using simple math. If the left diagram were true and we observed smokers to be nine times more likely to develop lung cancer, then smokers would have to be nine times more likely than non-smokers to carry the smoking gene.

As illustrated in Pearl's book, referring to Cornfield's argument:

If 11% of non-smokers have the smoking gene, then 99% of the smokers would have to have it. And if we instead observed that 12% of non-smokers have the gene, the required percentage among smokers would exceed 100%, which is mathematically impossible.

In that case, the smoking gene cannot fully account for the relationship between smoking and lung cancer; the right causal diagram provides the more accurate explanation.

Named after its proposer, this argument is called Cornfield's inequality. It sheds light on drawing inferences by evaluating how strong an alternative, non-causal relationship would have to be to explain the observed data.
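The bound itself is one line of arithmetic. A sketch using the numbers from the book's illustration:

```python
# Cornfield's bound: for a confounder alone to produce a risk ratio of 9,
# the gene must be at least 9 times more prevalent among smokers than
# among non-smokers.
observed_risk_ratio = 9.0
gene_rate_nonsmokers = 0.11

required_rate_smokers = observed_risk_ratio * gene_rate_nonsmokers
print(f"Required gene prevalence among smokers: {required_rate_smokers:.0%}")  # 99%

# With 12% among non-smokers, the gene-only story demands a prevalence
# above 100% among smokers, which is impossible.
print(f"With 12% in non-smokers: {observed_risk_ratio * 0.12:.0%}")  # 108%
```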

The birth-weight paradox

I hope we have established a solid enough foundation for the claim that data alone are not enough to understand causality and to explain real-world scenarios that present as paradoxes.

Another example Pearl gives in his book is a study of how mothers' smoking affects their babies' survival rates, which challenged the argument that smoking causes mortality. The findings show that babies of smoking mothers do weigh less at birth. However, the low-birth-weight babies of smoking mothers had a better survival rate than those of non-smokers.

The finding looks counterintuitive: smoking reduces birth weight yet appears to reduce child mortality. However, if we think about what factors other than smoking contribute to a baby's birth weight, we find collider bias at work.

For example, life-threatening genetic abnormalities can also explain low birth weight. We can see this more clearly by drawing the causal diagram:

Image by author based on "The Book of Why" Chapter 5

Here, "smoking -> birth weight <- birth defect" forms a collider structure. Smoking has no correlation with birth defects unless we control birth weight. If we only look at low-birth-weight babies, we are opening up the backdoor path of smoking to mortality. The relationship will be biased by the birth defects to mortality causal path.

Indeed, other factors that point to birth weight can be added to the causal diagram. For example, studies have found that race affects both birth weight and child mortality.

The list can go on; the one thing we need to keep in mind is what actions the causal diagram supports. As Pearl puts it in his book:

We can use this diagram and data to persuade a pregnant woman to stop smoking. But we can't tell a pregnant black woman to stop being black.


Last remarks

Collider bias is everywhere. Have you noticed from your friends' experiences that, among all the people they date, the attractive ones tend to be jerks? Instead of invoking psychosocial theories about trauma and personality disorders, Pearl shows that collider bias can be the answer. Whom your friends choose to date depends on two factors: attractiveness and personality. They will date a mean, attractive person; a nice, unattractive person; and certainly a nice, attractive person. However, they will never date a mean, unattractive person, as illustrated by the following causal diagram:

Image by author based on "The Book of Why" Chapter 6

By looking only at your friends' dating pool, you are controlling a collider, which creates a spurious negative correlation between attractiveness and personality.

As Pearl says, even though unattractive people are, in the whole population, just as mean as attractive ones (attractiveness and personality are uncorrelated), your friends never realize it, because the dating pool never contains anyone who is both mean and unattractive.
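One last simulation sketch, with invented numbers, shows the effect: attractiveness and personality are independent overall, but restricting attention to the dating pool manufactures a negative correlation.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
attractiveness = rng.normal(size=n)
personality = rng.normal(size=n)  # independent of attractiveness

# The selection rule (the collider): date only people who clear a combined bar.
dated = attractiveness + personality > 1.0

print(f"Correlation overall:            "
      f"{np.corrcoef(attractiveness, personality)[0, 1]:+.3f}")  # ~ 0
print(f"Correlation within dating pool: "
      f"{np.corrcoef(attractiveness[dated], personality[dated])[0, 1]:+.3f}")  # negative
```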

This concludes an article full of fun facts and explanations of paradoxes using causal diagrams. Paradoxes arise when the data we observe trick and confuse us. By digging into the data-generation process and drawing the causal diagrams, we can explain these paradoxes, which not only clears up the confusion but also gives us trustworthy guidance for decision-making.


That's all I want to share from Chapters 5 and 6 of "The Book of Why" by Judea Pearl, which completes the fourth article in this "Read with Me" series. I hope this article is helpful to you. If you haven't read the earlier articles, check them out here:

Read with Me: A Causality Book Club

Data Tells Us "What" and We Always Seek for "Why"

Causal Diagram: Confronting the Achilles' Heel in Observational Data

You Can't Step in the Same River Twice

What Makes A Strong AI?

How is Causal Inference Different in Academia and Industry?

If interested, subscribe to my email list to join the ongoing biweekly discussions. As always, I highly encourage you to read, think, and share your main takeaways here or on your own blog.

Thanks for reading.

Reference

The Book of Why by Judea Pearl
