Sneaky Science: Data Dredging Exposed

From pizza to the dark side of research. Image created with Dall·E 3 by the author.

A recent New Yorker headline reads, "They Studied Dishonesty. Was Their Work a Lie?". What's the story behind it? Behavioral economist Dan Ariely and behavioral scientist Francesca Gino, both acclaimed in their fields, are under scrutiny for alleged research misconduct. To put it bluntly, they're accused of fabricating data to achieve statistically significant results.

Sadly, such instances are not rare. Scientific research has seen its share of fraud. The practice of p-hacking – e.g. manipulating data, halting experiments once a significant p-value is achieved, or only reporting significant findings – has long been a concern. In this article, we will reflect on why some researchers might be tempted to tweak their findings. We will show the consequences and explain what you can do to prevent p-hacking in your own experiments.

But before we get into the scandals and secrets, let's start with the basics – a crash course in Hypothesis Testing 101. This knowledge will be helpful as we navigate the world of p-hacking.


Hypothesis Testing 101

Let's recap the key concepts you need to know to fully grasp this post. If you are already familiar with hypothesis testing, including the p-value, Type I/II errors, and the significance level, you can skip this part.

The Best Pizza Test

Let's travel to Naples, the famous Italian city known for its pizza. Two pizzerias, Port'Alba and Michele's, claim they make the best pizza in the world. You're a curious food critic, determined to find out which pizzeria truly deserves the title. To settle it, you decide to host "The Best Pizza Test" (which is essentially just a hypothesis test).

Your investigation starts with two hypotheses:

  • Null Hypothesis (H0): There is no difference in the taste of Port'Alba and Michele's pizzas; any difference observed is due to chance.
  • Alternative Hypothesis (H1): There is a significant difference in the taste of Port'Alba and Michele's pizzas, indicating that one is better than the other.

The test starts. You gather a group of participants and organize a blind taste test. Each participant is served two slices of pizza, one from Port'Alba and one from Michele's, without knowing which is which, and rates both slices on a scale from 0 to 10.

You set a strict alpha level (significance level) of 0.05. This means you're willing to tolerate a 5% chance of making a Type I error, which in this context would be falsely claiming that one pizzeria's pizza is better when it isn't.

After collecting and analyzing the data, you find that participants overwhelmingly preferred Michele's pizza. This is what the score distributions look like:

Image by author.

There are two risks in hypothesis testing:

  • Type I error: There's a small chance (5%, equal to the significance level alpha) that you might be making a mistake by concluding that Michele's pizza is better when, in reality, there is no significant difference. You don't want to unfairly discredit Port'Alba pizzeria.
  • Type II error: On the flip side, there is the Type II error. What if, in reality, Michele's pizza is better, but your test failed to detect that? You'd feel like you missed out on the best pizza!

In a matrix (you can compare it with a confusion matrix):

Type I and Type II error visualized. Image by author.
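To make these two error types concrete, here is a minimal simulation sketch. The taste-score distributions are invented (the same means and spread as in the t-test example further below), not real data. It estimates how often a t-test at alpha = 0.05 falsely detects a difference when both pizzerias are equally good (Type I error), and how often it misses a true one-point difference (Type II error).

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_sim, n = 0.05, 5000, 50

# Type I error: both pizzerias share the same true mean, so H0 is true
false_positives = sum(
    stats.ttest_ind(rng.normal(8.0, 1.5, n), rng.normal(8.0, 1.5, n)).pvalue < alpha
    for _ in range(n_sim)
)

# Type II error: Michele's is truly one point better, so H0 is false
misses = sum(
    stats.ttest_ind(rng.normal(7.5, 1.5, n), rng.normal(8.5, 1.5, n)).pvalue >= alpha
    for _ in range(n_sim)
)

print(f"Estimated Type I error rate:  {false_positives / n_sim:.3f}")  # close to alpha
print(f"Estimated Type II error rate: {misses / n_sim:.3f}")

The Type I error rate hovers around the chosen alpha by construction, while the Type II error rate depends on the true difference, the spread of the scores, and the sample size.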

To be sure of your findings, you calculate the p-value. It turns out to be a tiny number, far less than 0.05. This means the probability of getting such extreme results by chance, assuming the null hypothesis is true, is exceedingly low. We have a winner! Example of calculating the p-value with a two-sample t-test:

import numpy as np
from scipy import stats

# Simulate taste scores (out of 10) for the two pizzerias
np.random.seed(42)
portalba_scores = np.random.normal(7.5, 1.5, 50)
michele_scores = np.random.normal(8.5, 1.5, 50)

# Cap the scores at 10 and round to one decimal place
portalba_scores = [round(min(score, 10), 1) for score in portalba_scores]
michele_scores = [round(min(score, 10), 1) for score in michele_scores]

# Perform a two-sample t-test
t_stat, p_value = stats.ttest_ind(portalba_scores, michele_scores)

# Set the significance level (alpha)
alpha = 0.05

# Compare the p-value to alpha to make a decision
if p_value < alpha:
    print("We reject the null hypothesis: {} < {}".format(round(p_value, 7), alpha))
    print("There is a significant difference in taste between Port'Alba's and Michele's pizzas.")
else:
    print("We fail to reject the null hypothesis: {} >= {}".format(round(p_value, 7), alpha))
    print("There is no significant difference in taste between Port'Alba's and Michele's pizzas.")

Output:

We reject the null hypothesis: 3.1e-06 < 0.05
There is a significant difference in taste between Port'Alba's and Michele's pizzas.

Now that you are familiar with the key concepts (I hope you are not hungry after the pizza story), let's continue with the sneaky part of this post: p-hacking.

Image created with Dall·E 3 by the author.

P-Hacking

In the world of academia, journals serve as the gatekeepers of knowledge. These gatekeepers have a preference for studies with significant results. The pressure to secure a publication spot can be immense. This bias can subtly encourage researchers to engage in p-hacking to ensure their work sees the light of day. Unfortunately, this practice perpetuates an overrepresentation of positive findings in the scientific literature.

P-hacking (aka data dredging or data snooping) is defined as:

Manipulation of statistical analyses or experimental design in order to achieve a desired result or to obtain a statistically significant p-value.

So, in research, it's possible to manipulate your analysis and data to obtain interesting results that you can publish. You keep investigating, analyzing, and sometimes even modifying your data until the p-value is significant. P-hacking is often driven by the desire for recognition, publication, and the allure of significant findings. Researchers may inadvertently or intentionally fall into the p-hacking trap, tempted by the promise of quick acclaim and the pressures of a competitive academic environment.

One easy way of falling into this trap is by extensive exploration of the data. Researchers can find themselves in a situation where the dataset is rich with variables and subgroups waiting to be explored. The temptation to test numerous combinations can be strong. Each variable, each subgroup presents the possibility of finding a significant result. The trap here is cherry-picking – highlighting only those variables and subgroups that support the desired outcome while ignoring the many comparisons made. This can result in misleading and unrepresentative findings.
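To see how easily this happens, here is a minimal simulation sketch with made-up taste scores (the subgroup setup is purely illustrative, not taken from any of the studies discussed here): a control group is compared against 20 subgroups drawn from exactly the same distribution, so any "significant" difference is a false positive.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_subgroups, n = 20, 50

# Control group and 20 subgroups, all drawn from the SAME distribution,
# so every "significant" difference is a false positive
control = rng.normal(8.0, 1.5, n)
p_values = [
    stats.ttest_ind(control, rng.normal(8.0, 1.5, n)).pvalue
    for _ in range(n_subgroups)
]

significant = [i for i, p in enumerate(p_values) if p < alpha]
print(f"{len(significant)} of {n_subgroups} comparisons look 'significant' by chance")

With alpha = 0.05 you expect roughly one spurious hit per 20 comparisons; reporting only those lucky subgroups and staying silent about the rest is exactly the cherry-picking described above.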

Another trap is peeking at the p-values before the experiment has officially ended. This might seem innocent, but it isn't. The pitfall here is the temptation to stop data collection prematurely once that coveted p-value threshold is reached. This can lead to biased and unreliable findings, as the sample might not be representative. And even with a representative sample, stopping at the exact moment the results cross the significance threshold is still a manipulation of the data.
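Here is a minimal sketch of this "peek and stop" behaviour, again with invented scores: both groups come from identical distributions, but because the p-value is checked after every new batch of participants and collection stops as soon as p < 0.05, far more than 5% of the simulated experiments end with a false positive.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_sim = 0.05, 2000
batch_size, max_n = 10, 100

stopped_early = 0
for _ in range(n_sim):
    a, b = [], []
    while len(a) < max_n:
        # Collect another batch from two IDENTICAL distributions (H0 is true)
        a.extend(rng.normal(8.0, 1.5, batch_size))
        b.extend(rng.normal(8.0, 1.5, batch_size))
        # Peek at the p-value and stop as soon as it looks "significant"
        if stats.ttest_ind(a, b).pvalue < alpha:
            stopped_early += 1
            break

print(f"Share of experiments that 'found' an effect: {stopped_early / n_sim:.1%}")
print(f"Nominal significance level: {alpha:.0%}")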

Extensive data exploration and peeking at p-values can easily happen unintentionally. The following practice cannot: in the pursuit of significance, a researcher may encounter moments when the initial outcome measure doesn't yield the desired result. The desire to adjust the primary outcome, subtly shifting the goalposts, can be enticing.

In summary: how do you NOT commit p-hacking? Here are five tips and guidelines:

  • Avoid exploring your data extensively to find significant results by chance. If you are doing multiple comparisons, correct the significance level with the Bonferroni correction (divide alpha by the number of tests).
  • Don't selectively report only the analyses that yield significant p-values. Keep it transparent and also report analyses that did not yield significant results. This also holds true for variables: Don't test multiple variables and report only those that appear significant. In this case you are cherry-picking variables. Even better, you can consider pre-registering your research to declare your hypotheses and analysis plans in advance.
  • HARKing (Hypothesizing After Results are Known): Refrain from forming hypotheses based on the results you've obtained; hypotheses should be defined before data analysis.
  • Avoid stopping data collection or analysis when you achieve a significant result. In line with this: Don't peek at the results before the test has ended. Determine the sample size beforehand. And yes, even looking at the data is wrong!
  • Ensure that you meet the assumptions of your statistical tests. An example is normality of the data. Make sure the data is normally distributed if you use a test that assumes this (you can check this with the Shapiro-Wilk test); the sketch after this list shows this check together with the Bonferroni correction from the first tip.
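
As a minimal sketch of the first and last tips (with invented scores and a hypothetical batch of 20 comparisons), here is how the Bonferroni correction and the Shapiro-Wilk normality check could look in practice:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_comparisons = 20

# Bonferroni correction: divide alpha by the number of comparisons
alpha_corrected = alpha / n_comparisons
print(f"Per-comparison significance level after Bonferroni: {alpha_corrected}")  # 0.0025

# Shapiro-Wilk test: check the normality assumption before relying on a t-test
scores = rng.normal(8.5, 1.5, 50)
stat, p = stats.shapiro(scores)
if p < alpha:
    print("Normality assumption looks violated; consider a non-parametric test.")
else:
    print("No evidence against normality; a t-test is reasonable.")
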
Image created with Dall·E 3 by the author.

True (Horror) Stories

The consequences of p-hacking are severe, as it leads to false positives (null hypotheses rejected when they are actually true). Research is built upon previous research, and every false positive is misinformation for subsequent research and decision-making. It also wastes time, money, and resources on pursuing research avenues that do not truly exist. And maybe the most important consequence: it erodes the trust and integrity of scientific research.

Now, it's time for the scandals… There are many examples of data manipulation, data dredging, and p-hacking. Let's take a look at some true stories.

Diederik Stapel's Fabricated Research

Diederik Stapel was a prominent Dutch social psychologist who, in 2011, was found to have fabricated data in numerous published studies. He manipulated and falsified data to support his hypotheses, leading to a big scandal in the field of psychology.

Before he was caught, he published numerous fake studies (58 papers were retracted, according to Retraction Watch). For example, Stapel faked data purporting to show that thinking about meat makes people less social (this wasn't published). Something that was actually published is a study about how a messy environment promotes discrimination. Of course, this study was retracted when his deception came to light. Still, you can find some of his retracted papers online.

The Boldt Case

A case that is far more shocking than the previous one is that of Joachim Boldt. Boldt was once regarded as a prominent figure in the field of medicinal colloids and had been a strong proponent of using colloidal hydroxyethyl starch (HES) to raise blood pressure during surgical procedures. However, a meta-analysis that deliberately excluded the unreliable data attributed to Boldt unveiled a different story. This analysis revealed that the intravenous administration of hydroxyethyl starch is linked to a significantly higher risk of mortality and acute kidney injury when compared to alternative resuscitation solutions. A headline in the Telegraph read:

Millions of patients have been treated with controversial drugs on the basis of "fraudulent research" by one of the world's leading anesthetists.

As you would expect, the fallout from these revelations was severe. Boldt lost his professorship, and he became the subject of a criminal investigation, facing allegations of falsifying up to 90 research studies.

What made him publish these false results? Why would anyone risk lives for something like a publication or a promotion? Unfortunately, in Boldt's case, we don't know. While he may have provided explanations or statements during investigations or legal proceedings, there is no comprehensive and widely recognized account of his motivations. Here you can find an article about the case, with a timeline included.

Reproducibility Crisis

There are many more horror stories you can find on the web. In general, reproducibility in research is important. Why? It allows for the verification and validation of research findings, ensuring that they are not simply due to chance, errors, or biases. But reproducing results can be challenging. There can be variations in experimental conditions, differences in equipment, or subtle errors in data collection and analysis. (Or someone committed p-hacking, which makes it impossible to reproduce the study.)

Some researchers have tried to show why reproducibility is an issue. A fun example is a study claiming that chocolate helps with weight loss. Despite its intentionally flawed methodology, it was widely published and shared. This incident showcased the pitfalls of weak scientific rigor and sensationalized media coverage.

Challenges also arise when trying to reproduce existing research. In 2011, Bayer's researchers revealed that they could only replicate around a quarter (!) of published preclinical studies. This concerning discovery made many question the reliability of early-stage drug research, emphasizing the pressing need for solid validation processes. Similarly, the field of psychology faced its own issues. To address this, the "Many Labs" project was initiated: a global endeavor in which laboratories worldwide tried to reproduce the same psychological studies, often revealing notable variations in outcomes. Such findings stressed the essential role of collaborative, multi-lab efforts in ensuring that research findings are genuine.

Unfortunately, this is not true. Image created with Dall·E 3 by the author.

Conclusion

This post aimed to open your eyes to the subtle yet big influence of p-hacking on scientific literature. It's a reminder that not every research article can be taken at face value. As curious and critical thinkers, it's essential to approach scientific findings with a discerning eye. Don't blindly trust every publication; instead, seek multiple sources that reinforce the same statement. True knowledge often emerges when different studies converge on a shared truth.

But what if there were a solution to one of academia's most pressing challenges? Join me in my upcoming blog post as I unveil a potential antidote to the issue of p-hacking. Stay tuned!

Related

Embracing the Unknown: Lessons from Chaos Theory for Data Scientists

Ethical Considerations In Machine Learning Projects

Tags: Data Manipulation Hypothesis Testing P Hacking Statistics Thoughts And Theory
