Why and How to Adjust P-values in Multiple Hypothesis Testing

Multiple hypothesis testing occurs when we repeatedly test hypotheses across many features; because each test carries some chance of a false positive, the probability of obtaining one or more false discoveries increases with the number of tests. For example, in the field of genomics, scientists often want to test whether any of thousands of genes show significantly different activity with respect to an outcome of interest. Or whether jellybeans cause acne.
In this blog post, we will cover a few of the popular methods used to account for multiple hypothesis testing by adjusting model p-values:
- False Positive Rate (FPR)
- Family-Wise Error Rate (FWER)
- False Discovery Rate (FDR)
and explain when it makes sense to use them.

Create test data
We will create a simulated example to better understand how various manipulations of p-values can lead to different conclusions. To run this code, we need Python with the pandas, numpy, scipy and statsmodels libraries installed.
For the purpose of this example, we start by creating a Pandas DataFrame of 10,000 features, 9,900 of which (99%) will have their values generated from a Normal distribution with mean = 0, called the Null model. (In the norm.rvs() function used below, the mean is set using the loc argument.) The remaining 1% of the features will be generated from a Normal distribution with mean = 3, called the Non-Null model. These represent the interesting features that we would like to discover.
import pandas as pd
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

np.random.seed(42)

n_null = 9900
n_nonnull = 100

df = pd.DataFrame({
    # Label each feature as coming from the Null or Non-Null model
    'hypothesis': np.concatenate((
        ['null'] * n_null,
        ['non-null'] * n_nonnull,
    )),
    'feature': range(n_null + n_nonnull),
    # Null features ~ Normal(0, 1); Non-Null features ~ Normal(3, 1)
    'x': np.concatenate((
        norm.rvs(loc=0, scale=1, size=n_null),
        norm.rvs(loc=3, scale=1, size=n_nonnull),
    )),
})
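As a quick check (a small illustrative addition, not part of the original example), we can confirm the simulated data has the expected composition:

# Illustrative check of the simulated data
print(df.shape)                         # (10000, 3)
print(df['hypothesis'].value_counts())  # 9900 null, 100 non-null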
For each of the 10,000 features, the p-value is the probability of observing a value at least as large as the one we did, if we assume it was generated from the Null distribution.
P-values can be calculated from the cumulative distribution function (norm.cdf() from scipy.stats), which represents the probability of obtaining a value equal to or less than the one observed. The p-value is then 1 - norm.cdf(), the probability of observing a value greater than the one observed:
df['p_value'] = 1 - norm.cdf(df['x'], loc=0, scale=1)
df
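Before adjusting anything, it is worth sanity-checking these p-values (another small illustrative addition): under the Null model they should be roughly uniform on [0, 1], while Non-Null features should pile up near zero.

# Null p-values should be ~uniform (mean near 0.5);
# non-null p-values should cluster near 0
print(df.groupby('hypothesis')['p_value'].agg(['mean', 'median']))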

False Positive Rate
The first concept is called the False Positive Rate (FPR) and is defined as the fraction of true null hypotheses that we flag as "significant" (these mistakes are also called Type I errors). The p-values we calculated earlier can be interpreted as a false positive rate by their very definition: they are the probabilities of obtaining a value at least as large as a specified value when sampling from the Null distribution.
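In standard confusion-matrix terms (notation not introduced in this post), FPR = FP / (FP + TN): the number of false positives divided by the total number of true nulls. So if we declare every feature with a p-value below some threshold significant, we expect roughly that fraction of the 9,900 null features to be flagged purely by chance.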
For illustrative purposes, we will apply a common (magical