The Colorful Power of Permutation Tests

Machine learning and statistics often seem daunting due to their complex mathematical foundations. However, some concepts, like the permutation test, are surprisingly intuitive and can be understood through simple experiments. Permutation tests are valuable tools for assessing the significance of results in various fields, from psychology to data science. Let's explore this powerful concept through an engaging example: learning the names of colors in German. Here, you play the role of the machine learning model, a sophisticated neural network, except one created the natural way: by giving birth to a child!
The Color Learning Experiment

Imagine I give you a dataset in which each color name is written in ink of the matching color. You study it for a few minutes, memorize the color names, and then I test you. Impressively, you name almost all the colors correctly, making only one small mistake.
But how can we be sure that the German language makes sense or that your brain is functioning properly? One way to check is by using a permutation test.
Shuffling the Labels

The idea behind a permutation test is simple. We take the original dataset and randomly shuffle the color labels to create a new dataset. Now, when you try to learn from this shuffled data, it gives you a headache because there's no meaningful relationship between the colors and their labels. If I test you on this shuffled dataset, you'll likely make many mistakes.
We repeat this shuffling process 100 times, each time counting the number of mistakes you make. Let's say that in one of those 100 attempts, you somehow got all the colors right just by chance. So out of 100 attempts, you performed better on the shuffled data than on the original dataset only once. Your p-value is 0.01, which is below the commonly used threshold of 0.05. So the German language makes sense, and your brain is fully capable of figuring out the association between colors and their German names.
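To make the shuffling step concrete, here is a tiny Python sketch. The color names and ink colors are made up purely for illustration:
import random
# Toy data: German color names and the ink colors they are printed in.
names = ["rot", "blau", "grün", "gelb"]      # the labels you are asked to learn
inks = ["red", "blue", "green", "yellow"]    # the ink colors they are printed in
shuffled_names = names.copy()
random.shuffle(shuffled_names)               # break the real name-ink association
print(list(zip(inks, shuffled_names)))
# e.g. [('red', 'grün'), ('blue', 'gelb'), ('green', 'rot'), ('yellow', 'blau')]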
Calculating the P-Value
The number of times you achieved a better result on the shuffled data compared to the original, divided by the total number of attempts, gives you the p-value. It's that simple and powerful, without the need for complex statistical distributions or fancy tests. The only drawback is that it can be computationally expensive.
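As a rough sketch with made-up mistake counts, the calculation is just a count and a division:
# Hypothetical example: 1 mistake on the original data, and the mistake counts
# on 10 shuffled datasets (shortened from 100 for readability).
original_mistakes = 1
shuffled_mistakes = [7, 9, 5, 6, 0, 8, 7, 6, 9, 5]
# Count how often the shuffled data gave a better (lower) mistake count.
better = sum(m < original_mistakes for m in shuffled_mistakes)
p_value = better / len(shuffled_mistakes)
print(p_value)  # 0.1 here; with 1 better attempt out of 100 it would be 0.01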
Stroop Effect
This mismatched color and text example is also known as the Stroop effect, which is used in psychology to demonstrate executive functions. Executive functions are cognitive processes that allow you to plan, focus attention, remember instructions, and juggle multiple tasks successfully. The Stroop effect reveals the conflict between different information sources in your brain. When the name of a color (e.g., "red") is printed in a color not denoted by the name (e.g., the word "red" printed in blue ink), you experience a delay in reaction time, showing how your brain struggles to process conflicting information.
If you're a native German speaker, you'll likely struggle to name the ink colors because your brain wants to read the words instead. To try it out, say the color of the ink rather than the word itself – it's trickier than it sounds!
I also designed a t-shirt with the Stroop effect for you. It will help you remember this concept every time you look in the mirror and serve as a nice icebreaker for your events and coffee breaks.

Real-World Applications
Permutation tests have diverse applications across various fields. In psychology, they are used to assess the significance of differences between groups or conditions. In genetics, permutation tests help identify significant genetic associations. In machine learning, they can be employed to evaluate the performance of models and the quality of data. By understanding the versatility of permutation tests, readers can appreciate their value in both research and practical settings.
Evaluating Any Machine Learning Model
Now, let's say a data scientist has developed a model trained on your data, and you want to evaluate its performance without knowing what is under the hood.
Here's what you can do:
- Train the model on the original data and record its score.
- Shuffle the labels and train the model again.
- Repeat this process 100 times or more and count how many times the model performs better on the shuffled data compared to the original.
The proportion of times the model performs better on the shuffled data gives you an estimate of the p-value. If this value is high, it indicates that either the model is not learning effectively or the data itself is not informative.
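Here is a minimal from-scratch sketch of that recipe, assuming a scikit-learn style model (LogisticRegression is used purely as a stand-in for "any" model). The built-in permutation_test_score used later in this post wraps the same idea:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # stand-in for "any" model
rng = np.random.default_rng(42)
# Score on the original labels.
original_score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
# Score on shuffled labels, repeated many times.
n_permutations = 100
permutation_scores = []
for _ in range(n_permutations):
    y_shuffled = rng.permutation(y)
    permutation_scores.append(
        cross_val_score(model, X, y_shuffled, cv=5, scoring="accuracy").mean())
# Fraction of shuffles that score at least as well as the original
# (the +1 correction is the usual convention, also used by scikit-learn).
p_value = (np.sum(np.array(permutation_scores) >= original_score) + 1) / (n_permutations + 1)
print(f"original accuracy: {original_score:.3f}, p-value: {p_value:.4f}")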
Python Code Example
No data science post is complete without Python code! Here, I use the Iris dataset (Fisher 1936, CC BY 4.0 license) to demonstrate this concept with two classifiers: a weak classifier and a very strong one.
The idea here is to shuffle the labels so that we remove any possible association between the features and the targets. Then we take the classifier of choice and calculate its performance on the shuffled data using a cross-validation method of choice. We repeat this, let's say, 1,000 times. The fraction of times we get a better score on the shuffled data than on the original, untouched data is an estimate of the p-value. We check this with two different classifiers, one very weak (random) and one very strong. In summary, using sklearn.model_selection.permutation_test_score on the sklearn.datasets.load_iris data, we show that sklearn.dummy.DummyClassifier cannot find any significant statistical association between the input features and the iris types, while sklearn.ensemble.HistGradientBoostingClassifier discovers it.
# Load the libraries.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from matplotlib import pyplot as plt
# The Iris dataset (Fisher 1936, CC BY 4.0 license) is publicly available at:
# https://archive.ics.uci.edu/ml/datasets/iris
Dummy Classifier
I use sklearn.dummy.DummyClassifier with strategy="stratified". At prediction time, it outputs classes at random (following the class frequencies in the training data), so by looking only at its predictions you cannot immediately tell it is a dummy classifier.
# Load the Iris dataset.
X, y = load_iris(return_X_y=True)
# Use StratifiedKFold to make sure all labels are present in each fold.
cv = StratifiedKFold(2, shuffle=True)
# DummyClassifier with the "stratified" strategy predicts all classes at random.
dummy_clf = DummyClassifier(strategy="stratified")
# Compute the permutation test score.
score_dummy, permutation_scores_dummy, pvalue_dummy = permutation_test_score(
    dummy_clf, X, y, scoring="accuracy", cv=cv, n_permutations=1000)
# Display results for DummyClassifier.
print(f"DummyClassifier score: {score_dummy:.4f}, p-value: {pvalue_dummy:.4f}")
DummyClassifier score: 0.3200, p-value: 0.6274
The p-value is much larger than the commonly used threshold of 0.05. This is not a good classifier: it could not find the relationship between the features and the classes. The histogram below confirms that it is a useless classifier, as its score sits right inside the distribution of permutation scores.
# Plot the histogram of permutation scores (blue) and the dummy classifier's score (red).
plt.hist(permutation_scores_dummy, alpha=0.5)
plt.vlines(score_dummy, 0, 300, 'r')
plt.xlabel('accuracy')
plt.title('Histogram of accuracies of permutations in Dummy classifier')

The Real Classifier
This time we use sklearn.ensemble.HistGradientBoostingClassifier, which is much stronger at finding relationships between features and targets. Let's check it out in action.
# Initialize the HistGradientBoostingClassifier.
hgbc_clf = HistGradientBoostingClassifier()
# Compute the permutation test score.
score_hgbc, permutation_scores_hgbc, pvalue_hgbc = permutation_test_score(
    hgbc_clf, X, y, scoring="accuracy", cv=cv, n_permutations=1000, n_jobs=-1)
# Display results for HistGradientBoostingClassifier.
print(f"HistGradientBoostingClassifier score: {score_hgbc:.4f}, p-value: {pvalue_hgbc:.4f}")
HistGradientBoostingClassifier score: 0.9267, p-value: 0.0010
The score is much higher than last time and is close to the maximum possible accuracy of 1. The p-value is 50 times smaller than the threshold of 0.05; in fact, 0.0010 is the smallest p-value attainable with 1,000 permutations. Now let's look at the histogram of permutation scores and compare it to the original score.
# Plot the histogram of permutation scores (blue) and the HGBC classifier's score (red).
plt.hist(permutation_scores_hgbc, alpha=0.5)
plt.vlines(score_hgbc, 0, 250, 'r')
plt.xlabel('accuracy')
plt.title('Histogram of accuracies of permutations in HGBC')

I put the code in a Google Colab notebook, so you can clone it and start having fun right away!
Conclusion
The permutation test is a powerful and intuitive tool for assessing the significance of results in machine learning and statistics. By shuffling labels and comparing performance, we can gain insight into the validity of our models and the quality of our data, all without the need for complex mathematical machinery.
So the next time you encounter a machine learning model, remember the color learning example and the permutation test. It's a simple yet effective way to evaluate performance and make informed decisions in your data science projects.
Thank you for reading! If you liked my writing, follow me and clap to motivate me! You may like my other article as well:
This article was minimally edited by AI using MEditor.