Rethinking Statistical Significance
Statistical significance is today a pillar of scientific inquiry, but recently coming across a plea signed by over 800 researchers in the journal Nature a few years ago brought back some long-standing thoughts of mine: that statistical tests may well be screwing up a good portion of scientific research rather than helping it.
Traditionally, research findings are classified as "significant" or "non-significant" based on a predefined p-value threshold, say 0.05, 0.01, or 0.001 (typical thresholds in the chemical and biological sciences, which certainly differ from those used in other domains). A first problem shows up especially among non-experts, who simply misinterpret the result: a non-significant result doesn't negate the presence of an effect; rather, it indicates a lack of conclusive evidence for it. But beyond this, there are other, more general problems that I will discuss next, which can affect even experts, especially when hidden biases and conflicts of interest are at play, something possibly exacerbated in very competitive fields of science and engineering.
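To make this concrete, here is a minimal sketch in Python (with simulated, purely hypothetical data) of how a perfectly real effect can come out "non-significant" simply because the sample is small:

```python
# Minimal sketch: a real effect can still yield p > 0.05 in a small study.
# Data and numbers are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate a genuine effect: group B is shifted by 0.3 standard deviations.
group_a = rng.normal(loc=0.0, scale=1.0, size=20)
group_b = rng.normal(loc=0.3, scale=1.0, size=20)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Observed difference: {group_b.mean() - group_a.mean():.2f}")
print(f"p-value: {p_value:.3f}")
# With only 20 samples per group, p will often exceed 0.05 even though the
# underlying effect is real: absence of evidence is not evidence of absence.
```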
To me, and to many others as one can track in the discussions available in the literature, papers should present no statistical tests but rather detailed and balanced plots of the core data, possibly even of the raw data, from which readers can distill their own conclusions. In recent years, scientists have raised these and similar ideas, arguing that our reliance on the binary measure of "effect vs. no effect" derived from statistical significance tests may be leading us astray, fostering misinterpretations and, a critical point I think, distorting the scientific landscape.
That's where the "plea" I referred to in the opening paragraph comes in. It consists in a commentary by three researchers in the journal Nature which exposes the concerns of over 800 signatories regarding the misinterpretation and misuse of statistical significance in scientific research. At its core, this critique challenges the conventional practice of categorizing results into binary outcomes based on arbitrary thresholds, such as the commonly used levels for p-values – and not limited to just that exact value or metric, but to any other level and kind of cutoff be it p=0.01 or p=1/10¹⁰, or a confidence interval of 95% or 99.99999%, or… you get the idea… It's all about just not relying blindly on any of these numbers at all, anymore.
At the heart of the issue lies the kind of "dichotomous" and "absolute" thinking that statistical tests bring. And not only among non-specialists, as you may think, but also among specialists when they are affected by other biases. The tendency to interpret non-significant results as evidence of "no difference" or "no effect" is pervasive, but it overlooks the nuances inherent in every statistical analysis and can lead to erroneous conclusions. As a result, one can easily dismiss potentially crucial effects simply because they fail to meet an arbitrary threshold of significance. Check for instance this interesting example of how conclusions can depend on the choice of thresholds and cutoffs on statistical values.
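As a toy illustration of that threshold dependence (made-up numbers, not taken from the linked example), the very same result can be declared "significant" or "non-significant" depending only on the cutoff one happens to pick:

```python
# Minimal sketch: the verdict flips with the cutoff, while the data stay fixed.
# The t statistic and degrees of freedom below are hypothetical.
from scipy import stats

t_stat, df = 2.3, 38
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value

for alpha in (0.05, 0.01, 0.001):
    verdict = "significant" if p_value < alpha else "non-significant"
    print(f"alpha = {alpha}: p = {p_value:.3f} -> {verdict}")
# Here p is about 0.03: an "effect" at alpha = 0.05, "no effect" at 0.01,
# even though nothing about the experiment has changed.
```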
Actions
The plea in Nature doesn't just bring the problem into focus; it also proposes some actions. The most radical, which I can't help but support, is to abandon the concept of statistical significance altogether. Rather, researchers are urged to focus on "effect estimates" and their associated uncertainty. And, I add more explicitly than the authors do, researchers should focus on the transparent presentation of the data, which can be accompanied by possible interpretations but shouldn't contain any "yes or no" indicators computed with a formula or algorithm. That is, scientific articles should present the data as unpolluted as possible, and discuss the authors' interpretations of it only in the accompanying text.
In the same spirit of fostering the use of "effect estimates", the commentary published in Nature proposes replacing "confidence intervals" with "compatibility intervals". This reframing means one only conveys the range of values compatible with the data, again avoiding the trap of thinking in terms of "effect vs. no effect". The goal, once more, is to move away from simplistic yes-or-no decisions based on statistical thresholds and towards a more comprehensive consideration of the evidence at hand.
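As a rough sketch of what this could look like in practice (hypothetical data, and a deliberately simple pooled calculation rather than anything prescribed by the commentary), one can report the estimated difference together with the range of values most compatible with the data, with no verdict attached:

```python
# Minimal sketch: report an effect estimate and a compatibility interval,
# not a pass/fail test against zero. Data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, size=30)
treated = rng.normal(11.0, 2.0, size=30)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
df = treated.size + control.size - 2          # simple pooled approximation
t_crit = stats.t.ppf(0.975, df)

low, high = diff - t_crit * se, diff + t_crit * se
print(f"Estimated difference: {diff:.2f}")
print(f"95% compatibility interval: [{low:.2f}, {high:.2f}]")
# Read as: the range of effect sizes most compatible with the data under
# the model, rather than as evidence for or against "an effect".
```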
Note that although my focus is on data in scientific articles, the same idea probably applies to technical reports, documentation material, etc. Perhaps the only exception is highly standardized protocols that are part of standard operating procedures, which may mandate the use of specific statistical tests. These, however, are usually widely validated and spell out exactly how the tests must be carried out, unlike situations where tests are executed on the go, often without much prior planning of the exact experimental setup, sampling protocol, or dataset size. Admittedly, these are sometimes limited and there's not much one can do about it, for example when trying certain drugs on certain populations.
Effect size as opposed to effect-or-not, and more emphasis on raw data
The proposed shift towards effect size, rather than effect vs. no effect, acknowledges the practical importance of findings irrespective of their statistical significance, and leaves it more "up to the reader" to draw conclusions. Of course, the people running the study have their chance to express their conclusions in written form, with all the arguments they need. But the data is the data.
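For instance, a standardized effect size such as Cohen's d conveys magnitude directly, leaving the judgment of practical importance to the reader. A minimal sketch with hypothetical data:

```python
# Minimal sketch: compute an effect-size estimate (Cohen's d) with no verdict.
# Data are simulated for illustration only.
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) \
                 / (a.size + b.size - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(2)
baseline = rng.normal(50.0, 8.0, size=25)
variant = rng.normal(54.0, 8.0, size=25)

print(f"Cohen's d: {cohens_d(baseline, variant):.2f}")
# The reported number tells readers how large the difference is in standard
# deviation units, regardless of where any p-value falls.
```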
Regarding the point of showing data more widely, I stress that modern computer media allow for a large number of solutions that publishers don't, but should, embrace. As an example of one of these technologies, see the wide array of possibilities for displaying data visually on the web, which I presented recently:
The Most Advanced Libraries for Data Visualization and Analysis on the Web
It is to me unacceptable that in these modern times, when interactive media are so pervasive and easy to code, publications remain stuck with static visualizations of data and do not incorporate interactive graphics. Not only do interactive graphics allow for better exploration of the presented data by rotating and zooming on graphs, but they also allow for operations such as data wrangling (even in web libraries like CanvasXpress, which I analyzed in the article linked above) and the presentation of data that is intrinsically 3D in nature, such as molecular structures.
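As a small example of how little code it takes (here using the Python plotly library rather than the web libraries discussed in that article, and with made-up data), one can export raw measurements as a self-contained interactive 3D figure that readers can rotate and zoom in any browser:

```python
# Minimal sketch: export raw data as an interactive 3D scatter plot in HTML.
# Variable names and data are hypothetical.
import numpy as np
import plotly.express as px

rng = np.random.default_rng(3)
x, y, z = rng.normal(size=(3, 200))  # three hypothetical variables per sample

fig = px.scatter_3d(x=x, y=y, z=z, opacity=0.6,
                    labels={"x": "variable 1", "y": "variable 2", "z": "variable 3"})
fig.write_html("interactive_scatter.html")  # shareable, self-contained HTML file
```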
Conclusion
While the proposal to retire statistical significance altogether may seem radical, it reflects a growing recognition of the shortcomings inherent in traditional approaches to statistical analysis. By promoting transparency, a deeper appreciation for uncertainty, and, I would say, even humility (especially regarding scientific research in highly competitive areas), researchers could probably foster a more rigorous, reliable, and open-minded scientific discourse.
The goal of the (kind of) paradigm shift proposed by the authors of the Nature commentary is that by prioritizing the magnitude of effects, regardless of statistical significance, we can achieve a more balanced and less biased understanding of research findings, benchmarks, or any other kind of comparison you might need to make, in any field. Moreover, contextual factors, such as study design and prior knowledge, can then play a larger role in interpreting the results.
The call for this reform extends beyond statistical measures to encompass the broader landscape of scientific reporting. By detailing estimates and uncertainties, and by avoiding rigid significance thresholds, researchers could present a more comprehensive and objective portrayal of their findings. While challenging established norms may provoke apprehension, I think that the potential benefits (enhanced accuracy, better backed-up conclusions, and more informed decision-making) far outweigh the risks.
Key references
If you want one specific example of how conclusions can depend on strict statistical cutoffs (out of the tons you can find out there!), check this interesting article.
The plea in Nature: Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567, 305–307. https://www.nature.com/articles/d41586-019-00857-9