What Is Scientifically Significant?
A common measure in scientific papers is under fire, possibly invalidating thousands of research papers.
It’s called the “P-value.” Any result with less than a 5% probability of arising by chance alone (p < 0.05) is called “statistically significant.” The 5% threshold was an arbitrary convention popularized by statistician and geneticist Ronald Fisher, as a way of expressing the importance of a scientific test in a single mathematical value. Thousands of research papers have boasted of their P-values to lend an air of reliability to conclusions drawn from tests that appear to confirm their hypotheses.
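To make the idea concrete, here is a minimal sketch (not from the article) of computing a P-value for a simple coin-flip experiment: the probability, assuming the null hypothesis of a fair coin, of seeing a result at least as extreme as the one observed.

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p: float = 0.5) -> float:
    """One-sided p-value: the probability of seeing at least `heads` heads
    in `flips` tosses, assuming the null hypothesis (a fair coin) is true."""
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

# Suppose we observe 60 heads in 100 flips: is the coin biased?
p_val = binomial_p_value(60, 100)
print(f"p = {p_val:.4f}")  # ≈ 0.028, just under Fisher's 0.05 cutoff
```

Note that this number says nothing about how *large* or how *important* the effect is, only how surprising the data would be if chance alone were at work.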
Now, many scientists want to toss it – not just because the 5% threshold is arbitrary, but because cheaters can fudge it. If you run your test enough times, you will eventually get a P-value below 0.05 by random chance alone. If you have strong motivation to confirm a pet theory, which run are you likely to report? Conversely, genuine effects may have gone unpublished simply because a test fell short of the threshold.
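The “run it until it works” problem can be demonstrated with a short simulation (a sketch, not from the article): test pure noise many times and count how often chance alone crosses the 0.05 line.

```python
import random
from math import erf, sqrt
from statistics import mean, stdev

random.seed(42)

def p_value_two_sided(sample, mu0=0.0):
    """Rough two-sided z-test p-value against mean mu0 (normal
    approximation, adequate for illustrating the point with n = 30)."""
    z = (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Run 1,000 experiments on pure noise: no real effect exists in any of them.
trials = 1000
hits = sum(p_value_two_sided([random.gauss(0, 1) for _ in range(30)]) < 0.05
           for _ in range(trials))
print(f"{hits} of {trials} null experiments came out 'significant'")
```

Roughly 5% of these null experiments clear the threshold, so a researcher who quietly runs twenty variations of a test and reports the best one has a good chance of “finding” an effect that does not exist.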
It’s time to talk about ditching statistical significance (Nature). The Editors of the world’s most famous science journal argue that “truth cannot be revealed by a single number.”
And yet this is the job often assigned to P values: a measure of how surprising a result is, given assumptions about an experiment, including that no effect exists. Whether a P value falls above or below an arbitrary threshold demarcating ‘statistical significance’ (such as 0.05) decides whether hypotheses are accepted, papers are published and products are brought to market. But using P values as the sole arbiter of what to accept as truth can also mean that some analyses are biased, some false positives are overhyped and some genuine effects are overlooked.
Scientists rise up against statistical significance (Amrhein, Greenland and McShane, Nature). These three scientists are even more adamant. They call misuse of ‘statistical significance’ a “pervasive problem” in science. About half of 791 studies published in five journals, they show, misconstrued ‘non-significant’ as meaning ‘no effect.’ The writers blame human nature for loving to sort things into bins, as if the categories cannot overlap.
The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different.
The same problem afflicts any system, they argue, that dichotomizes measures into either-or categories. And the writers are not alone, they say: many scientists have signed onto their criticism of P-value abuse.
We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence.
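One way to read such an interval – sketched here with hypothetical data and a normal approximation – is as the range of effect sizes most compatible with the observations, rather than a yes-or-no verdict.

```python
from math import sqrt
from statistics import mean, stdev

def compatibility_interval(sample, z=1.96):
    """95% interval for the mean (normal approximation): the range of
    population means most compatible with the observed data."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    return (m - z * se, m + z * se)

# Hypothetical measurements of some effect size:
data = [2.1, 1.8, 2.4, 2.0, 1.9, 2.3, 2.2, 1.7, 2.5, 2.1]
lo, hi = compatibility_interval(data)
print(f"observed mean {mean(data):.2f}, compatible with {lo:.2f} to {hi:.2f}")
```

Reporting the whole range keeps the uncertainty visible; collapsing it to “significant or not” throws that information away.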
The Editors and these writers do not recommend abandoning P-values as a statistical tool, but want “an end to their use as an arbitrary threshold of significance.” That was enough, though, to launch a vigorous debate. Scientists responded with differing opinions:
Retiring statistical significance would give bias a free pass (John Ioannidis, Nature). The well-known “meta-researcher” (researcher of researchers) thinks that “irrefutable nonsense would rule” if the P-value were jettisoned. Changes to the threshold could be argued, he says. But “although the obstacle of statistical significance can be surmounted by trickery, removing it altogether is worse.”
Raise the bar rather than retire significance (Valen Johnson, Nature). Johnson agrees with Ioannidis, but thinks ‘statistical significance’ conflates two issues, the main one being whether an association exists between an alleged cause and its effect.
Retire significance, but still test hypotheses (Haaf and Wagenmakers, Nature). These two scientists agree with the Editors that P-values deserve retirement, but some kind of testing (they don’t suggest any other method) is necessary to separate a result from randomness. “Without the restraint provided by testing, an estimation-only approach will lead to overfitting of research results, poor predictions and overconfident claims.”
Testability is one of the strengths of science – perhaps its greatest strength. Scientists despise subjectivity. They want a number to give to the press and stakeholders, showing that their results are objective: i.e., better than mere opinion or guesswork. They want to advertise statistical significance with a number, because a number carries authority. You can trust their results, many scientists feel, because they were tested, and the results passed the test. But how do you test a hypothesis without some kind of objective measurement? A scientist can’t say, ‘Well, I think this result is significant. To me it seems noteworthy.’
This is not an either-or debate. Sometimes P-values help; sometimes they can be fudged. The point is that this bulwark of scientific objectivity is subjective! It’s a convention, an agreed-on standard, subject to change. It’s a bit like having a Constitution for a social contract. Sometimes amendments are sufficient; sometimes a complete overhaul is required.
Science is not “out there” as something that humanity receives from Amazon in a box, opens up, and follows the instructions. Humans make up the instructions as they go! They test the usefulness of the “constitution” of science as they try to make sense of the world. Science is a human invention, based on the belief that the world is objectively real, and governed by regularities. Much of the time a P-value works OK as an arbitrary threshold, but when abuses proliferate, amendments may be needed. Even then, fallible humans can find ways around the new rules. Meanwhile, honest scientists can fall prey to interpreting the rules as objective, non-overlapping categories: e.g., “statistically significant” vs “statistically insignificant.”
Another comparison with government should be considered. John Adams wrote, “Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.” We see that today with groups that violate the Constitution with reckless abandon, and courts that rule things “constitutional” or “unconstitutional” for political reasons. It’s the same with science. Give science to a den of thieves, and they will abuse it; you will not learn anything reliable about external reality. You will learn just how desperately wicked the human heart can be.
Exercise: Compare the scientists’ quandary about statistical significance with grading practices in schools. Most schools use an A-B-C-D-F system, but what other systems could be devised? What if F were highest, and A were lowest? What happened to “E”? Why use just 5 levels of grades; why not 6 or 7, a dozen or a hundred? Why not use numbers? Why not just pass or fail? (These arbitrary rules are actually used in some contexts.) Try inventing your own grading system. Can you make it foolproof? What subjective factors enter into grading (such as grade inflation, grading on the curve, teacher’s pets, etc.)? Is any grading system completely free of bias or abuse? What are the moral preconditions for a grading system to work?