*The following is a guest post from Jian Gao, PhD, Director of Analytical Methodologies, Office of Productivity, Efficiency and Staffing (OPES), Department of Veterans Affairs.*

For decades we were admonished that saturated fat was bad for our health, but now we are told that sugar, rather than fat, is the culprit. One day, moderate alcohol consumption is touted as wholesome; the next, any amount is deemed detrimental. Drinking coffee was once found to cause pancreatic cancer, but today it is not a problem. Right now, experts are battling over the choice of CABG or angioplasty for patients with coronary heart disease. And most recently, you have surely seen the headlines surrounding the controversies over COVID-19 treatments.

The irony is that these recommendations are all backed by “solid” evidence, i.e., research findings. Why are research findings so often inconsistent and even contradictory? Of course, the causes are multifactorial (e.g., confounders in observational studies and small sample sizes in controlled trials). But one common factor stands out: the misunderstanding of p-values.

In research, the p-value is a practical tool that helps to guard against being fooled by randomness. It gauges the “strength of evidence” against the null hypothesis (that a treatment does not work or is ineffective). For example, a p-value of 0.01 indicates stronger evidence that the treatment works than a p-value of 0.05 does; however, by design, it cannot tell by how much – there is no basis to infer that the former is five times more likely to work than the latter, or anything of the sort. Unfortunately, the p-value has been widely misinterpreted as the probability that a treatment does not work or that a finding is due to chance. As a result, the p-value has become the sine qua non for deciding whether a study finding is real or a fluke, whether a paper will be accepted or rejected, whether a grant will be funded or declined, whether a treatment is effective or harmful, and whether a drug will be approved or denied by the FDA.
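To make the distinction concrete, here is a hypothetical simulation sketch (pure Python, with assumed sample sizes and a simple two-sided z-test): when a treatment truly does nothing, p-values below 0.05 still turn up in about 5% of trials. That 5% is the false-alarm rate *given* the null is true, which is not the same thing as the probability that the null is true given the data.

```python
import math
import random

random.seed(42)

def false_alarm_rate(n=50, trials=10_000):
    """Simulate trials in which the 'treatment' truly does nothing:
    both groups are drawn from the same N(0, 1) distribution.
    Returns the fraction of trials with p < 0.05 under a two-sided
    z-test on the difference in means (known unit variance)."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
        hits += p < 0.05
    return hits / trials

print(false_alarm_rate())  # close to 0.05 by construction
```

The sample size `n` and the z-test are arbitrary choices for illustration; the point is only that the 5% threshold describes how often chance alone clears the bar, not how likely the finding is to be a fluke.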

Of course, a “good” p-value has since turned into a “bad” magic number that researchers chase after. And worse yet, p-hacking has become commonplace – just as Andrew Vickers described:

It was just before an early morning meeting, and I was really trying to get the bagels, but I couldn’t help overhearing a conversation between one of my statistical colleagues and a surgeon.

Statistician: “Oh, so you have already calculated the P values?”

Surgeon: “Yes, I used multinomial logistic regression.”

Statistician: “Really? How did you come up with that?”

Surgeon: “Well, I tried each analysis on the SPSS drop-down menus, and that was the one that gave the smallest P value.”

Setting aside the selective reporting of research findings with low p-values, most researchers are unaware that the p-value, even when calculated legitimately, is not the probability that a treatment does not work. For a p-value of 0.05, the chance a treatment (e.g., reducing dietary fat intake) does not work is not 5%; rather, it is at least 28.9%, as elucidated in a recently published article. Given that the p-value has been in use for 100 years, why are its misunderstanding and misuse still rampant? Well, the answer lies in the Qs and As posted by George Cobb on the American Statistical Association (ASA) discussion forum:

Q: Why do so many colleges and grad schools teach p = 0.05?

A: Because that’s still what the scientific community and journal editors use.

Q: Why do so many people still use p = 0.05?

A: Because that’s what they were taught in college or grad school.

Regardless, our statistical education in schools falls short: how many students understand the subtle yet fundamental differences between significance testing and hypothesis testing after taking statistics courses? Most otherwise excellent statistics textbooks pay too little attention to these fundamentals. On the other hand, few medical journals are willing to see reduced manuscript submissions as a result of changing the use of p-values that established investigators are accustomed to.

Since the publication of the p-value statement by the American Statistical Association in 2016, more and more statisticians and scientists have become aware of the p-value fallacy. Unfortunately, little change has been seen so far in the medical field. It appears that merely repeating that the p-value is not the chance of a research finding being a fluke can hardly get anywhere. When seeing p-values, users inevitably fall back on reading them as the probability a treatment will not work, unless they understand how vastly different the two quantities are.

In short, the misunderstanding of p-values spells disaster – for a p-value of 0.05, nearly one in three research findings (or more) are flukes, rather than one in 20 as commonly believed. Now you can see why there is so much contradictory health advice from experts.

What is the fix? Well, as with treating any malady, we ought to work on the root causes: it is time to go back to school. The difference between the p-value and the probability that a treatment does not work needs to be taught in classrooms. Students, researchers, and clinicians need to clearly understand what p-values are and what they are not. And even that is not enough — without knowing the probability that a given treatment will or will not work, researchers and clinicians fly blind.

Thanks to James Berger and colleagues’ groundbreaking work, the calibrated p-value was established to inform us of the probability that a treatment does not work or that a research finding is a fluke. Based on the calibrated p-values, for example, for p-values of 0.1, 0.05, and 0.01, the chances a treatment does not work are at least 38.5%, 28.9%, and 11.1%, respectively — strikingly different from what is commonly believed: 10%, 5%, and 1%.
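The calibration behind these figures is simple enough to compute by hand: the lower bound on the Bayes factor in favor of the null is B(p) = −e·p·ln(p) (valid for p < 1/e), which converts to a lower bound on the posterior probability that the null is true. A minimal sketch, assuming equal prior odds on the null and the alternative (the function name is mine, not from the cited work):

```python
import math

def calibrated_null_probability(p, prior_null=0.5):
    """Lower bound on P(null is true | p), via the calibration
    B(p) = -e * p * ln(p), valid for 0 < p < 1/e.
    Assumes 50/50 prior odds on the null by default."""
    if not 0 < p < 1 / math.e:
        raise ValueError("calibration requires 0 < p < 1/e")
    bayes_factor_bound = -math.e * p * math.log(p)  # lower bound on BF(null/alt)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor_bound
    return posterior_odds / (1 + posterior_odds)

for p in (0.1, 0.05, 0.01):
    print(f"p = {p:.2f}  ->  P(treatment does not work) >= "
          f"{calibrated_null_probability(p):.1%}")
```

Running this reproduces the figures quoted above: at least 38.5%, 28.9%, and 11.1% for p-values of 0.1, 0.05, and 0.01, respectively.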

Taken together, there is no reason for schools not to teach and medical journals not to use the calibrated p-values, given they are informative and easy to calculate.

For further reading on this topic, a recently published article endeavored to put all the loose ends together.