Empirical studies provide facts to help guide policy and clinical medicine, or at least they do when the study is done well. Good studies are reproducible: any competent scientist should be able to carry out the same procedures and reproduce my results.
But there is a crisis across science in the reproducibility of results. There are concerns about the replicability of studies in psychology, in economics, and in biology:
Scientists [at] Amgen… tried to confirm… fifty-three papers [that] were deemed ‘landmark’ studies. It was acknowledged… that some of the data might not hold up, because papers were deliberately selected that described something completely new, such as fresh approaches to targeting cancers… Nevertheless, scientific findings were confirmed in only 6 (11%) cases. Even knowing the limitations of preclinical research, this was a shocking result.
There is also concern about the replicability of data analyses in clinical medicine, which is a bit surprising. Statistical theory is mathematical, but statistical computation boils down to lots of arithmetic. And arithmetic is the paradigm case of reproducibility.
And yet statistical analyses of medical or policy data are not always reproducible. Shanil Ebrahim and his co-authors (including John Ioannidis) report on the reproducibility of randomized clinical trials (RCTs) in a recent JAMA article. RCTs are often relatively simple in design and I would expect that the analysis of an RCT should be reproducible, if anything is.
Ebrahim et al. searched the literature for cases where the same RCT data had been analyzed more than once to answer the same question.
We identified 37 eligible reanalyses in 36 published articles… Reanalyses differed most commonly in statistical or analytical approaches (n = 18) and in definitions or measurements of the outcome of interest (n = 12). Four reanalyses changed the direction and 2 changed the magnitude of treatment effect, whereas 4 led to changes in statistical significance of findings. Thirteen reanalyses (35%) led to interpretations different from that of the original article, 3 (8%) showing that different patients should be treated; 1 (3%), that fewer patients should be treated; and 9 (24%), that more patients should be treated.
The authors correctly present 35% as an upper bound on the irreproducibility of RCT analyses:
many other reanalyses might have been performed that were never published, especially those with results and conclusions identical to those of the original article. Authors of confirmatory reanalyses may… have difficulty publishing their article because many journals may not consider it interesting. Thus, our observed estimate of different conclusions (35%) is probably an overestimate.
What’s important is that reproducibility is less than 100%. Harlan Krumholz and Eric Peterson believe that the problem is pervasive:
…there is evidence from trials that data presented to the US Food and Drug Administration (FDA) may differ in important ways from those originally presented at scientific sessions or published in medical journals. Rising et al assessed clinical trial information provided to the FDA and reported a 9% discordance between the conclusions in the report to the FDA and in the published article. Not unexpectedly, all were in the direction favoring the drug… Hartung et al showed that in a random sample of phase 3 and 4 trials, in 15% the primary end point in the main article was different from the primary end point the trialists reported in ClinicalTrials.gov. Moreover, 22% reported the primary outcome value inconsistently, with some even having differences in the number of deaths.
Krumholz and Peterson recommend that
raw data and metadata (all the information about the data) from the original trial should ideally be made available to those who seek the opportunity to replicate the findings.
What would the effects of this be? Krumholz and Peterson are optimistic:
Such independent verification [through replicated analyses] would markedly increase the scientific community’s confidence in the study findings.
Maybe and maybe not. In policy-sensitive research, partisan analysts would pore over the data, seeking to crucify the authors for small errors, or to exploit the data for post hoc subgroup comparisons. These outsiders would make many errors because they would not understand crucial parts of the methodology. The storm of competing analyses might reduce public confidence in empirical studies.
On balance, however, I think Krumholz and Peterson are right. The knowledge that others would have the opportunity to reanalyze one’s data would make one very careful about the analysis. This care would lead to more highly reproducible analyses.
And sometimes, the reanalysis of data will lead to a better analysis and conclusions that are closer to the truth. Or perhaps there is uncertainty about the best analytical strategy for a given data set. If there are plausible alternate analytical choices that lead to significant differences in a study’s conclusions, we need to know that when we decide how much weight to put on the study.
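To make the point concrete, here is a minimal sketch, using entirely made-up numbers (not data from any real trial), of how two plausible analytical choices applied to the same outcome data can reverse the apparent direction of a treatment effect:

```python
# Hypothetical outcome scores for a tiny two-arm comparison.
# The value 50.0 represents one extreme responder in the treatment arm.
treatment = [1.0, 2.0, 3.0, 50.0]
control = [4.0, 5.0, 6.0]

def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)

# Analysis 1: include every patient as randomized.
effect_all = mean(treatment) - mean(control)      # 14.0 - 5.0 = +9.0

# Analysis 2: a defensible-sounding rule that excludes extreme outliers.
trimmed = [x for x in treatment if x < 10.0]
effect_trimmed = mean(trimmed) - mean(control)    # 2.0 - 5.0 = -3.0

# Same data, opposite conclusions about the direction of effect.
print(effect_all, effect_trimmed)
```

Both rules could be argued for in good faith, yet one says the treatment helps and the other says it harms; this is exactly the kind of analytic sensitivity a reader needs to know about when weighing a study.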