In the comments, Emily has questioned whether the power calculations we have done for the Oregon Health Insurance Experiment (OHIE) add anything useful to the discussion. (For those calculations look here, here, and here.) She suggests that we might consult with experts to address this question, pointing us to the work of Hoenig and Heisey (ungated PDF here). Related work has been published by Goodman and Berlin. Both papers describe limitations of post-experiment power calculations.

I have emailed these authors to solicit opinions on our work. I have also emailed several other experienced biostatisticians recommended to me by colleagues. Though not all of these authors and experts responded to my inquiry, those who did could not point to any problems with the type of power calculation we have offered as most relevant and legitimate. (Ironic full disclosure: I am not fully disclosing who replied and said what because I did not obtain explicit authorization to do so. See below for some direct quotes attributed to other experts.)

One problematic use of post-experiment power calculations is a particular method of attempting to pin the blame for statistically insignificant findings on sample size. One does this by computing power for a study’s statistically insignificant point estimate, a calculation that is *guaranteed* to show underpowering. This is not, by itself, useful. It’s not, by itself, a test of the power of the study. It’s merely a re-expression of the statistical insignificance of the finding.
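To see why this calculation adds nothing, here is a minimal sketch in Python, assuming a two-sided z-test at the 5% level. The test, the function names, and the numbers are illustrative assumptions, not values from the OHIE analysis. Power evaluated at the observed estimate depends only on the observed z-statistic, so it is simply a transformation of the p-value, and it cannot rise much above 50% when the result is statistically insignificant.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_two_sided_z(effect, se, alpha=0.05):
    """Power of a two-sided z-test when the true effect equals `effect`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z = abs(effect) / se
    # Probability that the test statistic lands in either rejection region.
    return STD_NORMAL.cdf(z - z_crit) + STD_NORMAL.cdf(-z - z_crit)


def p_value_two_sided(estimate, se):
    """Two-sided p-value for the null hypothesis of zero effect."""
    return 2 * (1 - STD_NORMAL.cdf(abs(estimate) / se))


# Hypothetical insignificant estimates, each with a standard error of 1.
for estimate in (0.5, 1.0, 1.5, 1.9):
    p = p_value_two_sided(estimate, 1.0)
    observed_power = power_two_sided_z(estimate, 1.0)
    print(f"z = {estimate:.1f}  p = {p:.3f}  'observed power' = {observed_power:.3f}")
```

Every row has p above 0.05 and "observed power" below one half, so the second column tells us nothing the first did not.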

What is more useful is to compute power for the effect size *one expected before the study was done*. Doing so exploits the “by itself” caveats in the prior paragraph, as it brings in new information not depending on the study in question but on prior work. In fact, one can do this before the study, and one should. This *is* a test of the study’s detection ability. Of course, one can do the same calculation after the study is done, which doesn’t make it any less legitimate, though a post-experiment version has less to offer since we are also informed by the study’s confidence intervals.
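As a rough sketch of this second kind of calculation, again assuming a two-sided z-test: take the effect size expected from prior literature and the standard error the study achieved, then compute power at that effect. The numbers below are hypothetical placeholders, not OHIE values. The same ingredients also give the smallest effect the study could detect with 80% power.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_at_effect(expected_effect, se, alpha=0.05):
    """Power of a two-sided z-test if the true effect equals `expected_effect`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z = abs(expected_effect) / se
    return STD_NORMAL.cdf(z - z_crit) + STD_NORMAL.cdf(-z - z_crit)


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true effect detectable with the requested power.

    Uses the usual approximation that ignores the negligible chance of
    rejecting in the wrong direction.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) * se


# Hypothetical inputs: an effect of 2.0 units expected from earlier studies,
# and a standard error of 1.5 units achieved by the study.
expected_effect = 2.0
achieved_se = 1.5

print(f"power at expected effect: {power_at_effect(expected_effect, achieved_se):.2f}")
print(f"minimum detectable effect at 80% power: {minimum_detectable_effect(achieved_se):.2f}")
```

Unlike the calculation at the observed estimate, the answer here depends on information from outside the study, which is what makes it a real test of detection ability.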

So, what does a post-experiment power calculation based on expected effect sizes from pre-study literature offer beyond what can be inferred from confidence intervals? First, there is a scientific contribution: it conveys the sample size needed for future studies, given a specified false negative rate (or power level). It cannot be denied that this is of *some* value.

Second, it can be more accurate in estimating that sample size than a pre-study power calculation could be. That’s because after the study one can incorporate study features that could not have been known in advance. I’m specifically thinking of the degree of treatment-control (in this case, randomized to Medicaid and not) cross-over. This is precisely what leads to reduction in power due to the instrumental variables design. One could guess in advance what that power reduction would be. But after the study, one knows precisely what it is, as I’ve shown. If one wanted to design the next, similar study, one would absolutely want to incorporate this effect. And we have.
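The sample size implication of cross-over can be sketched with a textbook approximation, which is offered for illustration and is not necessarily the method behind the calculations discussed here. Suppose randomization raises the coverage rate by a fraction `compliance` (the first stage). Then an effect of `delta` among those who gain coverage shows up as a diluted difference of only `compliance * delta` between the randomized groups, and the required sample size grows by a factor of 1 / compliance^2. All names and numbers below are hypothetical.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def required_n_per_arm(delta, sigma, compliance=1.0, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-arm comparison of means.

    `delta` is the effect among those actually treated, `sigma` is the outcome
    standard deviation, and `compliance` is the difference in treatment rates
    between arms (treated here as known and fixed, which is a simplification).
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    diluted_effect = compliance * delta  # difference seen between randomized arms
    return 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / diluted_effect ** 2


# Hypothetical: a 0.2 standard deviation effect under three compliance rates.
for compliance in (1.0, 0.5, 0.25):
    n = required_n_per_arm(delta=0.2, sigma=1.0, compliance=compliance)
    print(f"compliance = {compliance:.2f}  n per arm = {ceil(n)}")
```

Halving the compliance rate quadruples the required sample, which is why the realized degree of cross-over matters so much when sizing a follow-up study. The sketch ignores sampling uncertainty in the first stage itself.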

Third, there is a rhetorical contribution: it is, in part, a re-expression of something that can be inferred from the confidence interval, namely whether the study was powered to detect the expected effect size. We appreciate, and I have just stated, that this is not a direct, novel scientific contribution. But conveying scientific results in ways that may be better understood or appreciated by a wider audience is part of the dissemination mission. Sometimes saying the same thing in a different way is helpful. So long as such a transformation does not misrepresent the work, it is not harmful. It is not a worthless exercise. As I’ve said, our work does not misrepresent the study’s findings, and the experts I corresponded with did not contradict that. Nor, by the way, did the study authors, with whom we shared our work at multiple stages of progress.

In an exchange about power calculations in general, Alan Zaslavsky, Professor of Health Care Policy, an Affiliate of the Statistics Department at Harvard, and a leading expert in his field, wrote me:

[A power calculation] might be relevant as part of a review of what the investigators might reasonably have anticipated [in advance]. For example after such an analysis one might say that a study should have been larger, although in this case that option was unavailable, or that it was so unlikely to detect effects of plausible sizes for some outcomes that they should not have been included in the analytic plan.

In a post on The New York Times website that cites our work on this (though not the latest and greatest version of it), Casey Mulligan agreed the study was underpowered and explained the implications for interpretation.

The only way the study could have found a statistically significant result here would be for Medicaid to **essentially eliminate** an important symptom of diabetes in a two-year time frame. Medicaid coverage could be quite valuable without passing that standard (even the Supreme Court has looked at this issue and concluded that statistical significance is not the only reliable standard of causation). The authors of the study appear to be aware of these issues, because they note toward its end, “**Our power to detect changes in health was limited by the relatively small numbers of patients with these conditions.**” […] If the Oregon study prevents even one state from expanding its Medicaid program, Affordable Care Act proponents could assert that [] emphasis on statistical significance has proven to be deadly. Even if you think, as I do, that the law has fatal flaws, the Oregon study of Medicaid is not the place to find them. [Bold added]

As is clear from Mulligan’s words, the study authors themselves recognized the power limitations our calculations illuminate.

Given all this, I do think it is safe to say that experts agree that the OHIE study was objectively underpowered for expected effect sizes on physical health measures (perhaps excluding the Framingham risk score, for which there is insufficient prior information to draw a conclusion). I do think it is deceptive to comment on or draw inferences from those results of the study without explicitly acknowledging this fact. Being upfront about this limitation of this study is all we are trying to persuade the world to do. At this point, not doing so gives the impression of deliberately trying to mislead, which is certainly not something we can tolerate here.