• Truth and power

    In the comments, Emily has questioned whether the power calculations we have done for the Oregon Health Insurance Experiment (OHIE) add anything useful to the discussion. (For those calculations look here, here, and here.) She suggests that we might consult with experts to address this question, pointing us to the work of Hoenig and Heisey (ungated PDF here). Related work has been published by Goodman and Berlin. Both papers describe limitations of post-experiment power calculations.

    I have emailed these authors to solicit opinions on our work. I have also emailed several other experienced biostatisticians recommended to me by colleagues. Though not all of these authors and experts responded to my inquiry, those who did could not point to any problems with the type of power calculation we have offered as most relevant and legitimate. (Ironic full disclosure: I am not fully disclosing who replied and said what because I did not obtain explicit authorization to do so. See below for some direct quotes attributed to other experts.)

    One problematic use of post-experiment power calculations is a particular method of attempting to pin the blame for statistically insignificant findings on sample size. One does this by computing power for a study’s statistically insignificant point estimate, a calculation that is guaranteed to show underpowering. This is not, by itself, useful. It’s not, by itself, a test of the power of the study. It’s merely a re-expression of the statistical insignificance of the finding.

    What is more useful is to compute power for the effect size one expected before the study was done. Doing so addresses the “by itself” caveats in the prior paragraph, because it brings in new information that depends not on the study in question but on prior work. In fact, one can do this before the study, and one should. This is a test of the study’s detection ability. Of course, one can do the same calculation after the study is done, which doesn’t make it any less legitimate, though a post-experiment version has less to offer since we are also informed by the study’s confidence intervals.
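
    To make the distinction concrete, here is a minimal sketch in Python. The numbers are illustrative assumptions, not OHIE figures or our published calculations. It computes the power of a two-sided test at the 5 percent level for a hypothesized true effect, given a study’s standard error: evaluated at an insignificant observed estimate, power is mechanically below about 50 percent; evaluated at a larger effect size drawn from prior work, it is a genuine test of detection ability.

        # A minimal sketch, not our published calculation. All numbers are
        # illustrative assumptions.
        from scipy.stats import norm

        def power_two_sided_z(effect, se, alpha=0.05):
            """Power to reject H0: effect = 0 when the true effect is `effect`."""
            z_crit = norm.ppf(1 - alpha / 2)      # 1.96 at alpha = 0.05
            lam = effect / se                     # mean of the test statistic under the truth
            return norm.cdf(lam - z_crit) + norm.cdf(-lam - z_crit)

        se = 1.0                                  # assumed standard error
        observed = 1.2 * se                       # an insignificant point estimate (z = 1.2)
        expected = 2.8 * se                       # an effect size suggested by prior work

        print(power_two_sided_z(observed, se))    # ~0.22; below about 0.5 whenever the estimate is insignificant
        print(power_two_sided_z(expected, se))    # ~0.80; power for the pre-study expectation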

    So, what does a post-experiment power calculation based on expected effect sizes from pre-study literature offer beyond what can be inferred from confidence intervals? First, there is a scientific contribution: it conveys the sample size needed for future studies, given a specified false negative rate (or power level). It cannot be denied that this is of some value.

    Second, it can be more accurate in estimating that sample size than a pre-study power calculation could be. That’s because after the study one can incorporate study features that could not have been known in advance. I’m specifically thinking of the degree of cross-over between treatment and control (in this case, those randomized to Medicaid and those not). This is precisely what reduces power under the instrumental variables design. One could guess in advance what that power reduction would be. But after the study, one knows precisely what it is, as I’ve shown. If one wanted to design the next, similar study, one would absolutely want to incorporate this effect. And we have.
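
    Here is a minimal sketch of both points, again with assumed inputs rather than the study’s own numbers. It computes the per-arm sample size a future study would need to detect an expected effect with 80 percent power, then shows how treatment-control cross-over (imperfect take-up of coverage among those randomized to it) dilutes the intent-to-treat effect and inflates the requirement by roughly one over the square of the first stage.

        # A minimal sketch, not the study's design calculation. All inputs are
        # assumptions for illustration.
        from scipy.stats import norm

        def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
            """Approximate per-arm n to detect p_control vs. p_treated (two-sided test)."""
            z_a = norm.ppf(1 - alpha / 2)
            z_b = norm.ppf(power)
            var = p_control * (1 - p_control) + p_treated * (1 - p_treated)
            return (z_a + z_b) ** 2 * var / (p_treated - p_control) ** 2

        p0, p1 = 0.20, 0.15     # assumed control-group rate and expected rate under coverage
        first_stage = 0.25      # assumed winner-loser difference in actual Medicaid take-up

        n_full_compliance = n_per_arm(p0, p1)        # if assignment equaled coverage
        p1_itt = p0 + first_stage * (p1 - p0)        # intent-to-treat effect is diluted by the first stage
        n_with_crossover = n_per_arm(p0, p1_itt)     # roughly n_full_compliance / first_stage**2

        print(round(n_full_compliance), round(n_with_crossover))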

    Third, there is a rhetorical contribution: it is, in part, a re-expression of something that can be inferred from the confidence interval, namely whether the study was powered to detect the expected effect size. We appreciate — and I have just stated — that this is not a direct, novel scientific contribution. But conveying scientific results in ways that may be better understood or appreciated by a wider audience is part of the dissemination mission. Sometimes saying the same thing in a different way is helpful. So long as such a transformation does not misrepresent the work, it is not harmful. It is not a worthless exercise. As I’ve said, our work does not misrepresent the study’s findings, and the experts I corresponded with did not contradict that. Nor, by the way, did the study authors, with whom we shared our work at multiple stages of progress.

    In an exchange about power calculations in general, Alan Zaslavsky, Professor of Health Care Policy, an Affiliate of the Statistics Department at Harvard, and a leading expert in his field, wrote me:

    [A power calculation] might be relevant as part of a review of what the investigators might reasonably have anticipated [in advance]. For example after such an analysis one might say that a study should have been larger, although in this case that option was unavailable, or that it was so unlikely to detect effects of plausible sizes for some outcomes that they should not have been included in the analytic plan.

    In a post on The New York Times website that cites our work on this (though not the latest and greatest version of it), Casey Mulligan agreed that the study was underpowered and explained the implications for interpretation.

    The only way the study could have found a statistically significant result here would be for Medicaid to essentially eliminate an important symptom of diabetes in a two-year time frame. Medicaid coverage could be quite valuable without passing that standard (even the Supreme Court has looked at this issue and concluded that statistical significance is not the only reliable standard of causation).

    The authors of the study appear to be aware of these issues, because they note toward its end, “Our power to detect changes in health was limited by the relatively small numbers of patients with these conditions.” […]

    If the Oregon study prevents even one state from expanding its Medicaid program, Affordable Care Act proponents could assert that [] emphasis on statistical significance has proven to be deadly. Even if you think, as I do, that the law has fatal flaws, the Oregon study of Medicaid is not the place to find them. [Bold added]

    As is clear from Mulligan’s words, the study authors themselves recognized the power limitations our calculations illuminate.

    Given all this, I do think it is safe to say that experts agree that the OHIE study was objectively underpowered for expected effect sizes on physical health measures (perhaps excluding the Framingham risk score, for which there is insufficient prior information to draw a conclusion). I do think it is deceptive to comment on or draw inferences from those results of the study without explicitly acknowledging this fact. Being upfront about this limitation of this study is all we are trying to persuade the world to do. At this point, not doing so gives the impression of deliberately trying to mislead, which is certainly not something we can tolerate here.

    @afrakt

    • Thank you for going to the trouble of assessing the value of post-experiment power calculations. I note that in their abstract Goodman and Berlin state: “power should play no role once the data have been collected,” and that is what I have heard over and over again, though admittedly I have very little experience in this area and therefore cannot draw any personal conclusions. That is why I wanted to know more.

      You write that others “could not point to any problems with the type of power calculation we have offered”, but did they say that Goodman and Berlin were wrong? Were they saying that the power calculations added something, and if so, what? We have confidence intervals, and I suspect everyone wished the sample sizes were greater.

      “it conveys the sample size needed for future studies,”

      I agree, but that would be part of a pre-experiment procedure, and this study did not have the luxury of adding patients, as the investigators were dealing with a real-life situation that was handed to them. Real life that is randomized is extremely powerful.

      As you say in your second point, it will be more accurate than a pre-study power calculation for the study that follows, but that too involves a pre-experiment procedure, not a post-experiment one.

      The third point, involving your statement that “conveying scientific results in ways that may be better understood or appreciated by a wider audience is part of the dissemination mission,” is something I believe the authors of the first study were wary of. I thought they implied that post-experiment power calculations being given too much weight was a danger and one of the reasons they shouldn’t be done. Isn’t that one of the reasons why we have confidence intervals?

      I don’t think either of the two quotes provided adds anything more to the discussion than was already recognized by everyone. In fact, Mulligan summarized the point in question by quoting the original authors: “Our power to detect changes in health was limited by the relatively small numbers of patients with these conditions.” That doesn’t respond to the underlying claim that “power should play no role once the data have been collected.”

      I think all my important comments have in general been confirmed by the added Goodman and Berlin paper, and even by the words of those you quote here in your response.

      Thank you for answering my questions.

      Emily

      • Glad to be of service. Acknowledging all the fine points and questions you raised, it’s the final paragraph of the post that makes the most important point. That is, even if one does have the position that there is zero additional value in any post-experiment power calculations of any type, the study was still objectively underpowered (in a sense observable from the confidence intervals, if you like) to detect expected effect sizes. I can think of no valid argument against that.

        Beyond that, I have no further comment on this subject at this time. But, should I learn more, I will return to it. All the best.

        • Thank you, Austin. Yes, we agree on the final point, which was mentioned in the original article as well. I think we would also both solidly agree that no study is perfect, and that is why we should always integrate other studies as best we can before drawing any final conclusion, which we recognize could change radically with further information.

    • “It’s merely a re-expression of the statistical insignificance of the finding.”

      Don’t underestimate re-expression.

      Ask a lay person to interpret an NNT of 20 vs. a 5% chance of benefit (two expressions of the same quantity) and you will get different answers or levels of comprehension.

      You can do the same with more sophisticated, but less statistically versed, providers and engaged observers. Only in this case, you use power.

      You speak to a larger community than number crunchers and data wonks. Maybe Mulligan’s post and EconTalk would never have occurred without the original queries above.

      However, I get your point.

    • In my own work, I try to frame power calculations in terms of cost/benefit whenever possible (which can be hard in health: how much is a reduction in mental illness worth?). If we know that at a particular effect size the policy becomes net-beneficial, then we want a power calculation around this effect size. A statistically significant finding may not prove it is net-beneficial, but at least we’d know that an insignificant result is evidence that it is not net-beneficial. Hence, if the result is insignificant, we know it is acceptable, for decision purposes, to treat it as if the effect size were zero.
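
      A minimal sketch of that framing, with every number hypothetical: power the study at the break-even effect size, the smallest effect at which the program’s benefit covers its cost.

          # A minimal sketch; all dollar figures and parameters are hypothetical.
          from scipy.stats import norm

          def n_per_arm(delta, sd, alpha=0.05, power=0.80):
              """Per-arm n for a two-sample comparison of means (two-sided test)."""
              z_a = norm.ppf(1 - alpha / 2)
              z_b = norm.ppf(power)
              return 2 * (z_a + z_b) ** 2 * (sd / delta) ** 2

          cost_per_person = 3000.0           # hypothetical program cost per enrollee
          value_per_unit_effect = 60000.0    # hypothetical dollar value of one unit of health gain
          break_even_effect = cost_per_person / value_per_unit_effect   # 0.05 units

          print(round(n_per_arm(delta=break_even_effect, sd=0.5)))      # ~1570 per arm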