• Truth and power, in charts

    The following is a lightly edited version of the contents of an email by reader Rob Maurer. He’s an Associate Professor of Health Systems Management at Texas Woman’s University. In addition to nearly all the words, the charts are his. 

    I looked at the Hoenig & Heisey and the Goodman & Berlin papers that Austin cited. I suspect that the difference between what they discuss and what he and colleagues did might be more clearly expressed in charts.

    The essence of Goodman & Berlin’s argument is the following (p. 202):

    The power of an experiment is the pretrial probability of all nonsignificant outcomes taken together (under a specified alternative hypothesis), and any attempt to apply power to a single outcome is problematic. When people try to use power in that way, they immediately encounter the problem that there is not a unique power estimate to use; there is a different power for each underlying difference.

    To illustrate, start with an ex ante power calculation as depicted in Figure 1. The blue curve is the null distribution (e.g., no Medicaid) and the green curve is the distribution for effect size = 1 (e.g., with Medicaid). The shaded red(ish) area is the type 1 error rate (5%), the shaded green area is the test power, and the black striped area is the p-value for observed effect size = 1 (which results in a failure to reject the null since it is larger than 5%). Clearly, every choice of ex ante effect produces a different test power.

    power calc fig 1
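    A minimal sketch of the kind of ex ante calculation Figure 1 depicts. It assumes a one-sided test with normally distributed estimates, a null effect of 0, and a standard error of 1. The standard error is a hypothetical value chosen for illustration and is not taken from the study; the function names are likewise illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def critical_value(se, alpha=0.05):
    """Smallest estimate that rejects the null (one-sided test, null effect = 0)."""
    return STD_NORMAL.inv_cdf(1 - alpha) * se


def power(effect, se, alpha=0.05):
    """Probability of rejecting the null when the true effect equals `effect`."""
    return 1 - NormalDist(mu=effect, sigma=se).cdf(critical_value(se, alpha))


def p_value(observed, se):
    """One-sided p-value of an observed estimate under the null distribution."""
    return 1 - NormalDist(mu=0, sigma=se).cdf(observed)


SE = 1.0  # hypothetical standard error, an assumption for illustration only

# Each assumed ex ante effect gives a different power for the same test.
for effect in (0.5, 1.0, 2.0):
    print(f"assumed effect {effect}: power = {power(effect, SE):.3f}")

# The p-value of an observed effect of 1 exceeds 5%, so the null is not rejected.
print(f"p-value for observed effect 1.0: {p_value(1.0, SE):.3f}")
```

    Under these assumptions the power changes with every assumed effect, which is the ambiguity Goodman & Berlin describe.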

    As for post-experiment (or post hoc) power, Goodman & Berlin then comment (p. 202):

    To eliminate this ambiguity, some researchers calculate the power with respect to the observed difference, a number that is at least unique. This is what is called the “post hoc power.” … The unstated rationale for the calculation is roughly as follows: It is usually done when the researcher believes there is a treatment difference, despite the nonsignificant result.

    In my example, the post hoc power calculation for an observed effect size = 1 is depicted in Figure 2. For the observed effect to reach the 5% type 1 error threshold, the standard error must be smaller, which implies that we need a larger sample. Note that this is equivalent to saying that, in this case, the p-value and the type 1 error rate have the same value.

    power calc fig 2
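    A rough sketch of the post hoc reasoning, under the same illustrative assumptions (one-sided normal test, null effect of 0, hypothetical current standard error of 1). It also assumes the standard error scales as 1/sqrt(n), which holds for a simple mean but is an assumption here.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

ALPHA = 0.05
OBSERVED = 1.0    # observed effect size from the example
SE_CURRENT = 1.0  # hypothetical current standard error (assumption)

z_crit = Z.inv_cdf(1 - ALPHA)

# Standard error at which the observed effect sits exactly on the 5% threshold.
se_needed = OBSERVED / z_crit

# If se is proportional to 1/sqrt(n), the sample must grow by this factor.
sample_multiplier = (SE_CURRENT / se_needed) ** 2

# At that standard error the one-sided p-value equals the type 1 error rate.
p_at_needed = 1 - Z.cdf(OBSERVED / se_needed)

# Power computed "at the observed effect" once the effect sits on the threshold.
post_hoc_power = 1 - NormalDist(OBSERVED, se_needed).cdf(z_crit * se_needed)

print(f"standard error needed: {se_needed:.3f}")
print(f"sample size multiplier: {sample_multiplier:.2f}")
print(f"p-value at that standard error: {p_at_needed:.3f}")
print(f"power at the observed effect: {post_hoc_power:.3f}")
```

    In this sketch the p-value and the type 1 error rate coincide by construction, so the calculation restates the p-value rather than adding information to it.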

    Goodman & Berlin go on to observe (p. 202):

    The notion of the P value as an “observed” type I error suffers from the same logical problems as post hoc power. Once we know the observed result, the notion of an “error rate” is no longer meaningful in the same way that it was before the experiment.

    This comment, identifying a post hoc power calculation with the mistake of identifying the p-value with an observed type 1 error rate, makes clear what Goodman & Berlin have in mind when they say “power should play no role once the data have been collected.” Figure 2 shows how the two are related.

    What I think Austin et al. are doing with their power calculations is depicted in Figure 3 (which is identical to Figure 1 except effect size = 2). I labeled the figure as a post hoc calculation of the ex ante test power to emphasize that, yes, it is after-the-fact but is focused on ex ante power. What this shows is that, given the existing sample size, one needs a large effect to have sufficient power to reject a false null.

    power calc fig 3
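    The point that a fixed sample size requires a large effect for adequate power can be sketched as follows. The assumptions are again a one-sided normal test and a hypothetical standard error of 1; the 80% power target is a common convention, not a figure from the study.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def power(effect, se, alpha=0.05):
    """Power of a one-sided test (null effect = 0) against a true effect."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - effect / se)


def minimum_detectable_effect(se, alpha=0.05, target_power=0.80):
    """Smallest true effect that reaches `target_power` at a fixed standard error."""
    return (Z.inv_cdf(1 - alpha) + Z.inv_cdf(target_power)) * se


SE = 1.0  # hypothetical standard error, held fixed (sample size unchanged)

# Holding the sample fixed, power rises only as the assumed effect grows.
for effect in (1.0, 2.0, 3.0):
    print(f"assumed effect {effect}: power = {power(effect, SE):.3f}")

mde = minimum_detectable_effect(SE)
print(f"effect needed for 80% power: {mde:.3f}")
```

    Here the sample size never changes; only the assumed ex ante effect does, which is what separates this calculation from the post hoc one in Figure 2.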

    Where I think the discussion in the comments on this blog has gone off track is that “larger effect size” and “larger sample size” get confused. Goodman & Berlin are arguing against a post hoc justification for a larger sample size (Figure 2), whereas I think Austin et al. are arguing that a larger ex ante effect is required to justify a failure to reject the null given the sample size used in the study (Figure 3).

    One can make the same point by saying that the study did not have sufficient ex ante power (which could have been addressed by increasing the sample size), but that is not the same as a post hoc power calculation.

    I think what Austin et al.’s argument boils down to is that, given the small sample size and resulting standard error, there is a range of small effect sizes that produce an ambiguous result. The ex ante effect size they obtained from the literature can be interpreted to suggest that the observed effect falls in this ambiguous range, hence the need for more power.

    • Rob Maurer, I think I appreciate at least part of what you said. I believe that in my first comment on this subject I found the discussion of post hoc power calculations interesting and wondered about their use as an alternate way of interpreting results. But we were dealing with a specific study, The Oregon Study, and we all knew the big problem that existed. The authors clearly stated: “Our power to detect changes in health was limited by the relatively small numbers of patients with these conditions.” and we have confidence levels.

      I listened as best I could to the explanations and decided to go to the literature. The literature I read seemed quite clear to me and was then confirmed by another source from Austin: “power should play no role once the data have been collected”. Some of the other work emphasized that very specific rule because confidence levels and other methods existed, and there seemed to be considerable fear that the mostly repetitive power calculation would be given more weight than it should be. That apparent weight (from my end) given in the first several discussions made me very wary of the calculation.

      I think I understand the nuance you are proposing, and perhaps that was behind part of my thinking when I at first thought the power calculation could attach numeric values to something that is not entirely clear. However, after reading everything and listening to your explanation, which I appreciate, I think the explanation leads me to a better understanding than the numeric calculation, which can mean too many things, including the things warned against by the authors of the papers that were presented.

      PS: Please note your own words (emphasis mine): “I think what Austin et al.’s argument boils down to…”

      I am not saying that wasn’t Austin’s idea, rather that if you had to think about what was meant, then I had to think many times harder just to get near that conclusion. I think that particular nuance is recognizable without the power calculation and was known in advance, as noted by the authors’ comment quoted above.