• Truth and power, in charts

    The following is a lightly edited version of the contents of an email by reader Rob Maurer. He’s an Associate Professor of Health Systems Management at Texas Woman’s University. In addition to nearly all the words, the charts are his. 

    I looked at the Hoenig & Heisey and the Goodman & Berlin papers that Austin cited. I suspect that the difference between what they discuss and what he and colleagues did might be more clearly expressed in charts.

    The essence of Goodman & Berlin’s argument is the following (p. 202):

    The power of an experiment is the pretrial probability of all nonsignificant outcomes taken together (under a specified alternative hypothesis), and any attempt to apply power to a single outcome is problematic. When people try to use power in that way, they immediately encounter the problem that there is not a unique power estimate to use; there is a different power for each underlying difference.

    To illustrate, start with an ex ante power calculation as depicted in Figure 1. The blue curve is the null distribution (e.g., no Medicaid) and the green curve is the distribution for effect size = 1 (e.g., with Medicaid). The shaded red(ish) area is the type 1 error rate (5%), the shaded green area is the test power, and the black striped area is the p-value for observed effect size = 1 (which results in a failure to reject the null since it is larger than 5%). Clearly, every choice of ex ante effect produces a different test power.

    [Figure 1: ex ante power calculation]
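    To put rough numbers on Figure 1, here is a minimal Stata sketch under one simplifying assumption (the estimated effect has a standard error of 1, so “effect size = 1” means an effect of one standard error):

    * Hypothetical numbers for Figure 1; assumes the estimated effect has standard error 1.
    local crit = invnormal(0.975)   /* two-sided 5% critical value */
    display "power at true effect = 1: " normal(1-`crit')+normal(-1-`crit')
    display "p-value at observed effect = 1: " 2*(1-normal(1))

    The power comes out to about 0.17 and the p-value to about 0.32, consistent with the figure’s story: a true effect of one standard error is unlikely to be detected, and an observed effect of that size fails to reach significance.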

    As for post-experiment (or post hoc) power, Goodman & Berlin then comment (p. 202):

    To eliminate this ambiguity, some researchers calculate the power with respect to the observed difference, a number that is at least unique. This is what is called the “post hoc power.” … The unstated rationale for the calculation is roughly as follows: It is usually done when the researcher believes there is a treatment difference, despite the nonsignificant result.

    In my example, the post hoc power calculation for an observed effect size = 1 is depicted in Figure 2. The fact that we need a smaller standard error to increase the power (holding the type 1 error threshold at 5%) implies that we need a larger sample. Note that this is equivalent to saying that, in this case, the p-value and the type 1 error rate have the same value.

    [Figure 2: post hoc power calculation]
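    Continuing the same hypothetical numbers (again, assuming a unit standard error), one can back out how much the standard error, and hence the sample, would have to change for the observed effect of 1 to sit exactly at the 5% threshold, as in Figure 2:

    * Continuing the Figure 1 hypothetical (assumed unit standard error).
    local crit = invnormal(0.975)
    display "required standard error: " 1/`crit'
    display "implied sample size multiple (variance scales as 1/n): " `crit'^2

    The standard error would have to fall to about 0.51; that is, the sample would have to be roughly 3.8 times larger.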

    Goodman & Berlin go on to observe (p. 202):

    The notion of the P value as an “observed” type I error suffers from the same logical problems as post hoc power. Once we know the observed result, the notion of an “error rate” is no longer meaningful in the same way that it was before the experiment.

    This comment, identifying a post hoc power calculation with the mistake of identifying the p-value with an observed type 1 error rate, makes clear what Goodman & Berlin have in mind when they say “power should play no role once the data have been collected.” Figure 2 shows how the two are related.

    What I think Austin et al. are doing with their power calculations is depicted in Figure 3 (which is identical to Figure 1 except effect size = 2). I labeled the figure as a post hoc calculation of the ex ante test power to emphasize that, yes, it is after-the-fact but is focused on ex ante power. What this shows is that, given the existing sample size, one needs a large effect to have sufficient power to reject a false null.

    [Figure 3: post hoc calculation of the ex ante test power]

    Where I think the discussion in comments to posts on this blog about this subject has gone off track is that “larger effect size” and “larger sample size” get confused. Goodman & Berlin are arguing against a post hoc justification for a larger sample size (Figure 2) where I think Austin et al. are arguing that a larger ex ante effect is required to justify a failure to reject the null given the sample size used in the study (Figure 3).

    One can make the same point by saying that the study did not have sufficient ex ante power (which could have been addressed by increasing the sample size), but that is not the same as a post hoc power calculation.

    I think what Austin et al.’s argument boils down to is that, given the small sample size and resulting standard error, there is a range of small effect sizes that produce an ambiguous result. The ex ante effect size they obtained from the literature can be interpreted to suggest that the observed effect falls in this ambiguous range, hence the need for more power.

     
  • Truth and power

    In the comments, Emily has questioned whether the power calculations we have done for the Oregon Health Insurance Experiment (OHIE) add anything useful to the discussion. (For those calculations look here, here, and here.) She suggests that we might consult with experts to address this question, pointing us to the work of Hoenig and Heisey (ungated PDF here). Related work has been published by Goodman and Berlin. Both papers describe limitations of post-experiment power calculations.

    I have emailed these authors to solicit opinions of our work. I have also emailed several other experienced biostatisticians recommended to me by colleagues. Though not all of these authors and experts responded to my inquiry, those that did could not point to any problems with the type of power calculation we have offered as most relevant and legitimate. (Ironic full disclosure: I am not fully disclosing who replied and said what because I did not obtain explicit authorization to do so. See below for some direct quotes attributed to other experts.)

    One problematic use of post-experiment power calculations is a particular method of attempting to pin the blame for statistically insignificant findings on sample size. One does this by computing power for a study’s statistically insignificant point estimate, a calculation that is guaranteed to show underpowering. This is not, by itself, useful. It’s not, by itself, a test of the power of the study. It’s merely a re-expression of the statistical insignificance of the finding.
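    To see why it is guaranteed, note that under a normal approximation (a simplification of mine, ignoring the far tail), power evaluated at the observed estimate $\hat{\delta}$ with standard error $s$ is roughly

    $$\Phi\!\left(\frac{|\hat{\delta}|}{s} - z_{0.975}\right),$$

    which is below 0.5 whenever $|\hat{\delta}|/s < z_{0.975}$, that is, whenever the estimate is statistically insignificant.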

    What is more useful is to compute power for the effect size one expected before the study was done. Doing so addresses the “by itself” caveats in the prior paragraph, because it brings in new information that depends not on the study in question but on prior work. In fact, one can do this before the study, and one should. This is a test of the study’s detection ability. Of course, one can do the same calculation after the study is done, which doesn’t make it any less legitimate, though a post-experiment version has less to offer since we are also informed by the study’s confidence intervals.

    So, what does a post-experiment power calculation based on expected effect sizes from pre-study literature offer beyond what can be inferred from confidence intervals? First, there is a scientific contribution: it conveys the sample size needed for future studies, given a specified false negative rate (or power level). It cannot be denied that this is of some value.

    Second, it can be more accurate in estimating that sample size than a pre-study power calculation could be. That’s because after the study one can incorporate study features that could not have been known in advance. I’m specifically thinking of the degree of treatment-control (in this case, randomized to Medicaid and not) cross-over. This is precisely what leads to reduction in power due to the instrumental variables design. One could guess in advance what that power reduction would be. But after the study, one knows precisely what it is, as I’ve shown. If one wanted to design the next, similar study, one would absolutely want to incorporate this effect. And we have.

    Third, there is a rhetorical contribution: it is, in part, a re-expression of something that can be inferred from the confidence interval, namely whether the study was powered to detect the expected effect size. We appreciate — and I have just stated — that this is not a direct, novel scientific contribution. But conveying scientific results in ways that may be better understood or appreciated by a wider audience is part of the dissemination mission. Sometimes saying the same thing in a different way is helpful. So long as such a transformation does not misrepresent the work, it is not harmful. It is not a worthless exercise. As I’ve said, our work does not misrepresent the study’s findings, and the experts I corresponded with did not contradict that. Nor, by the way, did the study authors, with whom we shared our work at multiple stages of progress.

    In an exchange about power calculations in general, Alan Zaslavsky, Professor of Health Care Policy, an Affiliate of the Statistics Department at Harvard, and a leading expert in his field, wrote me:

    [A power calculation] might be relevant as part of a review of what the investigators might reasonably have anticipated [in advance]. For example after such an analysis one might say that a study should have been larger, although in this case that option was unavailable, or that it was so unlikely to detect effects of plausible sizes for some outcomes that they should not have been included in the analytic plan.

    In a post on The New York Times website that cites our work on this (though not the latest and greatest version of it), Casey Mulligan agreed the study was underpowered and explained the implications for interpretation.

    The only way the study could have found a statistically significant result here would be for Medicaid to essentially eliminate an important symptom of diabetes in a two-year time frame. Medicaid coverage could be quite valuable without passing that standard (even the Supreme Court has looked at this issue and concluded that statistical significance is not the only reliable standard of causation).

    The authors of the study appear to be aware of these issues, because they note toward its end, “Our power to detect changes in health was limited by the relatively small numbers of patients with these conditions.” […]

    If the Oregon study prevents even one state from expanding its Medicaid program, Affordable Care Act proponents could assert that [] emphasis on statistical significance has proven to be deadly. Even if you think, as I do, that the law has fatal flaws, the Oregon study of Medicaid is not the place to find them. [Bold added]

    As is clear from Mulligan’s words, the study authors themselves recognized the power limitations our calculations illuminate.

    Given all this, I do think it is safe to say that experts agree that the OHIE study was objectively underpowered for expected effect sizes on physical health measures (perhaps excluding the Framingham risk score, for which there is insufficient prior information to draw a conclusion). I do think it is deceptive to comment on or draw inferences from those results of the study without explicitly acknowledging this fact. Being upfront about this limitation of this study is all we are trying to persuade the world to do. At this point, not doing so gives the impression of deliberately trying to mislead, which is certainly not something we can tolerate here.

    @afrakt

     
  • My reply to Jim Manzi

    In a follow-up to my EconTalk discussion with Russ Roberts about the Oregon Health Insurance Experiment (OHIE), he interviewed Jim Manzi about it this week. Russ invited me to submit a written response to the interview, which you’ll find below. It is linked to from the EconTalk page for the Manzi interview. I have no substantial disagreement with most of the content of the interview. The purpose of this reply is to add more information, not to debate any points.

    Like Jim, I did not have any numerically specific prior views about how much expansion of Medicaid would affect the physical health of non-elderly adults over two years. However, as Jim pointed out, the OHIE investigators suggested that we compare their diastolic blood pressure change results to findings from specific, prior work. Since my conversation with Russ, my colleague and physician Aaron Carroll examined that prior work and shared his thoughts in two posts. He concluded that, for a variety of reasons, we should not have expected the OHIE to reveal the size of change observed in those prior studies, the approximately 5 mm Hg in diastolic blood pressure change that Jim suggested as a rough average. You can read the details at the links for yourself.*

    A key point is that blood pressure reduction should only be expected in a population with initially elevated blood pressure, which was the focus of the prior literature referenced above. In contrast, the headline OHIE result is for all study subjects, only a small percentage of whom had elevated blood pressure at baseline. Unfortunately, there is no reported OHIE subanalysis focused exclusively on subjects with hypertension at time of randomization. Depending on which metrics from the published results you examine, between 3% and 16% of the sample had elevated blood pressure at baseline. Taking the high end, 16% x 5 mm Hg = 0.8 mm Hg is in the ballpark of a reasonable expectation of the reduction in diastolic blood pressure the OHIE could have found (it was also the study’s point estimate) were it adequately powered to do so. Was it?

    I worked with Aaron and fellow health economist Sam Richardson on this question. We found that the study had 80% power (the standard minimum for clinical studies) to detect a change in diastolic blood pressure of 2.82 mm Hg. Put another way, this means that the probability of failing to detect a true change of this size, the false negative rate, is 20%. For the more reasonable, expected 0.8 mm Hg change calculated above, the probability of a false negative is about 86%, or 14% power. This is underpowered by any reasonable criterion.
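    For those who want to check the arithmetic, here is a rough, normal-approximation sketch of how those two power figures relate (a back-of-the-envelope version, not our full calculation, which also handles details like the study’s survey weights):

    * If the study has 80% power for a 2.82 mm Hg effect at two-sided alpha = 0.05,
    * the implied standard error of the estimated effect is:
    local se = 2.82/(invnormal(0.975)+invnormal(0.80))
    display "implied standard error: " `se'
    * Approximate power for the expected 0.8 mm Hg effect:
    display "approximate power at 0.8 mm Hg: " normal(0.8/`se'-invnormal(0.975))

    This crude version comes out near 12%, in the same ballpark as the 14% figure from the fuller calculation.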

    But that’s just diastolic blood pressure. What about another measure? In the discussion of their paper, the investigators calculated the reduction in glycated hemoglobin level one might have expected from the clinical literature, 0.05 percentage points. That’s well within the 95% confidence interval of their estimate and corresponds to a false negative rate of 75%, or 25% power. So, the study was underpowered for this measure too.

    Aaron, Sam, and I have calculated power for other physical health measures reported from the OHIE and will share results soon. If you can’t wait, I have posted methods so you can do power calculations yourself. This is possible because power analysis methods are well-known and all necessary parameters are readily available in the published paper. The only difference between the methods I’ve posted and what Aaron, Sam, and I are doing is that we are incorporating a few higher-order nuances, like adjusting for the effect of the study’s survey weights.

    Now for the lightning round. Here are a few quick responses to other aspects of the interview:

    • Jim noted that 40% of lottery winners didn’t apply for Medicaid coverage. As discussed, this might be due to an expectation of low value from Medicaid or lack of follow-through skills (jointly, low prudence). However, half of the 60% who did apply were deemed ineligible. The investigators report that this was largely due to income above the 100% FPL threshold, but other potential reasons include moving out of state, securing other coverage within a six-month look-back period, or aging out of eligibility. One might reasonably presume a similar proportion (half) of those who did not apply would have also been ineligible. Perhaps some knew that to be the case and spared themselves the fruitless exercise of completing the forms. It seems reasonable to me that those capable of weighing the value of Medicaid would also know whether their incomes were too high, whether they had moved out of state or secured other health insurance coverage, or whether they were too old. Therefore, it is likely that substantially fewer than 40% of the non-applicants suffered a lack of prudence. Judging from the proportion of applicants deemed ineligible, perhaps the number of imprudent non-applicants is closer to 20%. This is speculation, but no less plausible than what Jim or Russ offered.
    • The RAND Health Insurance Experiment was not a study of health insurance coverage since it did not include any uninsured subjects. It was a study of cost sharing, capped at $1,000 (circa mid-1970s dollars) for all participants.
    • The OHIE depression reduction result was not observed largely or entirely in the first month after enrollment. The investigators didn’t conduct a depression screen in a one-month survey, but did in later surveys, as Adrianna McIntyre explains. However, self-reported health did improve substantially in the first month.
    • To what extent the findings are informative about Obamacare’s Medicaid expansion would be an excellent topic of discussion. Neither Jim, Russ, nor I got into this question very deeply. It’s properly a question of external validity, not bias, which is something else.

    In conclusion, I applaud Russ for devoting two episodes to the OHIE. It is an important study, both for its subject and methods, and it deserves at least that much attention. I also praise Jim for his addition of substantial value to the conversation. I hope I have helped clarify a few points.

    * Aaron’s posts largely focus on systolic blood pressure, though diastolic is mentioned and is also included in the cited studies. Suffice it to say, the same issues of expected effect size and insufficient power arise for systolic blood pressure as I discuss for diastolic. I focused on diastolic because it is what Jim mentioned and the lead investigator emailed me about.

    @afrakt

     
  • Bias, validity, and terminology

    After posting and editing the following, I realized that I should promote (again) Mostly Harmless Econometrics by Angrist and Pischke. It covers in great detail issues raised below, and so much more. I’ve clearly forgotten some of its contents. I could not easily find answers to my questions below in the time available this morning. So, let’s crowd source them. 

    This is why I blog, and why I blog in the style I do. After some back-and-forth in the comments to my post on bias yesterday (go read those comments), Matt offers more precise terminology — distinguishing bias from internal and external validity — within a differently organized discussion. I like what he’s done. My comments and questions are interleaved with his. Let’s keep discussing this!

    (1) What internally valid estimates can we obtain from the Oregon Study?

    We can obtain the effect of winning the lottery (ITT) and the effect on the population that gained insurance due to winning the lottery (LATE).

    We cannot obtain the ATE [average treatment effect] or the TOT [effect of treatment on the treated]; the seemingly natural estimators of these quantities are biased since the populations we are comparing differ due to self-selection. The ITT and LATE avoid this problem because they scrupulously _solely_ compare the full group of lottery winners to the full group of lottery losers.

    I agree that LATE exploits the lottery, but does it really compare the full groups of winners to losers? My understanding is it compares the two groups of compliers, as I wrote. That’s the difference between ITT and LATE.

    (2) What internally valid estimates can we obtain from alternative study designs? How do they differ?

    From a perfect compliance RCT, we can estimate the ATE for the study population. Relative to the group covered by the LATE, the group covered by the ATE also includes: (A) the types of people who still enroll if they lose; and (B) the people who will not enroll even if they win.

    Point of clarification: With perfect compliance, there are no people who still enroll if they lose. There are no people who do not enroll even if they win. However, those groups exist in an RCT without full compliance and, as I wrote above, LATE filters out their effect. Under full compliance, LATE is the same as ATE, which is the same as ITT. The way I’d put this is not that ATE includes these noncompliant groups. I’d say that the ITT and LATE estimates are the ATE in a fully-compliant RCT. They are not in an RCT without full compliance. Continuing with Matt’s section (2):

    From an Oregon-like study in which we forbid enrollment by lottery losers, we can obtain the TOT. Relative to the group covered by the LATE, the group covered by the TOT adds (A) from above but not (B).

    TOT compares treated with untreated. I can think of three ways to do this, and I’m not certain which one we call TOT. Way 1 compares all treated to untreated, regardless of random assignment. Way 2 does so only for those assigned to treatment. Way 3 does so only for those assigned to control. Which one is TOT? Why does it incorporate a group like (A) but not (B)? Of the three versions I suggested, the only one consistent with that is way 3. But way 3 is the one that sounds least likely to be what one means by TOT. (I’d order it as way 1 > way 2 > way 3.) Still continuing under his section (2):

    The LATE/ATE/TOT difference is not about bias. Each average treatment effect is perfectly valid for the population it pertains to; those populations are just different.

    I agree that the differences among these types of estimates have nothing to do with bias. However, Matt wrote early in his comment (way above), “the seemingly natural estimators of these quantities are biased since the populations we are comparing differ due to self-selection.” I think we need to explore this more. What does “seemingly natural” mean? It appears to be doing a lot of work here. When do we say we have obtained a biased estimate? Can we give a precise example? Does it merely mean we haven’t really computed one of them properly?

    (3) Which estimates have the greatest external validity for the policy questions of current interest?

    This is a hard question. The answer depends on whether the group affected by our proposed policy looks more like the group included in the LATE or the full population. How do we think about that?

    Insert your existing discussion of this point.

    Matt is either referring to my comments to my post or the post itself. Either way, I presume you’ve read them.

    @afrakt

     
  • For economist/biostats geeks – ctd x2 (with intro for non-geeks)

    Sorry for the false start last night. More about that here. I still welcome comments if you find any errors.

    This is a follow up to my several prior posts on how to adjust a power calculation to account for an instrumental variable (IV) design. The details are in a new PDF. (If you downloaded the one I posted last night, replace it with this one.) First, for the non-geeks:

    • Skip the proof and jump to the example that begins on page two of the PDF. It runs through the numbers for the Medicaid study result for glycated hemoglobin (GH), which I had used to illustrate the power issues in my first post on this topic. (It’s a commentary on this blog’s readership that I can even consider this example suitable for non-geeks. I guess I mean geeks of a different order.)
    • One thing you may notice is that the Medicaid and non-Medicaid groups are different sizes than you might have expected if you only read the paper and not the appendix. I refer you to appendix table S9 for the details. Suffice it to say, it is not true that 24.1% of the lottery winners took up Medicaid. There were a lot more Medicaid enrollees than that. (What is true is that 24.1% more lottery winners took up Medicaid than non-winners.)
    • For that reason, and because I was targeting 95% power, my estimate in my first post was quite a bit off. I thought the study was underpowered by a factor of 5 for the GH measure. Actually, according to the methods in that post, and using the new numbers and targeting 80% power (which, I am told, is more standard), the study is only underpowered by a factor of 1.5.
    • But, as I wrote in that post, I had not accounted for the IV design. The new calculation does so. And that, my friends, really wallops power and precision. The bottom line is, accounting for the design, the GH analysis was underpowered by about  a factor of 23 (yes, twenty-three!) meaning it’d have needed that multiple of sample to be able to detect a true Medicaid effect with 80% probability.
    • You can run the numbers for other measures using this online tool. The underpowering will vary. Below is a screenshot for the inputs for the GH analysis. Follow the steps in the PDF for the rest. (Hint: multiply the sample sizes from the online tool by 14.8.)

    [Screenshot: online power calculator inputs for the GH analysis]

    Now, for the uber-geeks, the content of the PDF differs from my prior version of a few days ago in three ways:

    1. It properly accounts for the fact that we were assuming all vectors were zero mean. That didn’t affect the result, but it does affect how you should simulate the first stage (which we’ve done for you for the Medicaid study in the document).
    2. It references Wooldridge, who obtained the same result. (So, we’re right!)
    3. It includes a complete example from the Medicaid study. However, don’t overlook the fact that this generalizes. Truth be told, I didn’t do all this to comment on the Medicaid study. I need this for my own work.

    I should point out that the finding that the variance of the estimated effect in an RCT scales with the inverse of Np(1-p) is beautiful. It doesn’t scale with just 1/N because the estimate is a difference of means. When p goes to zero, there are no treated subjects. When p goes to 1, there are no controls. Either way, the variance of the estimated effect has to go to infinity. And, indeed, it does. This is comforting intuition.
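    For completeness, here is the one-line version of that scaling, sketched under the assumptions that the total sample is N, a fraction p is treated, the outcome variance is σ² in both arms, and the effect is estimated as the difference in arm means:

    $$\operatorname{Var}(\hat{\beta}) = \frac{\sigma^2}{Np} + \frac{\sigma^2}{N(1-p)} = \frac{\sigma^2\,[(1-p)+p]}{Np(1-p)} = \frac{\sigma^2}{Np(1-p)}.$$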

    Finally, I’m grateful for the awesome feedback I’ve received from readers. Once again, the TIE community has hit this one out of the park. Thank you.

    @afrakt

     
  • For economist/biostats geeks – ctd.

    I thank the readers who suggested publications that might include the concept I blogged about a few days ago, the extent to which IV estimation diminishes precision (power) relative to a straight-up RCT estimate.

    Though it is possible some of that literature implied the result Steve had obtained, most of it seemed either not quite on target or more complex than necessary. So, I took the trouble of writing up the proof more carefully (PDF) and confirming it for myself, subsequently refining it (see the update in a later post).

    I still think this must be a known result, and if you are aware of a publication that includes a proof in a simple form, please let me know.

    @afrakt

     
  • For economist/biostats geeks only (a bleg)

    If you’re not into instrumental variables (IV) econometrics and/or power calculations don’t bother reading this post. I’m not even going to try to make it widely accessible. But if you are an econ/biostats type, I have a question for you. I want to know if you have seen anything like the following in any paper or book. I am looking for a supporting reference, if it’s out there.

    One thing that came up in the power calculation discussions I’ve been hosting here lately is that instrumenting for the treatment indicator reduces power. It’s convenient to do a power calculation ignoring that fact, pretending as if you are running a randomized controlled trial. But if you’re really running an IV, how much more sample do you need to achieve adequate power? Or, put another way, how much does the IV sap your power?

    Every time I’ve seen this question raised, the next thing I see is that it’s too complicated to figure out. Except that it turns out that, in the linear case at least, it really isn’t. My colleague Steve Pizer did the math and got a nifty little result. To convey it, it’s simplest to consider a two-stage least squares (2SLS) set up with no controls, like this:

    X = Zγ + ν   (first stage)

    Y = Xβ + ε   (structural equation)

    where X is the vector of treatment indicators, Z is the vector of instruments, Y is the vector of outcomes, and all the usual assumptions for 2SLS apply. (To consider a case for which there are additional control variables, first regress the treatment, instrument, and outcome variables on them and compute the residuals. Then use the above on these residual versions. The result below still works out, with one change.)

    Assume you have done a power calculation that suggests you need N observations in the treatment group* to obtain a sufficiently powered estimate of the effect of treatment X on outcome Y, pretending it’s a randomized trial (no IV). Steve showed that the IV setup requires N/R² observations, where R² is the “R-squared” of the first-stage shown above. That is, the less predictive power the first stage has (the lower its R²) the more observations you need, which is intuitive. Also, if the instrument is the treatment indicator (a limiting case), R² is obviously 1, and you get back the result that you need N observations for sufficient power. Finally, if your instrument has no predictive power, R² is 0, and you need infinite observations, which is sensible. (In the case for which you did this in the residual space to handle additional controls, the number of observations required is X’X/R². It just turns out that X’X = N when X is a vector of treatment indicators.)
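    To make the rule concrete, here is a tiny Stata illustration with made-up numbers (they are not from any study):

    * Hypothetical illustration of the N/R-squared rule (made-up numbers).
    local N_rct = 1000   /* treatment-group size needed under a plain RCT power calculation */
    local R2 = 0.25      /* first-stage R-squared of the instrument */
    display "treatment-group size needed under IV: " `N_rct'/`R2'

    With a first stage explaining a quarter of the variance in treatment, you need four times the sample; with a perfect first stage (R² = 1), you get the RCT answer back.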

    This is such a simple, appealing result that someone else must have written it down in some book or paper. My question for you is, who and where?

    Steve’s derivation is below. I can’t be bothered to type up all the equations because it’s a pain. I apologize for his handwriting, though he may not.

    * Go ahead and assume N observations in the control group too, though I think this all works just fine if the power calculation is done such that the control group size is some specified proportion of the treatment group size, like rN for some scalar r > 0.

    [Image: Steve’s handwritten derivation of the IV power result]

    @afrakt

     
  • What about power for the blood pressure result? (And so much more)

    A few commenters have questioned my power calculation on the Oregon Medicaid study, claiming different results. Though I can’t be sure what they are doing wrong (if anything), I did take the time to do several more checks of my calculation. These are in the technical footnote to this post.* Even though it’s weedy, if you’ve followed this story this far, you might want to look. It shows how you can do power calculations at home, with no money down! Meanwhile, the offer stands: if you find an error in my work, please let me know, but read the footnote first.

    The question has been raised about how the study’s blood pressure findings compare to that of the RAND Health Insurance Experiment. (Harold also discussed this.) First, let’s deal with power. The baseline rate of elevated blood pressure in the Oregon study was 16.3% and the point estimate of the effect of Medicaid was a reduction of 1.33 percentage points. These are both bigger than the blood sugar (glycated hemoglobin, GH, A1C) results, which was the focus of my power calculation. So, maybe the blood pressure analysis was sufficiently powered. We have a calculator. Let’s find out!* (Of course, the 95% confidence intervals give us an answer, but how underpowered is it?)

    No, the blood pressure analysis was no more adequately powered than the blood sugar one. Even though the baseline rate is a lot higher, the hypothesized effect size isn’t. However, the study did have 85% power (0.85) to find a reduction in the proportion of the population with high blood pressure of 3 percentage points (more than twice the point estimate effect size). See, power depends on what question you’re asking.
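    For those following along with sampsi, that 85% figure can be approximated with the same kind of call as before, using the elevated blood pressure baseline rate and a 3 percentage point reduction. This sketch assumes the same approximate group sizes as my GH calculation (about 6,000 controls and 1,500 Medicaid enrollees) and, like that calculation, does not account for the IV design:

    local p1 = 0.163                /* baseline elevated blood pressure rate, from the paper */
    local p2 = `p1' - 0.03          /* hypothesized 3 percentage point reduction */
    local sd1 = sqrt(`p1'*(1-`p1'))
    local sd2 = sqrt(`p2'*(1-`p2'))
    sampsi `p1' `p2', sd1(`sd1') sd2(`sd2') n1(6000) n2(1500)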

    I’m told, but have not independently verified, that the RAND HIE did find statistically significant results on blood pressure. That study had a sample size of 7,700 across four levels of cost sharing and followed participants for 3-5 years. The design and analytic approach were different than the Oregon Medicaid study, which could explain a difference in statistical significance. Also, RAND’s effect size was larger.

    About this, Kate Baicker, the lead author of the Oregon Medicaid paper, wrote me,

    The confidence intervals of our estimates of the impact of Medicaid tell us what effect sizes we have the power to reject. This can be read off of our reported confidence intervals. Consider, for example, the case of blood pressure. Table 2 indicates that over 16 percent of our control group has elevated blood pressure. For diastolic blood pressure, we see in Table 2 that the lower end of our 95 percent confidence interval is -2.65 mm Hg. This means that we can reject a decline in diastolic blood pressure of more than 2.65 with 95 percent confidence.

    For context, it is instructive to compare what we can reject to prior estimates of the impact of health insurance on blood pressure. In particular, the RAND Health Insurance Experiment – which varied only the generosity of insurance coverage among the insured and not whether enrollees had insurance at all, as in the Oregon Health Insurance Experiment – found a reduction of 3 mm Hg in diastolic blood pressure among low-income enrollees. Quasi-experimental studies (previously published in NEJM) of the one-year impact of the loss of Medicaid (Medi-Cal) coverage among low-income adults found changes in diastolic blood pressure of 6 – 9 mm Hg (Lurie et al. 1984, 1986). The estimates in Table 2 allow us to reject that Medicaid causes a decline in diastolic blood pressure of the magnitude of the effects found in these prior studies. (These RAND and Medi-Cal estimates are based on a sub-population in disproportionately poor health, so one might instead compare their estimates to our estimates in our Appendix Table S14c showing the impact of Medicaid on diastolic blood pressure among those diagnosed with hypertension prior to the lottery. For this group we can reject a decline in diastolic blood pressure of more than 3.2 mm Hg with 95% confidence).

    I don’t know what else I can say about all this. If you want to know whether the study could reject, with 95% confidence, the possibility that Medicaid had no effect on the physical health measures examined at two years of follow-up, the answer is “no.” At the same time, the sample size was too small for the study to have been able to do that for all but very large effects. That’s just a mathematical fact. For effect sizes one might reasonably consider appropriate (and that are certainly clinically meaningful), the study would have had to have been several multiples larger (a factor of five is what I get). Again, that’s just math.

    Please stay for the technical footnote:

    * TECHNICAL FOOTNOTE: In contrast to what most people may think, I largely post on TIE to further my own knowledge and understanding, not to convince anyone of anything. So, if anyone finds errors in what I’ve written, I’m happy for the correction. But, I also recognize that I’m posting for a wide audience, and so I worry about the validity of the content of my posts long after they’re public. I continued to worry about my sample size calculation yesterday and this morning.

    To increase confidence that I had not made a grave error, I did my sample size calculation two additional, independent ways. First, it turns out Stata’s sampsi can be used many ways to do the same thing. Some ways require less input than others, which is safer, since it is always possible to misunderstand the proper form of the input. Nevertheless, no matter how I used sampsi, I got the same answer, which is comforting.
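    One lower-input way, for example, is to let sampsi treat the two numbers as proportions directly, supplying no standard deviations; a sketch:

    * Two-sample comparison of proportions; sampsi computes the variances itself.
    sampsi 0.051 0.0417, n1(6000) n2(1500)

    The answer should come out very close to the 0.35 figure from my earlier post.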

    Second, I used an online sample size calculator for the difference in proportions. I used the one here, but if you Google around, you’ll find others. Again, I got the same result as with sampsi. I encourage you to try it yourself. Below is a screenshot of the inputs and outputs for the calculation in my post. The only thing I didn’t mention in my post is what alpha is. It’s the probability of rejecting the null hypothesis (that Medicaid had no effect) under the assumption that it is true, the type 1 error rate. It is the threshold against which a p-value is compared; typically one sets it at 0.05. (Super geeky aside: “power” is not the same thing as alpha. The former is the probability of rejecting the null when it is false, the latter of rejecting it when it is true.)

    [Screenshot: sample size calculator inputs and outputs for the calculation in my post]

    @afrakt

     
  • Power calculation FAQ

    Answering some questions about power calculations:

    1) What is a power calculation?

    It’s a calculation that tells you, given sample sizes, assumed baseline risk, and treatment effect size, with what probability you can reject the hypothesis that the treatment had no effect (the “null hypothesis”). In the context of my post, the sample sizes are the numbers of people in the control group (~6000) and the number who received Medicaid (~1500). The baseline risk of elevated GH was 5.1% and the treatment effect size was a reduction in that risk of 0.93 of a percentage point.

    2) What do you use a power calculation for?

    One thing I use it for, as do others, is to estimate how big a sample one would need to achieve 95% probability of rejecting the null hypothesis. That’s how I used it in my post.

    3) Yeah, but what does that mean?

    How big the study has to be so that, assuming the hypothesized effect is real, the 95% confidence interval on the estimate doesn’t overlap zero. If it overlaps zero then one cannot, conventionally, say that the result is statistically significantly different from no effect.

    4) But the results in the Medicaid study on physical health did overlap zero. So, huh?

    Right. The point of my post was, how big would the study have had to have been so that didn’t happen, assuming the baseline risk and average effect size the paper reported? In a variation, I considered how big the baseline risk would have had to have been holding the sample sizes constant.

    5) Oh, so you’re saying that if they had found physical health effects to be statistically significant you’d be OK with that, but since they didn’t you’re saying that the study is flawed. How is that fair?

    No. The study is not flawed. It just has limitations, as do all studies. I am saying that when a result is statistically significant, it means that the sample size was adequate. There’s no need to do a power calculation in that case. When it isn’t, it’s possible the sample size was not adequate. One has to do a power calculation to check. One has to answer the question, could this study have ever found a statistically significant result for a reasonably sized effect? If the answer is “no” then using the study to show the intervention has no effect is not very persuasive.

    6) Why don’t you do power calculations for other studies?

    You can search the blog, but I make a habit of focusing only on statistically significant findings, which means power is sufficient. I don’t generally take the time to examine statistically insignificant ones. The issue of power does cross my mind from time to time. When I think a study is underpowered (too small a sample size) I generally don’t discuss it at all.

    7) Why are you talking about it now, then?

    Because this is a hugely important study and tons of other people are talking about it. Moreover, they’re talking about the physical health findings as if the study had enough sample to detect reasonably sized, clinically significant differences. Unless I made an error, it doesn’t seem to me that it did. My interest in properly interpreting the study is why I did the power calculation publicly. I shared it with the authors too.

    8) So the whole study is invalid! Why did you hype it a year ago?

    The study is not invalid. It doesn’t have enough sample to answer some questions, and it does have enough sample to answer others. Remember, a power calculation pertains to a specific analysis, to one measure. It depends on the baseline rate and treatment effect size. Those are different for each measure examined.

    9) But you only looked at elevated GH. What about the other measures?

    See my next post. Spoiler alert: The conclusion is the same. The sample was too small.

    10) This is all very nice, but I’m not convinced you aren’t just trying to explain away a result you don’t like.

    If I can’t convince you with math, I doubt I can with words. Do I dislike the results? I confess, I don’t feel very emotional about them. Do you?

    @afrakt

     
  • Power calculations for the Oregon Medicaid study

    A follow-up to this post is here. It includes instructions on how to run your own power calculations. 

    Kevin Drum:

    Let’s do the math. In the Oregon study, 5.1 percent of the people in the control group had elevated GH [glycated hemoglobin, aka A1C, or colloquially, blood sugar] levels. Now let’s take a look at the treatment group. It started out with about 6,000 people who were offered Medicaid. Of that, 1,500 actually signed up. If you figure that 5.1 percent of them started out with elevated GH levels, that’s about 80 people. A 20 percent reduction would be 16 people.

    So here’s the question: if the researchers ended up finding the result they hoped for (i.e., a reduction of 16 people with elevated GH levels), is there any chance that this result would be statistically significant? […] The answer is almost certainly no. It’s just too small a number.

    I plugged these numbers into Stata’s sample size calculation program (sampsi) to do a power calculation for the difference between two proportions. I found the probability that we can reject the null hypothesis that Medicaid has no effect on GH levels to be 0.35. In other words, even if Medicaid truly had an effect of this size, the study had only about a one-in-three chance of producing a statistically significant result, and, indeed, the null was not rejected. We knew this from the paper, and, hence, all the hubbub. (Never mind that we also cannot reject a much larger effect. The authors cover this in their discussion.)

    For this exercise, I targeted rejecting the null with 0.95 probability (that is, 95% power) at the conventional 0.05 significance level. Assuming the same baseline 5.1% elevated GH rate and a 20% reduction under Medicaid, what sample size would we need to achieve that? Plugging and chugging, I get about 30,000 for the control group and a 7,500-person treatment (Medicaid) group. (I’ve fixed the Medicaid take-up rate at 25%, as found in the study.) This is a factor of five bigger than the researchers had.
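    A sketch of that variation of the calculation (solving for sample sizes at 95% power and a fixed treatment-to-control ratio, rather than computing power at fixed sample sizes) looks roughly like this:

    local r = 0.0093
    local p1 = 0.051
    local p2 = `p1' - `r'
    local sd1 = sqrt(`p1'*(1-`p1'))
    local sd2 = sqrt(`p2'*(1-`p2'))
    sampsi `p1' `p2', sd1(`sd1') sd2(`sd2') power(0.95) ratio(0.25)

    The reported n1 and n2 come out in the neighborhood of the 30,000 and 7,500 figures above.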

    Now, caveats:

    • I’m taking the baseline rate, 5.1% from the study itself. But we know it is estimated with some imprecision. Maybe a different, more reliable rate is available elsewhere, but I don’t know what it is. Suffice it to say, it would take a lot of error on this number to overcome a factor of five in sample size. Assuming, again, a 20% reduction due to the intervention, I calculated that if the baseline rate were about four times bigger (e.g., about 20% instead of 5.1%), then the sample in the paper would have been sufficient to reject the null at the 95% level.
    • The analysis in the study is not as simple as a straight comparison of two proportions. There is some multivariate adjustment for observable factors. There are some tweaks due to measurement of multiple outcomes and weighting for survey design. It’s also an IV analysis, which retains in the sample patients who were randomized to treatment (won the lottery) but didn’t enroll in Medicaid. (This actually decreases power.) For these reasons, my power calculation is not fully correct. But, still, it would take a lot to overcome a factor of five in sample size.
    • It is always possible I’ve made an error. If so, I’m happy for someone to correct it. Below is my code and output, using sampsi. Anyone think I did something wrong?

    Code

    local r = 0.0093 /* absolute change in rate due to intervention, from the paper */
    local p1 = 0.051 /* baseline rate, from the paper */
    local p2 = `p1' - `r' /* treatment group rate */
    local sd1 = sqrt(`p1'*(1-`p1')) /* standard deviation of baseline rate */
    local sd2 = sqrt(`p2'*(1-`p2')) /* standard deviation of treatment group rate */
    local n1 = 6000 /* control group size, approx from the paper */
    local n2 = int(0.25*`n1') /* treatment group size, approx from the paper */
    sampsi `p1' `p2', sd1(`sd1') sd2(`sd2') n1(`n1') n2(`n2')

    Output

    Estimated power for two-sample comparison of means

    Test Ho: m1 = m2, where m1 is the mean in population 1 and m2 is the mean in population 2

    Assumptions:
        alpha = 0.0500 (two-sided)
        m1 = .051
        m2 = .0417
        sd1 = .219998
        sd2 = .199903
        sample size n1 = 6000
        n2 = 1500
        n2/n1 = 0.25

    Estimated power:
        power = 0.3517

    UPDATE: Mentioned the power-reducing effect of IV.

    @afrakt

     