• Power calculations for the Oregon Medicaid study

    A follow-up to this post is here. It includes instructions on how to run your own power calculations. 

    Kevin Drum:

    Let’s do the math. In the Oregon study, 5.1 percent of the people in the control group had elevated GH [glycated hemoglobin, aka A1C, or colloquially, blood sugar] levels. Now let’s take a look at the treatment group. It started out with about 6,000 people who were offered Medicaid. Of that, 1,500 actually signed up. If you figure that 5.1 percent of them started out with elevated GH levels, that’s about 80 people. A 20 percent reduction would be 16 people.

    So here’s the question: if the researchers ended up finding the result they hoped for (i.e., a reduction of 16 people with elevated GH levels), is there any chance that this result would be statistically significant? […] The answer is almost certainly no. It’s just too small a number.

    I plugged these numbers into Stata’s sample size calculation program (sampsi) to do a power calculation for the difference between two proportions. I found that the probability of rejecting the null hypothesis that Medicaid has no effect on GH levels (that is, the power of the test) is 0.35. With power that low, a true effect of this size would usually fail to reach statistical significance, and indeed the null cannot be rejected. We knew this from the paper, and, hence, all the hubbub. (Never mind that we also cannot reject a much larger effect. The authors cover this in their discussion.)

    Suppose we instead want to reject the null with 0.95 probability, that is, 95% power when testing at the conventional two-sided 5% significance level. Assuming the same baseline 5.1% elevated GH rate and a 20% reduction under Medicaid, what sample size would we need to reach that power? Plugging and chugging, I get about 30,000 for the control group and 7,500 for the treatment (Medicaid) group. (I’ve fixed the Medicaid take-up rate at 25%, as found in the study.) This is a factor of five bigger than the researchers had.
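    For readers without Stata, the sample-size arithmetic can be sketched in Python. This is a minimal sketch under my own assumptions: a two-sided z-test at alpha = 0.05, 95% power, binomial variances p(1-p), and the absolute change of 0.0093 used in the code further down. The function name `required_n1` is hypothetical, and the formula is the textbook normal approximation, not necessarily the exact routine sampsi runs.

```python
from math import ceil
from statistics import NormalDist


def required_n1(p1, p2, ratio, alpha=0.05, power=0.95):
    """Control-group size for a two-sided z-test; treatment size is ratio * n1."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_power = z.inv_cdf(power)           # quantile for the desired power
    var1 = p1 * (1 - p1)                 # binomial variance, control
    var2 = p2 * (1 - p2)                 # binomial variance, treatment
    return (z_alpha + z_power) ** 2 * (var1 + var2 / ratio) / (p1 - p2) ** 2


p1 = 0.051            # baseline rate, from the post
p2 = p1 - 0.0093      # treatment rate, from the post
n1 = ceil(required_n1(p1, p2, ratio=0.25))
n2 = ceil(0.25 * n1)  # 25% take-up, as in the post
print(n1, n2, round(n1 / 6000, 1))
```

    Under these assumptions the control group comes out a little above 31,000 and the treatment group a little under 8,000, roughly five times the 6,000 and 1,500 available.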

    Now, caveats:

    • I’m taking the baseline rate, 5.1%, from the study itself. But we know it is estimated with some imprecision. Maybe a different, more reliable rate is available elsewhere, but I don’t know what it is. Suffice it to say, it would take a lot of error on this number to overcome a factor of five in sample size. Assuming, again, a 20% reduction due to the intervention, I calculated that if the baseline rate were about four times bigger (e.g., about 20% instead of 5.1%), then the sample in the paper would have been sufficient to reject the null with 95% probability.
    • The analysis in the study is not as simple as a straight comparison of two proportions. There is some multivariate adjustment for observable factors. There are some tweaks due to measurement of multiple outcomes and weighting for survey design. It’s also an IV analysis, which retains in the sample patients who were randomized to treatment (won the lottery) but didn’t enroll in Medicaid. (This actually decreases power.) For these reasons, my power calculation is not fully correct. But, still, it would take a lot to overcome a factor of five in sample size.
    • It is always possible I’ve made an error. If so, I’m happy for someone to correct it. Below is my code and output, using sampsi. Anyone think I did something wrong?


    local r = 0.0093 /* absolute change in rate due to intervention, from the paper */
    local p1 = 0.051 /* baseline rate, from the paper */
    local p2 = `p1' - `r' /* treatment group rate */
    local sd1 = sqrt(`p1'*(1-`p1')) /* standard deviation of baseline rate */
    local sd2 = sqrt(`p2'*(1-`p2')) /* standard deviation of treatment group rate */
    local n1 = 6000 /* control group size, approx from the paper */
    local n2 = int(0.25*`n1') /* treatment group size, approx from the paper */

    sampsi `p1' `p2', sd1(`sd1') sd2(`sd2') n1(`n1') n2(`n2')


    Estimated power for two-sample comparison of means

    Test Ho: m1 = m2, where m1 is the mean in population 1
                        and m2 is the mean in population 2

             alpha =   0.0500  (two-sided)
                m1 =     .051
                m2 =    .0417
               sd1 =  .219998
               sd2 =  .199903
    sample size n1 =     6000
                n2 =     1500
             n2/n1 =     0.25

    Estimated power:

             power =   0.3517
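
    As a cross-check on the output above, here is a short Python sketch using only the standard library. It assumes the usual normal-approximation power formula for a two-sided two-sample test with known standard deviations, counting both rejection tails; I am assuming, not asserting, that this is close to what sampsi does internally. The function name `two_sample_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(m1, m2, sd1, sd2, n1, n2, alpha=0.05):
    """Two-sided power for a difference in means, normal approximation."""
    z = NormalDist()
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # standard error of the difference
    shift = abs(m1 - m2) / se                  # true difference in SE units
    crit = z.inv_cdf(1 - alpha / 2)            # two-sided critical value
    # Probability of landing in either rejection region
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


p1 = 0.051                  # baseline rate
p2 = p1 - 0.0093            # treatment group rate
sd1 = sqrt(p1 * (1 - p1))   # same inputs as the Stata locals
sd2 = sqrt(p2 * (1 - p2))
power = two_sample_power(p1, p2, sd1, sd2, n1=6000, n2=1500)
print(f"power = {power:.4f}")
```

    With the same inputs this lands at about 0.35, in line with the sampsi figure.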

    UPDATE: Mentioned the power-reducing effect of IV.


    • Thanks for the calculation. Hopefully, the authors will address the lack of statistical power in follow-up comments in the NEJM.

    • The question I haven’t seen answered is what effect size would we need to call the result medically significant. Is there a threshold at which we can say that the health benefits of Medicaid outweigh the costs? It would be much more interesting to do the power calculation around that threshold, because ultimately what policy makers need is enough observations such that a statistically insignificant result would rule out the possibility that the effect size is medically significant.

      • Reducing elevated GH levels is clinically significant for any individual. Therefore, if Medicaid had any effect in this regard on some proportion of individuals, that is significant. The question is, then, what proportion is expected ex ante? This isn’t a question of clinical significance but of what to expect from an intervention of providing access to insurance.

        What is clear is that this study wasn’t even designed to detect a 20% decrease in elevated GH levels, let alone 10%, 5%, or 1%.

    • Another way of writing what you did would be that they had 83% power to find a decline from 5.1% to 3.5% of patients with A1C > 6.5%. That would be an impressively huge decline, but even so I see two major problems.

      First, I doubt it works this way for instrumental variables analyses. What you describe here looks like a per-protocol analysis, where the people who did not accept Medicaid are omitted. In IV they’re kept in and function as noise, making the power worse.

      Also, this ignores that their analysis doesn’t even fit basic current guidelines. A1c > 6.5 is permitted (though not universally used) to diagnose diabetes. It is not standard for treatment, though. Most guidelines recommend < 7, VA quality measures say 3/4. These would be amazing effect sizes for this type of intervention.

      • Good point on IV. I should add that. But since it only makes matters worse, the conclusion remains the same.

        I appreciate that 6.5 may not be the right threshold. Moving to 7 would decrease sample, again, making matters worse.

        I don’t understand what “VA quality measures say 3/4” means. 3/4 of what?
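
    The IV point in this thread can be illustrated with a rough Python sketch. Everything here beyond the 5.1% baseline, the 0.0093 effect, and the 25% take-up is my assumption: I treat the power of the IV estimate as roughly that of the intent-to-treat comparison, assume no one in the control group enrolls, and pick a hypothetical 6,000 lottery winners. Keeping non-enrollees in the treated arm dilutes the average effect to take-up times the effect on enrollees.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(p1, p2, n1, n2, alpha=0.05):
    """Two-sided power for a difference in proportions (normal approximation)."""
    z = NormalDist()
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p1 - p2) / se
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


baseline = 0.051              # from the post
effect_on_enrollees = 0.0093  # from the post
take_up = 0.25                # from the post
n_control = 6000              # from the post
n_winners = 6000              # hypothetical number of lottery winners

# Comparison that keeps only the enrollees (the post's calculation)
enrollees_only = power_two_groups(
    baseline, baseline - effect_on_enrollees, n_control, int(take_up * n_winners)
)
# Comparison that keeps every lottery winner, with the effect diluted by take-up
all_winners = power_two_groups(
    baseline, baseline - take_up * effect_on_enrollees, n_control, n_winners
)
print(f"enrollees only: {enrollees_only:.2f}, all winners: {all_winners:.2f}")
```

    Under these assumptions power falls from about 0.35 to under 0.10, which is the direction the commenter describes.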

    • I wish the authors had put more thought into what data to present for their first publication of this study. I doubt they intended this to be the definitive analysis of the value of having Medicaid and almost certainly didn’t know that it would make such a splash in the mainstream. This is a very data rich database that will serve for years of study and analysis. Lab results comparisons are fairly easy and straightforward, and that’s probably why they chose these two elements of health indicator. I think it was a mistake. Given the wide interest in this data, I wish they had kept to the theme of general health and left out the specialized indicators for later.

      When patients have an initial high glucose level, it often takes several years to get it under control. There are more specialized tests before a dx of diabetes is made. Then the first line is diet and exercise and self-monitoring, see how they do there. Then there’s the oral meds, more self-monitoring. It can often be several years before they determine that there should be any more serious treatment to manage the condition. So yeah, in two years you want to see a reduction in glucose levels. But that effect would be highly diluted in a dataset with lab results of mostly normal people. To show any “value” of having health insurance through reduction in A1C levels across the entire sample? It’s not the first analysis I would present.

      And now we have non-scientists with political agendas, all of a sudden armchair epidemiologists, waving this thing around with their seat-of-the-pants determinations of the value of Medicaid based on — kind of — unwise initial presentation of a wonderful dataset. Sigh.

        • Ah! My bad. I was unaware of prior pubs; thanks for pointing them out.

          They could have done a matched case-control study, for example, or looked at glucose levels over time in the cohort, which probably would have solved their power problem and shown some results that were much more informative and interesting. They could still do that, and probably should, to rectify all the misinformation floating around the political/policy circles. (That is, if one *really* wanted to demonstrate the effect of health insurance on diabetes management. It’s kind of an obscure point. The positive effects of *any* kind of competent diabetes management on the patient’s health are well known, well established, and well quantified. The problem is when you can’t get it, or won’t follow it.)

          I’m sure they didn’t anticipate the timing and political reax, and rushed out some mundane analysis under pressure of publishing and/or presentation. We’ve all done it but it’s a crying shame that it was this, this one, that carries such sweeping import and political risk.

    • Also bear in mind that the authors simultaneously examined about 20 factors, including mostly irrelevant things like mean BP in a mostly healthy population. Adjusting for repeated testing only makes the power issues worse.

      From a study design perspective, the authors might be better served with a more focused analysis on the three process measures, particularly on the treatment of people with high BP, cholesterol, and poorly controlled glucose. A previous commentator suggested stratifying the levels and doing the analysis that way.
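
    To put a number on the multiple-testing point, here is a hedged Python sketch. It assumes a Bonferroni correction across 20 outcomes, which is my choice for illustration and not necessarily the adjustment the authors used, and it reuses the post's inputs for the elevated GH comparison.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(p1, p2, n1, n2, alpha):
    """Two-sided power for a difference in proportions (normal approximation)."""
    z = NormalDist()
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p1 - p2) / se
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


p1, p2 = 0.051, 0.051 - 0.0093   # rates from the post
n1, n2 = 6000, 1500              # group sizes from the post
n_outcomes = 20                  # "about 20 factors", per the comment

unadjusted = power_two_groups(p1, p2, n1, n2, alpha=0.05)
bonferroni = power_two_groups(p1, p2, n1, n2, alpha=0.05 / n_outcomes)
print(f"unadjusted: {unadjusted:.2f}, Bonferroni: {bonferroni:.2f}")
```

    Tightening the per-test threshold from 0.05 to 0.0025 cuts the power from about 0.35 to well under 0.10 in this sketch.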

    • Any way you slice it, the #’s aren’t good. Best case scenario – Medicaid had *some* small, incremental improvement. Worst case – none.

      The larger point here isn’t the effect of Medicaid on health – it’s the effect of our healthcare *system* on health. Medicaid simply neutralized cost – and highlights the limited effect our healthcare system has on chronic conditions. We know that – not sure we need a scientific study to prove it – but at least now we have one.

      Zooming out from there – it’s a system that’s been built on/around infectious diseases – not chronic ones. The health conditions being tested in the Oregon study were ALL chronic conditions = total mismatch.

      The more interesting study is the one for Medicare by Health Care Partners (HCP) in Doylestown, PA. The results are *breathtaking!*

      The test group had to have at least one chronic condition – and they had to have had one hospitalization in the last year. That’s the cohort you want – because they are chronic conditions.

      The results? HCP was able to lower hospitalizations by 33% – and cost to Medicare by 22%! Read more about it here:


    • The power (sensitivity) analysis is somewhat misguided IMO for the following reason. It presumes that treatment has some positive effect. Moreover, it presumes that treatment effectiveness is linear: the more treatment you receive, the better your condition should become. The data probably don’t exist in a unified, accessible format, but wouldn’t it be much more enlightening to measure clinical effectiveness between the Medicaid treatment cohort and a demographically matched cohort of “regular” patients? Said another way: the effectiveness of results needs to be normalized against a similar treatment group, not against a “no treatment” group. If, in aggregate, we can only expect 50% of a given cohort to respond within the study period, we need to correspondingly reduce the expectations for results in the study cohort. I don’t see this study taking proper account of that factor.

    • Not bad research. Here are some thoughts I found on the null hypothesis.

    • You keep harping on the diabetes result, yet lack of statistical power surely cannot explain the failure to find a significant result for high blood pressure, especially in light of the RAND HIE findings, which I explain here: http://www.forbes.com/sites/chrisconover/2013/05/07/does-the-oregon-health-study-show-that-people-are-better-off-with-only-catastrophic-coverage/

      For low income non-elderly adults, the OHS had a far bigger sample than the HIE, yet the HIE was able to demonstrate a statistically significant improvement in HBP in the free care group compared to those in cost-sharing plans. The OHS, with a much bigger sample, failed to demonstrate Medicaid has an impact on HBP compared to people with no health insurance whatsoever!

      • Perhaps someone can find an error with my power calculation or why the RAND study would have a different one, but the BP result is captured by my example of increasing the baseline rate by a factor of four, to 20%. The elevated BP rate is 16%, which is below 20%, so is not high enough for sufficient power, assuming an intervention effect on 20% of the population.

        How big are the samples in the BP analysis on the RAND study? What is the underlying rate of high blood pressure? It’s possible this paradigm doesn’t apply to RAND since it had >2 arms and pooled analysis across all of them. Not sure though.

      • A quick look back at my RAND posts reminded me that the HIE included 7,700 people over 3-5 years. I’d have to be pointed to the BP analysis specifically to go the next step. Know where it is?
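
    The baseline-rate argument in this exchange can be checked with a short Python sketch. It assumes the same normal approximation as the post's calculation, a 20% relative reduction, and the post's group sizes of 6,000 and 1,500; the 16% figure is the elevated BP rate quoted in the reply above.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(p1, p2, n1, n2, alpha=0.05):
    """Two-sided power for a difference in proportions (normal approximation)."""
    z = NormalDist()
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p1 - p2) / se
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


relative_reduction = 0.20   # assumed effect of the intervention
powers = {}
for baseline in (0.051, 0.16, 0.20):   # GH rate, BP rate, hypothetical 20%
    treated = baseline * (1 - relative_reduction)
    powers[baseline] = power_two_groups(baseline, treated, 6000, 1500)
    print(f"baseline {baseline:.3f}: power {powers[baseline]:.2f}")
```

    In this sketch a 16% baseline gives power of about 0.90, short of 0.95, while a 20% baseline just clears it.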

    • At the risk of asking a foolish question/making a foolish comment, when I go to the stattrek.com website you linked to and look at the eqns they use something doesn’t look right.

      To a reasonable approximation, i.e., presuming a binomial distribution, sigma(p) = sqrt(p*(1-p)/N) where N is the number of trials. The mean number of events is n=pN. Unless I’m missing something obvious – a distinct possibility – it appears that stattrek has sigma(p) = sqrt(p*(1-p)/n) = sqrt((1-p)/N) which, aside from being incorrect, is a whole lot larger than sqrt(p*(1-p)/N) for the sample sizes you’re dealing with.

      Going back to the binomial approximation…

      If p1=0.051 and N1=6000 then n1=306, sigma(n1)=17 and sigma(p1)=.0028

      If p2=0.051-.0093=0.0417 and N2=1500 then n2=63, sigma(n2)=8 and sigma(p2)=0.0053

      Figuring those numbers are reasonably accurate, how well separated are p1 and p2?

      delta(p)/sqrt(rss devs) = 0.0093/sqrt(.0028^2 + 0.0053^2) = 1.5

      Without going to a lookup table and finding the exact answer, that the estimated means are separated by 1.5x the rss st.dev. seems like a pretty strong argument for rejecting the null hypothesis.
      What am I missing?
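
    One way to look at the question above is to convert the 1.5 separation into a two-sided p-value. The Python sketch below uses the commenter's rounded standard deviations and assumes a normal approximation; it is an illustration, not a replacement for the post's power calculation.

```python
from math import sqrt
from statistics import NormalDist

sigma_p1 = 0.0028   # commenter's value for the control group
sigma_p2 = 0.0053   # commenter's value for the treatment group
delta = 0.0093      # difference in rates, from the post

z_stat = delta / sqrt(sigma_p1 ** 2 + sigma_p2 ** 2)
p_two_sided = 2 * (1 - NormalDist().cdf(z_stat))
critical = NormalDist().inv_cdf(0.975)   # two-sided 5% threshold, about 1.96

print(f"z = {z_stat:.2f}, two-sided p = {p_two_sided:.2f}, need z > {critical:.2f}")
```

    A separation of about 1.55 standard deviations corresponds to a two-sided p-value near 0.12, which does not clear the conventional 0.05 threshold; that requires a separation of about 1.96.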

    • So, with the caveat that my propensity for error is probably no lower in the early AM than late PM, I plot the probability distribution functions for p1 and p2. Call them pdf1 and pdf2. They overlap a bit. (I’m back to using N1=6000.) I also look at the cumulative distribution functions, cdf1 and cdf2. I then calculate the probability that p1 is greater than p2:

      p(p1>p2) = integral( pdf2(p’) * cdf1(p’) d(p’) )

      cdf1(p2) gives me the fraction of p1 values that are less than p2. I get p(p1>p2)=0.94. Not quite 0.95 but pretty darn high. Based on my calc above I’d be very confident that the treatment reduced the GH levels.

      • The last paragraph got messed up. It should read:

        “cdf1(p2) gives me the fraction of p1 values that are less than p2. I get p(p1 greater than p2) = 0.94. Not quite 0.95 but pretty darn high…”
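
    The integral in this comment can be evaluated numerically. The Python sketch below is a rough illustration under my assumptions: both estimates are treated as normal with the commenter's means and rounded standard deviations, and the integrand uses 1 - cdf1 so that the result is the probability that p1 exceeds p2, which is my reading of what the commenter intended.

```python
from math import sqrt
from statistics import NormalDist

dist1 = NormalDist(mu=0.051, sigma=0.0028)    # estimate of p1 (control)
dist2 = NormalDist(mu=0.0417, sigma=0.0053)   # estimate of p2 (treatment)


def prob_p1_exceeds_p2(steps=20000):
    """Midpoint-rule estimate of the integral of pdf2(x) * P(p1 > x) dx."""
    lo = dist2.mean - 8 * dist2.stdev
    hi = dist2.mean + 8 * dist2.stdev
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += dist2.pdf(x) * (1 - dist1.cdf(x)) * width
    return total


numeric = prob_p1_exceeds_p2()
# Closed form for the difference of two independent normals
closed_form = NormalDist().cdf(
    (dist1.mean - dist2.mean) / sqrt(dist1.stdev ** 2 + dist2.stdev ** 2)
)
print(f"numeric = {numeric:.3f}, closed form = {closed_form:.3f}")
```

    Both routes give about 0.94, matching the commenter's figure. Note that this is the probability that p1 exceeds p2 given these particular estimates, a different quantity from the 0.35 power in the post.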