• More Medicaid study power calculations (our rejected NEJM letter)

    Sam Richardson, Aaron, and Austin submitted a more efficiently worded version of the following as a letter to The New England Journal of Medicine (NEJM). They rejected it on the grounds that our point of view would be adequately represented among the letters accepted for publication. Those letters are not yet published.

    The Oregon Health Insurance Experiment (OHIE), a randomized controlled trial (RCT) of Medicaid, failed to show statistically significant improvements in physical health; some have argued that this rules out the possibility of large effects. However, the results are not as precisely estimated as expected from an RCT of its size (12,229 individuals) because of large crossover between treatment and control groups.

    The Experiment’s low precision is apparent in the wide confidence intervals reported.  For example, the 95% confidence interval around the estimated effect of Medicaid on the probability of elevated blood pressure spans a reduction of 44% to an increase of 28%.

    We simulated the Experiment’s power to detect physical health effects of various sizes and the sample size required to detect those effect sizes with 80% power. As shown in the table below (click to enlarge), the study is severely underpowered to detect clinically meaningful effects of Medicaid on the reported physical health outcomes. For example, it had only 39.3% power to detect a 30% reduction in the proportion of subjects with elevated blood pressure, and would have required 36,100 participants to detect such an effect with 80% power. Moreover, a 30% reduction is substantially larger than could plausibly be expected from providing health insurance.

    OHIE power table

    To estimate the power levels shown in the table, we ran 10,000 simulations of a dataset with 5,406 treatment-group and 4,786 control-group members (the study’s reported effective sample sizes given survey weighting). We took random draws for Medicaid enrollment based on the probabilities reported in the study. We then took random draws for each outcome: probabilities for the non-Medicaid population are given by the control group means from the study, adjusted for the 18.5% crossover of controls into Medicaid; the probability of the outcome for those on Medicaid is X% lower than the probability for those not on Medicaid, where X% is the postulated effect size.

    For each simulated dataset, we regressed the outcome on the indicator for treatment (winning the lottery); the power is the percentage of the 10,000 iterations for which we rejected, at p = 0.05, the hypothesis that winning the lottery had no effect on the outcome. To estimate the total sample size required for 80% power, we conducted a grid search for the smallest sample size that provided an 80% probability of rejecting the null hypothesis, running 1,000 simulations for each candidate sample size. Our required sample sizes account for sampling weights and are therefore comparable to the 12,229 total subjects in the study. We do not account for clustering at the household level, for controls for household size, or for the demographic controls used in the blood pressure analysis.
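    The steps above can be sketched in a few lines. Our analysis was done in Stata; the Python sketch below is only illustrative, and its parameter values are assumptions rather than study figures (the 42.6% treatment-arm enrollment rate is derived from the 18.5% control crossover plus the 24.1 point first stage, and the 16.3% baseline outcome rate is a placeholder, not a specific Table 2 value):

```python
import numpy as np
from scipy import stats

def simulate_power(n_treat=5406, n_control=4786,
                   p_enroll_treat=0.426,    # assumed: 18.5% + 24.1 points
                   p_enroll_control=0.185,  # crossover of controls
                   p_outcome_base=0.163,    # illustrative baseline rate
                   effect=0.30,             # postulated relative reduction
                   n_sims=1000, alpha=0.05, seed=0):
    """Estimated power: share of simulated datasets in which the
    intent-to-treat effect of winning the lottery is significant."""
    rng = np.random.default_rng(seed)
    n = n_treat + n_control
    treat = np.concatenate([np.ones(n_treat), np.zeros(n_control)])
    p_enroll = np.where(treat == 1, p_enroll_treat, p_enroll_control)
    p_on = p_outcome_base * (1 - effect)  # outcome probability on Medicaid
    rejections = 0
    for _ in range(n_sims):
        on_medicaid = rng.random(n) < p_enroll
        p = np.where(on_medicaid, p_on, p_outcome_base)
        y = (rng.random(n) < p).astype(float)
        # OLS of a binary outcome on a binary treatment indicator is
        # equivalent to a pooled two-sample t-test, used here directly
        _, pval = stats.ttest_ind(y[treat == 1], y[treat == 0])
        rejections += pval < alpha
    return rejections / n_sims
```

    With these placeholder inputs the estimate lands in the neighborhood of the 39.3% figure quoted above, though the exact value depends on each outcome’s true parameters.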

    Simulations were validated by comparing a subset of results to results computed analytically based on the 24.1 percentage point increase in Medicaid enrollment among the treatment group. Our simulation Stata code is available for download here. The analytic method is described here.
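    The analytic check amounts to a standard two-proportion power formula, with the postulated Medicaid effect diluted by the 24.1 percentage point first-stage difference in enrollment. A sketch of that calculation, again with an illustrative baseline rate rather than a specific study value:

```python
from math import sqrt
from statistics import NormalDist

def analytic_power(p_control, rel_effect,
                   first_stage=0.241,   # enrollment gain among lottery winners
                   n_treat=5406, n_control=4786, alpha=0.05):
    """Approximate power for the intent-to-treat comparison of a binary
    outcome. The Medicaid effect (rel_effect, e.g. 0.30 for a 30%
    reduction) is diluted by the first-stage enrollment difference."""
    delta = first_stage * p_control * rel_effect  # implied ITT difference
    p_treat = p_control - delta
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_treat * (1 - p_treat) / n_treat)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - delta / se)
```

    For an illustrative 16.3% baseline and a 30% relative reduction, this returns roughly 37% power, close to the simulated 39.3% figure.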

    The Experiment was carefully conducted and provides a wealth of new information about the effects of Medicaid on household finances, mental health, and healthcare utilization. However, it was underpowered to provide much insight into the physical health effects of Medicaid.


    Not included in our letter were the charts at the end of this post that relate effect size to power for all the measures in the study’s Table 2. To help you translate the proportional effect sizes into absolute values, first, here’s Table 2:

    Table 2 from the OHIE

    OHIE table 2

    You can multiply the relative effect sizes in the charts below by the control group mean to convert them to an approximation of the absolute, postulated effect size with Medicaid coverage. The horizontal line at 80% is the conventional cutoff for adequate power.
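    As a quick sketch of that conversion (the control mean here is a placeholder, not a specific Table 2 value):

```python
# Convert a relative effect size to an approximate absolute effect
# by multiplying by the control-group mean (Table 2 supplies the means).
control_mean = 0.163    # illustrative: share of controls with the condition
relative_effect = 0.30  # postulated 30% reduction under Medicaid
absolute_effect = control_mean * relative_effect
print(round(absolute_effect, 4))  # 0.0489, about 4.9 percentage points
```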

    The relative effect sizes in the middle chart below may seem small. But, remember, this is for the entire sample of subjects, most of whom are not candidates for improvement in these measures. They don’t have a blood pressure, cholesterol, or glycated hemoglobin problem. When you adjust effect sizes for the proportion of subjects with such issues and compare those to the literature, you find that the study was underpowered. We’ve already blogged about this here and here. For Framingham risk scores, the literature is uninformative, and we cannot conclude whether the study was adequately powered for those.

    Hopefully you can match up the lines in these charts with Table 2 from the study, above. If you have any questions, raise them in the comments.




    • Bravo, excellent work. I’m curious to see the letters that are getting published.

    • First off, I agree with your criticism of the critics. Second, the majority of the problems I have in this case are with the over-reliance on null hypothesis testing (NHT).

      Increasing the sample size will not change the effect size (clinical significance); it will only change the statistical significance associated with it. For example, you give the 95% CI for the reduction in blood pressure as ranging from 28% to 44%. Meaning the point estimate for reduced elevated blood pressure was 36%. Increasing n as you suggest ONLY changes whether you want to title that as “statistically significant”. Increasing the n only reduces the width of the confidence interval.

      I think it is mildly misleading to use the term “detect” as you have. [i.e. “It would have required 36,100 participants to detect it at 80% power.”] You are seemingly defining detect to mean “finding statistical significance,” but to the lay reader detect seems to imply that you would need 36,100 participants to find a reduction of 30%, not necessarily to find it significant.

      I don’t have access to the original article, but if there was 18.5% crossover, why wasn’t a cross-classified multilevel model used to account for this in the original analysis?

      • You can’t please everyone. Our prior posts were specifically focused on the expected effect sizes (from prior work) and the study’s power with respect to those. Readers asked for the effect size for which the study would have had 80% power. You can read that off the charts in this post. We’re just being thorough and responsive. Take what you like. Ignore the rest.

    • Austin, you are asking why your letter wasn’t published. It is reasonable to use all evidence possible when trying to determine the relevance of a study.

      However, let me quote from The American Statistician, February 2001, vol. 55, no. 1, p. 19, Hoenig and Heisey:

      “Dismayingly, there is a large, current literature that advocates the inappropriate use of post-experiment power calculations as a guide to interpreting tests with statistically nonsignificant results.” …”Researchers need to be made aware of the shortcomings of power calculations as data analytic tools and taught more appropriate methodology.”


      It is well known that statistical power calculations can be valuable in planning an experiment. There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result. Advocates of such post-experiment power calculations claim the calculations should be used to aid in the interpretation of the experimental results. This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic.

      • You’re straining. I did not ask why the letter wasn’t published. They expressed why (see first paragraph in italics).

        In any case, we’re not using power calculations in the way described in that paper. That’s the entire point of the exercise of reviewing the literature to come up with the expected effect sizes, which we did and to which we link. That analysis showed that this study was not powered for effect sizes that were known in advance of the study, let alone for its own point estimates. In this case, one can really say it was underpowered within reasonable expectation. If we had no prior information, one couldn’t draw that conclusion. This is precisely why I didn’t draw one for the Framingham risk scores, for which reliable prior information is not available.

    • I don’t know why you think I am straining. According to the authors of the paper I quoted from, it sounded as if the post-experiment power calculation for the Oregon Study was inappropriate. Though the NEJM publishes a lot of studies that have low power, I have never seen this type of analysis in this type of situation. I may have missed it, or the NEJM might think it an inappropriate use of the power calculation. I couldn’t understand your comment having to do with what you were using the power calculation for. My understanding was that you used the power calculation as one way to explain the results away.

      Maybe the answer to the post-study power calculation according to the NEJM will be demonstrated by the letters to the editor. If another letter uses the power calculation as the primary or sole argument against the conclusion, then I believe the NEJM thought there was at least some merit to the power calculation even if they had some disputes about it. They just didn’t want to publish more than one letter on the subject.

      If no letter uses the power calculation as an argument, or it is just briefly mentioned in a longer letter, then I believe the NEJM likely doesn’t weigh that argument heavily enough to be published. My bet is that this is what will be seen, though it would not bother me in the least if my guess is wrong. My immediate objective is to figure out the validity and the merit of the power calculation.

      • “My understanding was that you used the power calculation as one way to explain the results away.”

        Incorrect. We are not explaining away, just explaining. I do think you’re straining to view our work as anti- or pro-something. It isn’t. We’re just explaining the results more fully and, by the way, in a manner consistent with what the authors did in their discussion for glycated hemoglobin. We’re just doing it in more detail and across all measures.

        We’ve already explained this so many times. I am sorry to say, the conclusion we are beginning to draw is that you are trolling us. Sorry, that’s just how it looks.

        Do you serve on the NEJM editorial board?

        • If I were trolling I would not be providing the citations I do. I didn’t provide an editorial, rather the article of a respected researcher who is an expert in the field. Is he entirely correct? I can’t be sure because his paper is not my area of expertise and I don’t think it is in yours either. However, his comments in the text of the study are very revealing: “Dismayingly, there is a large, current literature that advocates the inappropriate use of post-experiment power calculations as a guide to interpreting tests with statistically nonsignificant results.” …”Researchers need to be made aware of the shortcomings of power calculations as data analytic tools and taught more appropriate methodology.”

          My interpretation of what you said even after rereading your oldest postings was exactly as I stated. If I read it that way then I can only assume others did as well, but others may not be familiar with a lot of the literature out there. We also have to realize that how one says something is sometimes almost as important as what was said. I won’t further debate your words here for they are in black and white for any interested party.

          If I truly misinterpreted what you said in relationship to the researcher, I believe your corrections would have been more direct and you would not have found the need to hold my posting for such a length of time, since they were obviously valid responses though not to your liking. Moreover, none of my postings have been impolite, inflammatory, extraneous, or off topic, the definition of trolling according to Wikipedia.

          Actually, I would have liked for you to have provided a direct answer to the statement made by Hoenig and the impression you left. Maybe you have insight that he was lacking and that would have been a most interesting and academic exchange. Instead you unnecessarily try to justify your position that I accepted at face value when you presented it.

          My relationship to the NEJM editorial board is not a consideration for discussion.

          Do I agree with you? At times, yes. Do I agree with the right, assuming you are more from the left? At times, yes. But most of the time when it comes to healthcare I have significant disagreements with both sides. Since you appear to be more from the left (not talking about work product) than from either the right or the classical Liberal, I’ll present you with a major disagreement I had with the other two. I do not agree with Medicare Advantage and have had some pretty hot disputes with regard to that program. They, however, never delayed a post, nor did they ever question my responsible nature, nor accuse me of trolling.

          I’m sorry that you feel the way you do. When I question someone, I am questioning their logic or their data. If I get a good logical answer I might not agree, but I place that answer in my Funk and Wagnall’s.

          PS Two other postings still have been withheld; Puzzle, The sky didn’t fall…

          • What you’re failing to understand is that we have not just tested power for the study’s point estimates (which has limited value) or for just a range of potential effect sizes (which isn’t by itself very informative), but have also specifically done so for the effect sizes one should have expected prior to the OHIE work. Hoenig and Heisey do not address this specific approach. This is a crucial point and one I already made. I’ll make it again.

            Basing power calculations on effect sizes from prior work is the same thing one would do prior to the study. (If the OHIE investigators took this step, they never published it. In that sense, we are correcting the record.) This approach is not a post-hoc calculation in the sense of the paper you cited. It’s the same as a prospective calculation, or the one that should have been done prospectively. When it shows a power problem, there really is a power problem. Knowledgeable readers were asking for precisely this in the comments, and we did it. It’s all documented with citations to literature and methods on this blog. We even provided our code. If you have a problem with any of that, state your objection precisely.

            Hoenig and Heisey do not refute this. If you read the paper and understood it you’d know that. If you need to rely on experts to explain, well, I just have (again). Do you understand now? If so, then I ask that you acknowledge that the study was, objectively, underpowered when you comment on its lack of significance or how it failed to meet expectations. Without noting the power issue, your words are deceiving. Normally one interprets insignificance and failure to meet expectations as a negative finding, not an inherent inability to detect the expected finding if it exists, which is what lack of power implies.

            You brought what you thought was credible, contrary evidence, and I have now responded to it twice. Instead of asking for clarification, you brushed it aside as an attempt to explain away the results. Or you just ignored it. I am telling you, that is rude. That is trolling. If you want to comment here stop doing that. NEJM is within its rights to select what comments it publishes. We can only go on what they tell us as their reasons. You seem to be reading in vastly more, as if you’re privy to additional information. Are you? You won’t say. Again, that feels like trolling to me. On this blog it is anyway. You’re using speculation instead of evidence, which is what we’re about.

            Anyway, here, we are also within our rights to select what comments to publish. I am telling you, we will not post comments that imply our work is incorrect or not relevant without any credible evidence to substantiate it. Our standards are higher. We will not tolerate rudeness and trolling. And, yes, we are the arbiters of those. Read the comments policy.

            “Two other postings still have been withheld; Puzzle, The sky didn’t fall…”

            I see your comments on the other posts you referenced. What are you talking about?

    • That Hoenig and Heisey didn’t address your specific approach doesn’t make your approach correct, especially based upon how their paper was worded. As you know from my first posts, I did not exclude the power calculation from consideration, but you have not convinced me that it adds anything significantly different than the confidence levels and other metrics already performed. The potential flaws due to size are easily appreciated, but this study complements other studies.

      If you wish maybe you can post an independent paper by a statistician that states what you believe true and that your power calculations add something significant that is not already demonstrated by other metrics such as confidence levels. I would appreciate that and it would certainly help prove your point.

      The researchers had to contend with a random study that existed within the confines of the program. The power calculation to my understanding assists people that are creating a study so that they can judge sizes that would be most appropriate considering the findings would be unknown. That does not mean that the designated sizes are necessary. You seem to have found another use for it.

      I am not saying that you are wrong, but I am not saying that you are right either. I am not an expert. I am not happy with the casual impressions I obtained from others that do not seem to agree with you. Maybe you want to ask Hoenig and Heisey for their opinion and whether they feel the power calculation adds significant weight to the other metrics commonly used. I have already stated I am not an expert on this subject, so my opinion doesn’t really matter. I am more interested in how everything stacks up, and that involves a lot of studies and a lot of different metrics, so the power calculation, no matter its significance, would at best be of limited value for my purposes.

      I regret using the term “explain away” because it has so inflamed you, but I thought too much focus was being placed on the power calculation and the combined studies were being neglected. My interest is not in proving Medicaid hurts people or doesn’t work. It is in creating a proper safety net based upon Medicaid despite its warts. I am not discarding anything that you say and that has been stated in various fashions several times. I just don’t simply accept any opinion rendered as gospel, even my own. Nowhere in my search for truth have I been trolling or rude though you might believe that to be my intent. It isn’t and that should be clear. I cannot help it if my opinion differs from yours.

      As discussed before, you don’t have to post anything I say, but obviously you accept the legitimacy of my position even if you may not agree, and that is why you bother to respond to so many of my posts on various topics. I have noted many postings of others that are both ignorant and nonresponsive to the discussion, but those posts do not seem to be bothered with.

      I do note that my posting (ex: Puzzle) was held for quite a long time and it seems that its posting along with the postings of others occurred when you were ready to write and post a response. I checked for those posts over and over again on the blog itself noting other posts being posted at later and later time intervals while mine remained in limbo.

      I think something of this nature, holding or censoring posts should be dealt with off the blog and privately. Emily is willing to leave whenever you decide that Emily has no place here. That, however, doesn’t solve your problem.

      • Emily,

        I have found your posts both interesting and enlightening – I too have struggled to find a way to interact with Austin and Aaron – not always in a civilized way – almost all my fault. But I HAVE learned a great deal from this blog and think I have a better understanding of the complexity of the challenge we face. Austin and I differ on both what to fund and how to fund it. I would prefer a system that covers health care – both chronic and catastrophic and provides a way to assure that no one faces financial ruin from an accident or illness. I may be wrong, but my impression is that TIE generally supports a broader program that provides third party payment of things that I think should be paid out of pocket.

        I have come to believe that I am better off for engaging than avoiding – my views are not perfect – and have evolved – but the only way they can evolve is to have smart people share their perspective – which happens here pretty much every day.

        So hang in there – engage and know that this is not easy or simple – but important.


        • I could quibble about what TIE generally supports, but I’d get deep into nuance. So, I’m just tossing a mild flag on this point and leaving it at that for now.

          But I really appreciate this comment. I have also evolved through this blog, and by interacting with readers. We all can do so, though some choose not to. The biggest mistake some commenters make is to post with the attitude that this is a place to win arguments. It’s not. It’s a place to learn, and to accept the limits of what is known and knowable. That’s a very uncomfortable place for some people, it seems.

    • Thank you very much LL. Sometimes it is a challenge to get one’s point across and to have it accepted as a positive contribution. I try to look at concepts and the logical extension of those concepts in the real world. If a person is logical then agreement never need occur. The discussion will be that much better.

      I believe that healthcare belongs in the marketplace. We should interfere with the marketplace as little as possible. Isolating the cracks in the marketplace, which is really rectifying existing social problems, should be done as a separate item even if those problems cost money or require subsidies. I also believe that we emphasize healthcare too much, as I consider much of healthcare a luxury item. The trade-offs are devastating.

      IMO somewhere between one-third and one-half of our expenditures are wasted, such that spending could be reduced by that amount without affecting quality. The money not spent could be used elsewhere to save and improve the lives of all our citizens. The top priority of those living below a certain level who are not sick (the vast majority) is not healthcare. They need jobs and to be integrated into society on an equal footing. If that occurs they will be able to afford their own healthcare. Too much social intervention can do more harm than good, and it can dilute the intervention necessary for those truly in need.

      Hopefully my understanding with Austin is at a better level and my postings will be more readily accepted and posted. I too find this blog of interest and intellectually stimulating. I am not sure what Austin’s thoughts are on all these issues. He seems to have an enquiring mind so I expect his opinions to evolve just like mine have over the years.

      PS: I take note of the fact that I responded to you and then read Austin’s response about evolving. It seems we both have the same idea in mind. I look at this blog as a place of learning. I don’t care if I win or lose an argument. As I said before if I am found to be wrong then I owe the one that corrected me a ‘thank you’ for he has taught me something I didn’t know before.