• Updated power calculation

    Last night I had a false start in updating power calculations for the Medicaid Oregon study. The final result with corrections and a new PDF is here. If you’re into this stuff, it’s worth a look. The bottom line is that the study was underpowered for the change in proportion with elevated glycated hemoglobin (GH) by a factor of 23. Yes, twenty-three. You can use what I posted to run the numbers for other outcomes yourself.

    What happened last night is that I had failed to update the computation of R², which requires some algebra or a simulation, neither of which I was prepared to do at 10 PM. I’ve done both this morning and the results are documented in the updated post and PDF linked therefrom.

    Let me now state where we are. With respect to the statistically insignificant physical health measures in the study, we now know they were very underpowered. The sample was too small even for much larger effects. This renders them statistically uninformative in general and, in particular, uninformative about whether or how much Medicaid improves physical health. Uninformative means just that. No new information. No resetting of priors is justified on this question.

    We also know, from the authors’ discussion and from Aaron’s posts, that the results include changes in blood sugar and blood pressure that are not unreasonable to have expected clinically. Thus, the results — or these two anyway, but I suspect it generalizes — are not clinically informative either. Again, no resetting of priors is warranted.

    Given this, for the physical health measures only, I don’t understand the rush I’ve noticed in people updating what they expected Medicaid could do. These results really shouldn’t do that if they are, as I’ve said, uninformative both clinically and statistically. How did people make these judgements the day after the study was published? It’s taken me and Aaron almost two weeks to chase things down. I think it is time for people to take another look at what this study is saying, at their own priors, and, yes, at their own biases.

    What I think we’re seeing is a re-expression of everyone’s priors. This study is an opportunity to do that, but it doesn’t and shouldn’t change what they are. The claims that people should be switching sides between pro- and anti-Medicaid expansion just make no sense based on the physical health measures in this study.

    Meanwhile, yes, this study reconfirms some large financial and mental health benefits that we knew about from last year’s paper from the group. I’m not sure that’s a prior-updating event either.

    This was (is) an excellent study done by a smart and capable team of investigators. The results, to the extent they are meaningful, should be viewed as among the most credible possible within the context of the study (in/around Portland, over 2 years, Medicaid circa 2009). And yet, much too much is being made of the set of results that just don’t tell us anything new.

    I’m happy to be corrected on any of this if I’ve overlooked or misinterpreted anything. As always, I am merely trying to be scientific.

    UPDATE: I mistakenly wrote “30” instead of “23”. It was a very bad rounding from 22.9.


    • If this is “an excellent study done by a smart and capable team of investigators” I am very disappointed that the authors didn’t clearly make the point that the study was very underpowered, and that no policy conclusions could be drawn from the results. They had the data all along, and an obligation to evaluate it.

      Yes, this was written for an academic vs. general audience. But the authors must have known that it would become fuel for the ACA debate. I find this just as egregious as Rogoff and Reinhart burying a “correlation is not causation” statement in their now-famous paper, and then going on all the talk shows and op-ed pages making policy recommendations that assume causation.

      • Underpowered results such as these are typical of a lot of biomedical studies, particularly in high-impact journals like NEJM, which are the paradigm drivers.

        The idea is to provoke larger, more complete and hopefully definitive studies. And free up the resources to fund said. Which will eventually be published in lower tier journals.

        This is just the normal tarry and push.

    • The PDF says that the scholars used IV because noncompliance “could have biased the results had this been run as a straight RCT.”

      This seems not quite right. If the concern was bias from noncompliance, that’s addressed by using intent-to-treat analysis.

      Of course, they didn’t want to feature the much lower effect sizes from the intent-to-treat analysis (which showed the effect of offering the program) and instead wanted to calculate the LATE for Medicaid compliers by multiplying the effect sizes by 4.

      Note that they did this even though the people who comply are clearly different –smarter, more conscientious, etc., than the rest of the pool. So the effect sizes they found are almost certainly biased upwards compared to the effects that would be found if everyone else was automatically enrolled in Medicaid.

      (Intuitively, people who are too dumb — sorry for being blunt — to even bother signing up for Medicaid on their own are also too dumb to do much of anything else to improve their health, such as taking blood pressure medication when prescribed, etc. There’s not much anyone can do to help these people, I’m afraid.)
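
      The “multiplying by 4” step described above is the standard Wald (IV) ratio: the intent-to-treat effect divided by the difference in coverage rates that the lottery induced. A minimal sketch, assuming a take-up difference of 0.25 (the figure implied by the factor of 4) and a purely hypothetical ITT effect; the function name is illustrative and not taken from the study:

```python
def wald_late(itt_effect, takeup_difference):
    """Local average treatment effect = ITT effect / first-stage take-up difference."""
    if takeup_difference == 0:
        raise ValueError("the instrument must shift take-up for the ratio to be defined")
    return itt_effect / takeup_difference


# Hypothetical ITT effect, made up for illustration only (not a study result).
hypothetical_itt = -0.002
# Assumed take-up difference implied by "multiplying by 4".
assumed_takeup_difference = 0.25

late = wald_late(hypothetical_itt, assumed_takeup_difference)
print(late)  # four times the ITT effect
```

      The ratio rescales the estimate but says nothing about who the compliers are, which is the concern raised in this comment.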

      • LATE is exactly that, local, in the sense of comparing compliers. If the lottery loser compliers are significantly different from the lottery winner compliers, that could be a source of bias. One has to also admit that ITT has a bias.

        Anyway, I am sticking with the standard reasons for use of an IV, which is for bias reduction.

      • I would be extremely leery of using an intention to treat analysis. This is a conservative estimator that has a strong bias towards the null when adherence is very low. Given that it is unclear whether the recruitment mechanism would be representative of the uptake of Medicaid in full roll-out, it could under-state the effectiveness of the intervention.

        Using the LATE approach, they more directly get at the causal question (which is the effect of the program). If we thought the efficacy of the program was high enough, we could go to completely different paradigms (like requiring people to opt out of default medicaid coverage — which I am not suggesting but which is a theoretically possible policy response).

      • IV gets rid of bias IF you’re asking what the effect on compliers is and IF you’re willing to admit that the effect size cannot be extrapolated to anyone else.

        Think about this from another angle: if you got perfect compliance with the experimental protocol by excluding 100% of the control group from Medicaid and automatically enrolling 100% of the treatment group, the effect sizes on actual health would probably still be very small, even though most of your discussion of the power issue would be mooted.

        Why? Because many of the responsible people in the control group would find a way to get the necessary care anyway, while the irresponsible people in the treatment group aren’t going to bother taking their blood pressure meds no matter how much insurance they have.

    • It’s the conclusions drawn from the results, not the results themselves, that are troubling. Indeed, even an expert like Frakt has taken several weeks and lots of time to analyze the results and then compare the results with the researchers’ conclusions. I’ve asked how social scientists, as well-trained and experienced as they may be, can ever reach conclusions about health and wellness. I appreciate that collaboration between the hard sciences and the social sciences can promote evidence based public policy decision-making, but the conceit of the researchers can just as easily undermine it. Of course, this is related to the debate about evidence based medicine (EBM) and integrating individual clinical experience with the best available external clinical evidence, and the resistance of many clinicians. I fear that unsupportable conclusions like those of the Oregon researchers also have the potential of undermining confidence in EBM. There’s more at stake here than the reputations of a few researchers, no matter how well-intentioned they may have been.

    • For what it’s worth, this Duflo, Glennerster, Kremer paper gives a formula for the minimum detectable effect (or, alternatively, the sample size needed) for an experiment with partial compliance: http://www.nber.org/papers/t0333.pdf (formula 13, page 33).

      I plugged in:

      MDE (effect size) = 0.01
      1-kappa (power) = 0.8
      alpha (1-significance) = .975 (this is because it’s a 2-sided test)
      P (share getting instrument) = 0.5
      sigma^2 = .05*.95 (the variance of a Bernoulli w/ p=.05, an approximation for the regression sigma)
      c – s (increase in takeup of treatment due to instrument) = 0.25

      I got N=238,606 which is about 20 times the size of the actual survey of 12,229 people. The difference from your number is probably due to the approximations I made.

      Or you can plug in N=12,229 and find that the minimum detectable effect ex ante at that power would be 0.044. There are some problems here with heteroscedasticity since the treatment group would have a lower sigma than the control group under that effect size. But the point holds that the effect would have to be enormous to detect it.

      The other big takeaway is that if you can reduce the error variance of the regression, e.g. by including controls that explain A1C or by including a baseline measure of it, the power to identify an effect rises enormously. These options weren’t on the table since I doubt there was an opportunity to interview people before they got their lottery result and A1C was poorly explained by their control variables (see the NEJM appendix footnote 4).
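
      The arithmetic in this comment can be reproduced with a short sketch. It assumes the formula takes the usual partial-compliance form, MDE = (z_power + z_alpha) * sqrt(sigma^2 / (P(1-P)N)) / (c - s), which is my reading of the cited formula rather than a quotation of it. The function and parameter names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def _z(p):
    """Standard normal quantile for probability p."""
    return NormalDist().inv_cdf(p)


def required_sample_size(mde, power, alpha_quantile, share_assigned,
                         sigma2, takeup_difference):
    """Sample size needed to detect `mde` under partial compliance (assumed formula)."""
    critical = _z(power) + _z(alpha_quantile)
    variance_term = sigma2 / (share_assigned * (1 - share_assigned))
    return critical ** 2 * variance_term / (mde * takeup_difference) ** 2


def minimum_detectable_effect(n, power, alpha_quantile, share_assigned,
                              sigma2, takeup_difference):
    """Smallest effect detectable with sample size `n` (same assumed formula)."""
    critical = _z(power) + _z(alpha_quantile)
    variance_term = sigma2 / (share_assigned * (1 - share_assigned) * n)
    return critical * sqrt(variance_term) / takeup_difference


# Inputs as listed in the comment above.
params = dict(power=0.8, alpha_quantile=0.975, share_assigned=0.5,
              sigma2=0.05 * 0.95, takeup_difference=0.25)

n_needed = required_sample_size(mde=0.01, **params)
mde_at_actual_n = minimum_detectable_effect(n=12229, **params)
print(round(n_needed), round(mde_at_actual_n, 3))
```

      With these inputs the sketch lands close to the N=238,606 and 0.044 figures quoted above. Because sigma^2 enters the sample size linearly, halving the error variance would halve the required N, which is the point about controls and baseline measures.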

    • “Uninformative means just that. No new information. No resetting of priors is justified on this question.”

      What a strange conclusion. The confidence intervals weren’t infinite, so you should indeed update your priors (not “reset”, whatever that means, but update.) Of course if you had very strong priors, this would have little effect…but why would you have very strong priors about something like this that hasn’t really been studied very much?

      And anyway, the study *did* find some statistically significant results on depression and financial security, results that were praised by this very blogger!

      • Eh? Wasn’t it clear I was referring to the not statistically significant physical health measures? Wasn’t it clear I meant uninformative on the size of the effect since nobody should believe the upper limit and the lower limit is equally preposterous? The study doesn’t seem to have ruled out anything anybody should have expected. Uninformative.

        • I think the point is that if you have a Bayesian viewpoint, then even 1 data point should change your posterior belief about the effect of the treatment. (Assuming of course that the study is otherwise designed properly.) You ought to arrive at the same belief after seeing 100,000 identically-designed studies with n=1 or 1 study with n=100,000. You can’t do that if you throw out each individual study because it’s underpowered at a particular effect size, confidence level, and power.

          Of course, the frequentist viewpoint has other advantages, but in this case, where data are likely to come in trickles, it seems useful to be able to learn something from small samples. That said, it sounds like this study shouldn’t move our priors _very much_ on these questions where you show it’s underpowered; my point is just that that’s different from saying it offers no new information on those questions.
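
          The claim that many tiny studies and one large study lead to the same belief can be illustrated with a conjugate Beta-Bernoulli update. This is a minimal sketch under assumptions of my own: a flat Beta(1, 1) prior, simulated binary outcomes with a 5% rate, and 1,000 observations rather than 100,000 to keep it quick.

```python
import random


def update_beta(a, b, observations):
    """Conjugate Beta-Bernoulli update: successes add to a, failures add to b."""
    successes = sum(observations)
    return a + successes, b + len(observations) - successes


random.seed(0)
# Simulated 0/1 outcomes; the 5% rate is an assumption for illustration only.
data = [1 if random.random() < 0.05 else 0 for _ in range(1000)]

# One study with n = 1000, folded into the prior in a single step.
batch_posterior = update_beta(1, 1, data)

# 1000 studies with n = 1 each, folded in one at a time.
a, b = 1, 1
for outcome in data:
    a, b = update_beta(a, b, [outcome])
sequential_posterior = (a, b)

print(batch_posterior == sequential_posterior)
```

          Each single observation moves the posterior a little, and the order and grouping of the updates do not matter. That is the sense in which an underpowered study still carries some information.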

    • Dear Austin,

      I think the study is indeed quite informative, and all you need to do is look at the enormous amount of effort that has been expended by proponents of Medicaid to downplay or dismiss the results.

      Here’s a simple way to interpret the results of the study, based on your work. Let’s use your estimate that the study was underpowered by a factor of 23; we can then agree that even expanding the sample size by a factor of 10 would be insufficient. There were approximately 4,000 people receiving Medicaid, versus a control group of 8,000. Then, since the IV analysis specifically looked at the effect of Medicaid coverage (not just the offer of coverage, but actually accepting it), we can say:

      “If we compare physical health outcomes for a population of 40,000 people who receive Medicaid versus a population of 80,000 who do not, the effect of Medicaid coverage is indistinguishable from pure noise.”

      I think that if I had made that statement one month ago, most people would have dismissed me as a complete lunatic. Yet your work establishes that this statement follows from the best data we have available. In that sense, it’s informative.

      • This seems like a strange argument. Who says that most people have reasonable intuition about the correspondence between anticipated effect size and required sample size at a particular power and confidence level? Particularly when the study is using somewhat-fancy econometrics to work around the way the experiment was implemented.

        • Dear Henry,

          To some degree we are quibbling about the precise semantics of the word “informative”. All I’m saying here is that if people anticipated one thing a month ago, reasonably or not, based on intuition or evidence or whatever, and then empirical results are different from what was anticipated, it’s not unreasonable to say that those results are “informative”. If those prior anticipations were based on faulty assumptions, that’s fine, in fact it makes the data even more “informative”.

          But the broader point remains — just because the study didn’t achieve statistical significance doesn’t mean that it doesn’t tell us anything. You don’t need any fancy econometrics to see that the study was underpowered — all you had to do was to look at the p-values! But the fact that it was underpowered with such a large sample indicates that the effects are small, and that is indeed informative (and unanticipated).