• Oregon Medicaid – Power problems are important

    This is a joint post by Aaron Carroll and Austin Frakt. This is part of our continuing coverage of the new Oregon Medicaid study paper. Prior posts are here: Post 1, Post 2, Post 3. More are forthcoming.

    People who assume we’re partisan hacks are going to take the following as a nitpicky defense, or obfuscating, or dissembling. It’s not, and we’re not. This is about the proper interpretation of research. The reason we did not have the discussion below after the initial round of Oregon Health Study results rolled out was that the debate was clearer. The results were significant, but people disagreed on their real-world importance. That’s a debate worth having, but not a technical research problem. This time, we’re disagreeing on the interpretation of the analysis, and that’s more weedy and jargony.

    You see, the study did not prove Medicaid hurts people. Nor did it prove that Medicaid doesn’t help people. It failed to show that Medicaid improved some metrics, like A1C, by a certain predetermined amount. But what was that predetermined amount? That question is vitally important, because the study found that more people on Medicaid did improve their A1C, just not “enough”. Is that because the study was underpowered (had an insufficient number of participants)? We think that may be the case. But that question should be answerable…

    So an eagle-eyed reader pointed us to the Supplementary Appendix. We’re in the weeds here, yes, but Table S14c, on page 45, is “Mean Values and Absolute Change in Clinical Measures and Health Outcomes: prerandomization specific diagnoses”.

    Before randomization, there were 2225 people with hypertension. If we assume that half got randomized to each arm, and then take the 24.1 percentage point increase in coverage the study reports, that means there were only about 270 people with hypertension who got Medicaid, and who could be studied for this outcome. Further, those people had a baseline average blood pressure of 130/83. That’s remarkably well controlled! So there’s not nearly the room for improvement that you might assume.

    There’s a similar story for diabetes. Before randomization, there were 872 patients with diabetes. Half to each group, and then the 24.1% who actually got new Medicaid, and you’ve got about 110 patients with diabetes in the Medicaid group. And again, their average baseline A1c was 6.7%, which is pretty well controlled. How much could Medicaid do? With respect to the percentage of patients with an A1C>=6.5, there appears to be so much imprecision in the measurement that the 95% confidence interval is consistent with every single person with diabetes in the Medicaid group having been brought under control: the baseline percentage with A1C>=6.5 was 54.0, and the reduction in the Medicaid group was -27.0 percentage points (95% CI -71.91 to 17.92).
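    The back-of-the-envelope arithmetic above is easy to check. A minimal sketch (assuming, as we do above, an even split between arms and the study’s 24.1 percentage point increase in coverage):

```python
# Rough effective-sample-size arithmetic, using the post's figures:
# assume half of each pre-randomization diagnosis group landed in each
# arm, and that winning the lottery raised Medicaid coverage by 24.1
# percentage points.

def effective_n(pre_randomization_n, takeup=0.241):
    """Approximate number with the diagnosis who actually gained Medicaid."""
    per_arm = pre_randomization_n / 2
    return per_arm * takeup

print(round(effective_n(2225)))  # hypertension: roughly 270
print(round(effective_n(872)))   # diabetes: roughly 105
```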

    But let’s say that these numbers were artificially low because people were undiagnosed. They still would have given us pause. With so few participants with disease, it’s hard to believe that you’d eventually amass enough people to detect a clinically significant difference. And when you look at the actual numbers in Table 2, concerns still exist. Take diabetes, for instance. Only 5.1% of the control group had an A1C>=6.5 (diabetes). Let’s assume that the starting prevalence was the same in the intervention group. That means that only 624 people (312 in each group) actually had a high A1C in the study. It appears they may also have been relatively well controlled. (Aside: With such low rates of poor health, by these measures, how generalizable are the results? We’ll consider that question another time.)

    This same discussion holds for the other metrics. This smacks of being underpowered.

    It appears that uncontrolled diabetes was not, in fact, especially prevalent in this population. That being the case, we’re not sure what effect, if any, you could expect Medicaid to have on this population with respect to A1C. Can we agree that if there are relatively few people in the study with diabetes, and those who have it are relatively well controlled, then the study itself probably can’t detect a clinically significant change? This is a BIG difference from saying Medicaid could have had a significant impact, but didn’t.

    It should be possible to say something like the following, only with the numbers filled in: “We believed that there would be about X people in the study who would have diabetes, and that Y% of them would have A1Cs greater than 6.5. We believed that a clinically important reduction in this percentage would be Z, and given the variability in A1C levels, the study was powered to detect that change.” We’ve reached out to the study authors to try to fill in those numbers.
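    To make that concrete, here is a sketch of what such an ex ante calculation might look like, using a standard normal-approximation power formula for comparing two proportions. The 54% baseline share with A1C>=6.5 comes from Table S14c; the 10 percentage point “clinically important” reduction and the ~110-per-group sample size are purely illustrative placeholders for the Y and Z above:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_proportions(p1, p2, n_per_group, z_crit=1.959964):
    """Approximate power of a two-sided two-proportion z-test at alpha=0.05."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return normal_cdf(abs(p1 - p2) / se - z_crit)

# Hypothetical inputs: 54% with A1C >= 6.5 at baseline (Table S14c),
# and suppose a 10 percentage point reduction is the clinically
# important change. With ~110 diabetic patients per group:
print(power_two_proportions(0.54, 0.44, 110))   # roughly 0.32
# versus ~400 per group:
print(power_two_proportions(0.54, 0.44, 400))   # roughly 0.81
```

    Under these made-up inputs, the study would have had roughly a one-in-three chance of detecting the clinically important change. That is the kind of statement an ex ante power calculation yields, and it is what we are asking the authors to fill in with their real numbers.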

    Our concern remains that it appears unlikely that there were enough people with uncontrolled diabetes that you could detect a clinically significant change with statistical significance. If we’re wrong, we’re more than happy to be proven so. But if we’re right, then it has some pretty big implications for how this study should be interpreted. If we’re right, then it’s not possible that Medicaid could have achieved a statistically significant difference. The deck was stacked against the program.

    Lots of people are claiming this is a smoke screen and that the ex post confidence intervals are enough of a power calculation. They’re not. Let us put this another way: Our problem with the ex post way many are talking about the study is that the analysis did show improvements. And no one is claiming that the improvements aren’t good enough clinically. (See, for example, the annotated table at the end of this Kevin Drum post.) They’re only claiming they aren’t statistically significant.

    So was it that the improvements weren’t big enough, or that the sample was too small? We can clearly see from the confidence intervals how much bigger the improvements would have to be in order for them to be statistically significant with the sample available. But it’s also true that if the study were larger, by some amount, and had found the same point estimates to be statistically significant, we’d not be having this conversation. With a big enough sample, even the smallest differences are statistically significant, yet the study’s point estimates aren’t small. Was this study capable of finding clinically and statistically significant effects of reasonable size? This is what an ex ante power calculation is for. It informs the researcher as to what is even worth trying to examine.
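    One way to make the “if the study were larger” point concrete: the standard error behind a confidence interval shrinks with the square root of the sample size, so from a reported interval you can back out roughly how much larger the sample would have needed to be for the same point estimate to reach significance. A sketch using the A1C figures quoted above (-27.0, 95% CI -71.91 to 17.92):

```python
# Point estimate and 95% CI from Table S14c: change in the share of
# diabetic patients with A1C >= 6.5, in percentage points.
estimate = -27.0
ci_low, ci_high = -71.91, 17.92

half_width = (ci_high - ci_low) / 2   # CI half-width, ~44.9
se = half_width / 1.96                # implied standard error, ~22.9

# For two-sided significance at 0.05 with the same point estimate,
# the standard error would need to shrink to |estimate| / 1.96.
se_needed = abs(estimate) / 1.96      # ~13.8

# SE scales as 1/sqrt(n), so the required sample-size multiplier is:
n_multiplier = (se / se_needed) ** 2
print(round(n_multiplier, 1))         # roughly 2.8x the sample
```

    In other words, holding the point estimate fixed, a sample roughly three times as large would have made this result statistically significant. The significant/not-significant line says as much about n as it does about the effect.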

    We understand there are people who will claim we’re changing our tune by now questioning things. We’re not. The design of the study is fantastic. The choice of these specific outcomes and this specific analysis is what we now question.

    More to come.

    @aaronecarroll and @afrakt

    • “But it’s also true that if the study were larger, by some amount, and had found the same point estimates to be statistically significant, we’d not be having this conversation.”

      This is the key line.

      I prefer to be a rough Bayesian about this, and I had no strong priors. If you had a completely flat prior, then the point estimate is our current best guess on the effect. It’s big. How confident are we… not super confident, but the big point estimate is enough to warrant further study.

      I mean, jeez, if you want to be super-frequentist about it let’s say the CI was a little smaller so we could just reject 0 at 0.05, but we cannot reject a very small effect at 0.05, so we’re, substantively, right back where we started. Why should such a marginal difference have such a discontinuous effect on our conclusions? The fetishization of p<0.05 is silly.

      • Looking at their analysis, I’m also starting to think the study design was a bit unfocused.

        They threw everything into their regression – not just the three big process measures, but financial security, mean blood pressure in the total population, everything. As our Incidental Economists have pointed out a few times, the most interesting measures are probably the proportion of patients who have high BP or high cholesterol, not the average BP in a mostly healthy population. So the three big process measures got drowned in a sea of noise.

        And yet all three process measures showed an effect! 6%, 17%, and 17%. That’s actually pretty darn good. Sure, we can’t have too much confidence in those point estimates now, but if the study had been designed to look only at those three (by far the most interesting and relevant health measures in the study) and had assumed they weren’t independent (which they almost certainly aren’t), then the CIs would be just fine. Someone needs to suggest that to the authors for the next follow-up.

    • If you look at the related links at the bottom of the Suderman post (which I actually thought was decent), there’s a link to a Jim Epstein post from just a week ago about how Medicaid and Obamacare hurt the poor that confidently repeats every zombie lie about Medicaid. It restricts access to health care! It causes worse health outcomes! But yeah, the Oregon study totally confirms everything libertarians have been saying about Medicaid and the ACA.

    • Also, I’m pretty sure the instrumental variables technique that they (appropriately) used to account for incomplete take-up of Medicaid after randomization (the 24.1% you refer to) itself limits power. For every patient who received Medicaid due to randomization, there are >3 who could have but didn’t. The IV rescales the effect size to account for this, but cannot fix the confidence intervals – that 75.9% are statistical noise, making the effective sample smaller than you implied.

      • J,

        ???? Uhm, IV methods give standard errors, which can be used to calculate confidence intervals.
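        A minimal numeric sketch of the rescaling at issue (the intent-to-treat estimate and standard error below are made-up numbers; only the 24.1 percentage point first stage is from the study, and this ignores uncertainty in the first stage itself). The Wald/IV estimator divides both the ITT estimate and its standard error by the first-stage take-up, so the confidence interval widens but the z-statistic, and hence significance, is unchanged:

```python
# Hypothetical intent-to-treat (ITT) estimate and standard error.
itt_est, itt_se = -0.065, 0.055
first_stage = 0.241  # increase in Medicaid coverage from winning the lottery

# Wald/IV (LATE) estimate: rescale estimate and SE by the first stage.
iv_est = itt_est / first_stage
iv_se = itt_se / first_stage

z_itt = itt_est / itt_se
z_iv = iv_est / iv_se
print(round(z_itt, 6), round(z_iv, 6))  # identical up to rounding error
```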

    • I do not for the life of me understand why the CIs aren’t sufficient to assess the power of the actually implemented study. Why do I care, at this point, whether a larger study could have found an effect this size to be significant? Of course I can find an n such that this study is significant, but it’s too late to change n, so all I care about is what this study can detect.

      This study cannot detect a significant effect size of the magnitude found. The CI tells me how big an effect these data could plausibly be consistent with. I can look at the endpoints of the CI and ask, “do I care a lot which end of the CI this effect actually is?” If it’s a tight CI around 0, I can conclude that there probably wasn’t a big effect. If the interval includes effect sizes that we’d consider substantial if they were significant, then I can conclude that the study is under-powered. What does knowing what I could realistically have hoped to detect add to that?

      A better argument would be to say that you want to determine what’s clinically meaningful in advance.

      I find the line of reasoning about how well controlled the control group is to be vastly more compelling and interesting. There were a lot of people in the control group who were already on blood-pressure meds. Is this a bad health outcome to measure because those meds are cheap enough for poor people to get even if they’re not on Medicaid? That would be a much more optimistic story, and it would suggest that we’ve simply got the wrong health measures to try to assess the additional impact of Medicaid on top of what else they get–and it suggests that the financial benefits and the process measures are much more interesting and relevant than the headline-grabbing blood pressure numbers. As, I believe, you’ve been saying.

      In the end, what we’d really like to see would be the effects of having Medicaid v being uninsured on mortality rates over a reasonable time span. “Unfortunately,” these people are probably going to be insured before that reasonable time span arrives. Drat.

      • If one knows ex ante the sample size needed for a reasonably expected and clinically meaningful effect, then one can compare that needed size to the actual sample. If the sample is too small, it’s questionable whether the analysis should be done, much less published.

        • We’re past the point of it being useful to question whether the research should be done. And I see no particular advantage to having done that calculation ex ante, or doing the comparison in terms of number of observations, when you’re trying to assess the value of the completed study.

          Which returns me to wondering why on Earth being able to see how large an effect you could actually have detected is inadequate for determining whether there was sufficient power ex post.

    • This would also be a very good point to remember the first rule of Medicaid studies. As my colleague Jim Fossett likes to say, if you learn something about one state Medicaid program, you’ve learned something about…one state Medicaid program. I suspect you’d get rather different results if you did the same thing in my home state of Tennessee.

      This is also a good point to remember that, even within Oregon Medicaid, we’re learning about the effects of an expansion. The really really really poor were already covered and thus aren’t measured here. You would probably get a larger impact if you, say, did the horrendously unethical experiment of randomly kicking the entire Medicaid population off the rolls.

      • Very good point about expansion.

        This study certainly does tell us more about whether states should be participating in the ACA Medicaid expansion (and I think the study actually makes some strong arguments that they should) than whether Medicaid works as a whole. And it certainly tells us next to nothing about how Medicaid compares to private insurance, which is what a lot of twitter commentators seem to think is the key finding.

        • The people who think this study compares Medicaid to private insurance are very exasperating!!!!!!

    • Why didn’t they make a composite measure, since they had to have known that the number of people with each individual outcome would be underpowered (not every one of the enrollees was going to have DM)? What would have been the score of a composite measure where getting any one improvement (either DM or HTN, for example) would have been scored as a positive result?

      • This is their composite:

        The predicted 10-year risk of cardiovascular events was measured with the use of the Framingham risk score, which estimates risk among persons older than 30 years of age according to sex, age, levels of total cholesterol and HDL cholesterol, blood pressure and use or nonuse of blood-pressure medication, status with respect to diabetes, and smoking status, with the predicted risk of a cardiovascular event within 10 years ranging from less than 1% to 30%.

    • Some thoughts from an informed patient who knows some of the literature….

      With a study in Oregon, you’re starting with a relatively healthy population….The UnitedHealthcare Fdn annual rankings put Oregon at 15th in 2010, 8th in 2011, 13th in 2012.

      In treating the 3 major measurements in this study, there is often a long period of titration of meds or use of additive therapy. That may explain why changes were in a positive direction, but no major changes in just 2 years. An important bit of data would be… how many “doctor visits” for treatment did patients have over the two years?

      For example, often treatment of diabetes goes through several steps, with changes at each “doctor visit”….visit 1 blood test for A1c…..visit 2 (3-6mos later) try diet & exercise….visit 3, start metformin low dose….. visit 4…..metformin high dose, or add sulfonylurea-type drug….. visit 5…re-check A1c….visit 6….add another class of drug to metformin+SU…and on from there….easily needing 2 years or more to get A1c down significantly in real-life clinical settings..

      Similarly, for hypertension, it takes a number of “doctor visits” to titrate and add medications to significantly bring down blood pressure. Good control of hypertension or diabetes often requires titration and 2 or 3 meds in combination. Perhaps see the “clinical pathways” recommendations from Amer Diabetes Assn, etc.

      In Medicaid treatment, there may be a preference for older, but “weaker” medications since they are available as low-priced generics. For example, good old Pravachol is on the Walmart $4/month list at 40mg for a 29% LDL reduction……current brand name Crestor 20mg provides a 48% LDL reduction. It would be understandable that Medicaid patients might get longer trials of lower-cost, less effective medications before “moving on up” to newer medications.

      For this study, A1c, LDL and BP may be easy, repeatable measurements….. but not the primary problem the patient needed treated. A patient who has gone without health attention likely comes to a first “doctor visit” with some immediate-need treatment, or acute or symptomatic conditions which the doctor decides to treat first, before beginning on long-term preventive regimens. This may help explain the high satisfaction rates among patients, even if A1c, etc. didn’t change much.

      Here in Oklahoma, the political world is still using the Univ of Va surgery outcome study as “proof Medicaid is worse than no insurance”…..and praising a local surgicenter which posts low prices, but can reject any patient as a high risk. Is there a way to explain the Oregon study in a way that’s meaningful to lay folks and politicians? For the “under-powered” argument, is a comparison of the patient numbers for pharma studies for efficacy of drugs for diabetes, hypertension, etc compared to how many patients with these conditions were in the Oregon study going to be helpful? Other suggestions?

      [“doctor visits” in quotes because I’m using it as shorthand for visits to NPs, PAs, etc.]

    • I had expected to comment regarding some aspects of the sad state of statistical understanding related to the article, as has been done already; however, after seeing that the elephant in the living room is invisible, I thought I would point out to what I suspect is an economist audience that the most compelling (and indeed exceedingly compelling) finding is with regard to depression. Alas, readers (who I assume have health economics and health policy backgrounds) seem as in the dark about the dire significance of depression as the average person is about the poor use and communication of statistical matters.

      I have not looked at the specifics of diagnoses or outcomes regarding depression, so I am only going on the newspaper coverage about the study showing rates of depression reduced by 30% (whatever that means). Nonetheless, whatever it means is enormous, as depression is a known contributor to everything from misery, poor parenting, and risky behaviors/substance misuse/abuse to suicide, as well as being associated with reduced worker output and, when especially severe, disability.

      Both here and in the press it is as if the lead article on Buffett was, “Buffett shows mixed ability to manage wealth.” Why comment on statistical aspects related to physical ‘markers’ presumably (and sometimes doubtfully) even related to real-world outcomes, when the giant elephant in the living room is considered non-existent because it is not conceived of as a ‘physical’ illness?

    • I hope one of you guys will at least touch on the validity of the instrumental variable approach they used in the analysis, as a lot of people (myself included!) either don’t understand it or are just glossing over it.

    • The analysis was underpowered–it could not reliably detect effect sizes that were expected given the population and what is known clinically. For example, the hospitalization rate increased 300%, but this effect was not statistically significant. Similarly, smoking increased 15%, which is equivalent to a change in tobacco taxes of 100% or so. These two negative health effects suggest that Medicaid was not very helpful, although some may argue that going to the hospital more is health-improving.

      The important question, which may have a political answer, is why this study was funded. On most NIH review committees I sit on, the question of power is of first-order importance. I believe that part of the funding for the study came from the windfall NIH money in the stimulus bill. Of course the all-star team of researchers may have gotten a pass on the details of the study, but in the end, the study ended up spending a lot of money for little information. The one thing everyone knew was that giving someone a card for free, unlimited medical care will increase the use of care. It was the health effects of providing insurance that were the question in need of an answer.

    • One of the few things I remember from college statistics is that it is said to be illegitimate to compare proportions of proportions. “Thirty percent is fifty percent more than twenty percent” is, as it stands, meaningless gabble.

      This comes to mind when I see that we are talking about *hyper*tension. “Hyper” compared to what?

      Now if we add in the fact that everything about hypertension depends upon the subjective reporting of subjective phenomena — even though the magical Tversky and Kahneman are supposed to have at least in principle given some meaning to interpersonal subjective comparisons, it seems to me we are on somewhat marshy ground here.

      OK, the tension has blood pressure as its proxy, maybe. And what happens to blood pressure when somebody in a white coat appears?

      Even when you’re dealing with machined steel at controlled temperatures (I used to manufacture auto parts), there isn’t much truth beyond the second or third significant digit.

      With this stuff here I’d think you’d be lucky to tell a plus from a minus — and to try putting even one decimal digit on anything would seem to me ambitious.


    • In discussing the Oregon study, I told a friend that I would try to ferret out some of the details/information. What I eventually wrote to him is below:

      Alas, the Oregon study seems to me to be quite a mess. I understand that the authors were trying to take advantage of the natural occurrence of this unusual lottery for Medicaid benefits and that they were dependent to a considerable extent on the data that were available, as well as having intense time/money constraints and being health economists not particularly sensitive to the clinical interpretation of mental/behavioral health issues, but I am afraid their data are not really interpretable and their descriptions and conclusions are not worth trying to think about once one realizes how the mental health data were formulated. To give you only three examples (as to some extent I do not think there is a single presentation of non-physical results that is without significant misleading aspects): (1) Data were collected on whether the subjects said ‘yes’ to ever having received a diagnosis of depression, but since the authors were primarily interested in the effects of receiving or not receiving the Medicaid benefits, the authors only counted the subjects as ‘depressed’ if they received the depression diagnosis between the date of the first eligibility for entrance into Medicaid and the taking of the survey that asked about receiving the diagnosis. (“We consider participants to have post-lottery first diagnosis if they reported having received it prior to March 2008 or after. Those never having received a diagnosis or having received it prior to March 2008 were considered not to have a post-lottery first diagnosis.” [pg 14, supplementary materials]).
      Now how can you talk about changes in depression when subjects who happened to have had depression a few months before others wind up classified as never having been depressed?

      (2) Though at some point some of the subjects are also asked about all their medications and the list of medications was then analyzed, including for ‘antidepressants,’ how does one make any judgment when “For example, we considered anyone taking a medication classified as an antidepressant to be taking medication for depression (even though that drug may have been prescribed for a different indication).” Furthermore, I could not make out how this group of antidepressant takers was related to those asked about having received a diagnosis, nor to who were the subjects classified as ‘depressed’ because those subjects scored ten or greater on the PHQ-8. (Also, the authors used the PHQ-8 rather than the standard PHQ-9 by eliminating the one item that asks about suicidal ideation because “…the question about suicidal ideation (which is rarely answered in the affirmative, and thus makes little substantive difference in scores…)) [sic, re: weird parentheses]. (I won’t even go into how many subjects failed to bring all their medications to the interview and how many did not complete the follow-up phone call.)

      (3) Lastly, it does seem that the authors did use the question about whether one was “very happy” or “pretty happy” versus “not so happy.” Yet it is not at all clear if this simple measure was what they actually reported: when I tried, unsuccessfully, to find how subjects actually answered, instead of the data I ran into: “We also asked about how individuals were feeling in general, and we construct a measure of being ‘very happy’ or ‘pretty happy’ as compared to ‘not so happy.’” And it seems that the measure constructed was a complex mix of this simple dichotomy plus several other general measures, all of it weighted to be comparable to the SF-36 with a mean of 50 and an SD of 10, so I could not even be sure whether the results listed in the table related to the question about happiness per se or whether that happiness dichotomy was only one element of the construction of a measure of being happy or not.

      Thus, I have given up on making any sense of the Oregon studies (I looked at the earlier first year study also). Somewhere I do believe there are some interesting data, but I think a lot of it is a mess and there is no way short of a year of work to figure out what is what.

      If you are not yet asleep, then I am off to sleep myself anyway and bid you adieu.