Oregon Medicaid – Power problems are important

This is a joint post by Aaron Carroll and Austin Frakt. It is part of our continuing coverage of the new Oregon Medicaid study. Prior posts are here: Post 1, Post 2, Post 3. More are forthcoming.

People who assume we’re partisan hacks are going to take the following as a nitpicky defense, or obfuscating, or dissembling. It’s not, and we’re not. This is about the proper interpretation of research. The reason we did not have the discussion below after the initial round of Oregon Health Study results rolled out was that the debate was clearer. The results were significant, but people disagreed on their real-world importance. That’s a debate worth having, but not a technical research problem. This time, we’re disagreeing on the interpretation of the analysis, and that’s more weedy and jargony.

You see, the study did not prove Medicaid hurts people. Nor did it prove that Medicaid doesn’t help people. It failed to prove that Medicaid improved some metrics, like A1C, by a certain predetermined amount. But what was that predetermined amount? That question is vitally important, because the study found that more people on Medicaid did improve their A1C, just not “enough”. Is that because the study was underpowered (had an insufficient number of participants)? We think that may be the case. But that question should be answerable…

So an eagle-eyed reader pointed us to the Supplementary Appendix. We’re in the weeds here, yes, but Table S14c, on page 45, is “Mean Values and Absolute Change in Clinical Measures and Health Outcomes: prerandomization specific diagnoses”.

Before randomization, there were 2225 people with hypertension. If we assume that half got randomized to each arm, and then take the 24.1 percentage point increase in coverage the study reports, that means there were only 280 people with hypertension who got Medicaid, and who could be studied for this outcome. Further, those people had a baseline average blood pressure of 130/83. That’s remarkably well controlled! So there’s not nearly the room for improvement that you might assume.

There’s a similar story for diabetes. Before randomization, there were 872 patients with diabetes. Assign half to each group, apply the 24.1 percentage point increase in Medicaid coverage, and you’ve got about 110 patients with diabetes in the Medicaid group. And again, their average baseline A1C was 6.7%, which is pretty well controlled. How much could Medicaid do? With respect to the percentage of patients with an A1C>=6.5, there appears to be so much imprecision in the measurement that the 95% confidence interval includes the possibility that Medicaid got every single person with diabetes under control: the baseline percentage with A1C>=6.5 was 54.0, and the change in the Medicaid group was -27.0 (95% CI -71.91 to 17.92).
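For readers who want to check the back-of-the-envelope arithmetic, here is a minimal sketch. It uses only the figures quoted above and the same simplifying assumption of an even split between arms; the function and variable names are ours, for illustration only. An exact even split gives counts a little below the rounded figures in the text, which doesn’t change the argument.

```python
# Figures quoted in the post; the even 50/50 split is the post's own
# simplifying assumption, not a number taken from the study.
COVERAGE_GAIN = 0.241  # 24.1 percentage point increase in coverage


def newly_covered(n_diagnosed, share_in_lottery_arm=0.5, coverage_gain=COVERAGE_GAIN):
    """Rough count of diagnosed people who gained Medicaid through the lottery."""
    return n_diagnosed * share_in_lottery_arm * coverage_gain


hypertension_covered = newly_covered(2225)
diabetes_covered = newly_covered(872)

# Does the 95% CI for the change include wiping out the entire baseline share?
baseline_pct = 54.0            # percent with A1C >= 6.5 at baseline
ci_low, ci_high = -71.91, 17.92
ci_includes_full_control = ci_low <= -baseline_pct <= ci_high

print(round(hypertension_covered), round(diabetes_covered), ci_includes_full_control)
# prints: 268 105 True
```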

But let’s say that these numbers were artificially low because people were undiagnosed. They still would have given us pause. With so few participants with disease, it’s hard to believe that you’d eventually amass enough people to detect a clinically significant difference. And when you look at the actual numbers in Table 2, concerns still exist. Take diabetes, for instance. Only 5.1% of the control group had an A1C>=6.5 (diabetes). Let’s assume that the starting prevalence was the same in the intervention group. That means that only 624 people (312 in each group) actually had a high A1C in the study. It appears they may also have been relatively well controlled. (Aside: With such low rates of poor health, by these measures, how generalizable are the results? We’ll consider that question another time.)

This same discussion holds for the other metrics. This smacks of being underpowered.

It appears that uncontrolled diabetes was not, in fact, especially prevalent in this population. That being the case, we’re not sure what effect, if any, you could expect Medicaid to have on this population with respect to A1C. Can we agree that if there are relatively few people in the study with diabetes, and those who have it are relatively well controlled, then the study itself probably can’t detect a clinically significant change? This is a BIG difference from saying Medicaid could have had a significant impact, but didn’t.

It should be possible to say something like the following, only with the numbers filled in: “We believed that there would be about X people in the study who would have diabetes, and that Y% of them would have A1Cs greater than 6.5. We believed that a clinically important reduction in this percentage would be Z, and given the variability in A1C levels, the study was powered to detect that change.” We’ve reached out to the study authors to try to fill in those numbers.
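To make the fill-in-the-blanks statement concrete, here is a rough sketch of what an ex ante power calculation can look like. It uses a textbook normal approximation for comparing two proportions, not the study authors’ method, and the inputs for X, Y, and Z are hypothetical placeholders that we chose purely for illustration. It also ignores the lottery design (only about a quarter of lottery winners gained coverage), which would push power lower still.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_arm, p_control, reduction, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    Normal approximation; the negligible chance of rejecting in the
    wrong direction is ignored.
    """
    p_treated = p_control - reduction
    # Standard error of the difference between the two sample proportions.
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treated * (1 - p_treated) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(reduction / se - z_crit)


# HYPOTHETICAL inputs, not taken from the study:
#   X = 110 people with diabetes per arm
#   Y = 54% of them with A1C >= 6.5
#   Z = a 10 percentage point reduction judged clinically important
power_small = approx_power(n_per_arm=110, p_control=0.54, reduction=0.10)
power_large = approx_power(n_per_arm=1000, p_control=0.54, reduction=0.10)
print(f"power with 110 per arm: {power_small:.2f}")
print(f"power with 1000 per arm: {power_large:.2f}")
```

Under these made-up inputs the smaller sample would detect the assumed effect only about a third of the time. That is the kind of answer a power statement is supposed to give before any data are collected.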

Our concern remains that it appears unlikely that there were enough people with uncontrolled diabetes that you could detect a clinically significant change with statistical significance. If we’re wrong, we’re more than happy to be proven so. But if we’re right, then it has some pretty big implications for how this study should be interpreted. If we’re right, then it’s not possible that Medicaid could have achieved a statistically significant difference. The deck was stacked against the program.

Lots of people are claiming this is a smoke screen and that the ex post confidence intervals are enough of a power calculation. They’re not. Let us put this another way: Our problem with the ex post way many are talking about the study is that the analysis did show improvements. And no one is claiming that the improvements aren’t good enough clinically. (See, for example, the annotated table at the end of this Kevin Drum post.) They’re only claiming they aren’t statistically significant.

So was it that the improvements weren’t big enough, or that the sample was too small? We can clearly see from the confidence intervals how much bigger the improvements would have to be in order for them to be statistically significant with the sample available. But it’s also true that if the study were larger, by some amount, and found the same point estimates to be statistically significant, we’d not be having this conversation. With a big enough sample, even the smallest differences are statistically significant, yet the study’s point estimates aren’t small. Was this study capable of finding clinically and statistically significant effects of reasonable size? This is what an ex ante power calculation is for. It informs the researcher as to what is even worth trying to examine.
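As a rough illustration of “larger, by some amount,” take the diabetes figure quoted earlier in this post: a change of -27.0 with a 95% CI of -71.91 to 17.92. The sketch below assumes the point estimate stays exactly the same and that the standard error shrinks with the square root of the sample size. Both are simplifications on our part, and the function name is ours.

```python
def sample_multiplier(estimate, ci_low, ci_high):
    """Factor by which the sample must grow for the CI to just exclude zero.

    Assumes a symmetric interval, an unchanged point estimate, and a
    standard error proportional to 1 / sqrt(n).
    """
    half_width = (ci_high - ci_low) / 2
    # The half-width must shrink from half_width to |estimate|; because it
    # scales as 1/sqrt(n), n must grow by the square of that ratio.
    return (half_width / abs(estimate)) ** 2


multiplier = sample_multiplier(-27.0, -71.91, 17.92)
print(f"sample would need to be about {multiplier:.1f}x larger")
# prints: sample would need to be about 2.8x larger
```

In other words, under these assumptions, the same point estimate in a subgroup a bit less than three times as large would have reached conventional significance.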

We understand there are people who will claim we’re changing our tune by now questioning things. We’re not. The design of the study is fantastic. The choice of these specific outcomes and this specific analysis is what we now question.

More to come.

@aaronecarroll and @afrakt
