A follow-up to this post is here. It includes instructions on how to run your own power calculations.
Let’s do the math. In the Oregon study, 5.1 percent of the people in the control group had elevated GH [glycated hemoglobin, aka A1C, or colloquially, blood sugar] levels. Now let’s take a look at the treatment group. It started out with about 6,000 people who were offered Medicaid. Of that, 1,500 actually signed up. If you figure that 5.1 percent of them started out with elevated GH levels, that’s about 80 people. A 20 percent reduction would be 16 people.
So here’s the question: if the researchers ended up finding the result they hoped for (i.e., a reduction of 16 people with elevated GH levels), is there any chance that this result would be statistically significant? [...] The answer is almost certainly no. It’s just too small a number.
I plugged these numbers into Stata’s sample size calculation program (sampsi) to do a power calculation for the difference between two proportions. I found the probability that we can reject the null hypothesis that Medicaid has no effect on GH levels to be 0.35. Under ordinary interpretations of statistical significance, the null cannot be rejected. We knew this from the paper, and, hence, all the hubbub. (Never mind that we also cannot reject a much larger effect. The authors cover this in their discussion.)
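For readers without Stata, here is a rough cross-check of that 0.35 figure in Python. It is a minimal sketch, not the sampsi implementation: it assumes a two-sided test at alpha = 0.05, a plain normal approximation, and unpooled variances p(1-p) for each group. The function name `power_two_groups` is made up for illustration. The inputs are the same ones used in my Stata code at the end of this post.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(p1, p2, n1, n2, alpha=0.05):
    """Approximate two-sided power for detecting p1 != p2.

    Assumption: normal approximation with unpooled variances p*(1-p).
    """
    z = NormalDist()
    # Standard error of the difference between the two group rates
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # True difference expressed in standard-error units
    shift = abs(p1 - p2) / se
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Baseline rate 5.1%, absolute reduction 0.0093, groups of 6,000 and 1,500
power = power_two_groups(p1=0.051, p2=0.051 - 0.0093, n1=6000, n2=1500)
print(f"power = {power:.2f}")
```

Under these assumptions the result should land at about 0.35, in line with the sampsi output.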
The standard level of statistical significance is 0.05. Suppose we also want to reject the null with 0.95 probability when the effect is real (that is, power of 0.95). Assuming the same baseline 5.1% elevated GH rate and a 20% reduction under Medicaid, what sample size would we need? Plugging and chugging, I get about 30,000 for the control group and 7,500 for the treatment (Medicaid) group. (I’ve fixed the Medicaid take-up rate at 25%, as found in the study.) This is a factor of five bigger than the researchers had.
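The sample size question can be sketched the same way by inverting the power formula. This is a rough illustration under stated assumptions, not the exact sampsi routine: two-sided alpha of 0.05, target power of 0.95, normal approximation, unpooled variances, and a treatment group fixed at 25% of the control group. The function name `control_size_needed` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def control_size_needed(p1, p2, ratio, power=0.95, alpha=0.05):
    """Approximate control-group size for a two-sided test of p1 != p2.

    ratio is n2 / n1 (treatment size relative to control size).
    Assumption: normal approximation with unpooled variances p*(1-p).
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance of the difference, scaled so that dividing by n1 gives se^2
    variance_per_control = p1 * (1 - p1) + p2 * (1 - p2) / ratio
    return ceil(z_total ** 2 * variance_per_control / (p1 - p2) ** 2)


n1 = control_size_needed(p1=0.051, p2=0.051 - 0.0093, ratio=0.25)
n2 = ceil(0.25 * n1)
print(f"control: {n1:,}  treatment: {n2:,}  multiple of 6,000: {n1 / 6000:.1f}")
```

With these inputs the control group should come out at roughly 31,000, which is where the "about 30,000" and "factor of five" figures come from.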
- I’m taking the baseline rate, 5.1%, from the study itself. But we know it is estimated with some imprecision. Maybe a different, more reliable rate is available elsewhere, but I don’t know what it is. Suffice it to say, it would take a lot of error on this number to overcome a factor of five in sample size. Assuming, again, a 20% reduction due to the intervention, I calculated that if the baseline rate were about four times bigger (e.g., about 20% instead of 5.1%), then the sample in the paper would have been sufficient to reject the null with 95% probability.
- The analysis in the study is not as simple as a straight comparison of two proportions. There is some multivariate adjustment for observable factors. There are some tweaks due to measurement of multiple outcomes and weighting for survey design. It’s also an instrumental variables (IV) analysis, which retains in the sample patients who were randomized to treatment (won the lottery) but didn’t enroll in Medicaid. (This actually decreases power.) For these reasons, my power calculation is not fully correct. But, still, it would take a lot to overcome a factor of five in sample size.
- It is always possible I’ve made an error. If so, I’m happy for someone to correct it. Below is my code and output, using sampsi. Anyone think I did something wrong?
local r = 0.0093 /* absolute change in rate due to intervention, from the paper */
local p1 = 0.051 /* baseline rate, from the paper */
local p2 = `p1' - `r' /* treatment group rate */
local sd1 = sqrt(`p1'*(1-`p1')) /* standard deviation of baseline rate */
local sd2 = sqrt(`p2'*(1-`p2')) /* standard deviation of treatment group rate */
local n1 = 6000 /* control group size, approx from the paper */
local n2 = int(0.25*`n1') /* treatment group size, approx from the paper */
sampsi `p1' `p2', sd1(`sd1') sd2(`sd2') n1(`n1') n2(`n2')
Estimated power for two-sample comparison of means
Test Ho: m1 = m2, where m1 is the mean in population 1
and m2 is the mean in population 2
alpha = 0.0500 (two-sided)
m1 = .051
m2 = .0417
sd1 = .219998
sd2 = .199903
sample size n1 = 6000
n2 = 1500
n2/n1 = 0.25
power = 0.3517
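The first caveat above, about how far off the baseline rate would have to be, can also be checked with a short sweep. Again, this is only a sketch under assumptions: a 20% relative reduction at every baseline rate, the paper's approximate group sizes (6,000 and 1,500), two-sided alpha of 0.05, and the same normal approximation that sampsi's comparison of means implies. The baseline rates other than 5.1% are illustrative values, not figures from the study.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(p1, p2, n1, n2, alpha=0.05):
    """Approximate two-sided power (normal approximation, unpooled variances)."""
    z = NormalDist()
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p1 - p2) / se
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative baseline rates; only 0.051 comes from the paper
baselines = [0.051, 0.10, 0.15, 0.20]
powers = {}
for base in baselines:
    treated = base * (1 - 0.20)  # assumed 20% relative reduction
    powers[base] = power_two_groups(base, treated, n1=6000, n2=1500)
    print(f"baseline {base:.3f} -> power {powers[base]:.2f}")
```

Under these assumptions, power only climbs past 0.95 once the baseline rate is near 20%, roughly four times the rate reported in the study.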
UPDATE: Mentioned the power-reducing effect of IV.