• Sorry for the false start last night. More about that here. I still welcome comments if you find any errors.

This is a follow up to my several prior posts on how to adjust a power calculation to account for an instrumental variable (IV) design. The details are in a new PDF. (If you downloaded the one I posted last night, replace it with this one.) First, for the non-geeks:

• Skip the proof and jump to the example that begins on page two of the PDF. It runs through the numbers for the Medicaid study result for glycated hemoglobin (GH), which I had used to illustrate the power issues in my first post on this topic. (It’s a commentary on this blog’s readership that I can even consider this example suitable for non-geeks. I guess I mean geeks of a different order.)
• One thing you may notice is that the Medicaid and non-Medicaid groups are different sizes than you might have expected if you only read the paper and not the appendix. I refer you to appendix table S9 for the details. Suffice it to say, it is not true that 24.1% of the lottery winners took up Medicaid. There were a lot more Medicaid enrollees than that. (What is true is that 24.1% more lottery winners took up Medicaid than non-winners.)
• For that reason, and because I was targeting 95% power, my estimate in my first post was quite a bit off. I thought the study was underpowered by a factor of 5 for the GH measure. Actually, according to the methods in that post, and using the new numbers and targeting 80% power (which, I am told, is more standard), the study is only underpowered by a factor of 1.5.
• But, as I wrote in that post, I had not accounted for the IV design. The new calculation does so. And that, my friends, really wallops power and precision. The bottom line is, accounting for the design, the GH analysis was underpowered by a factor of about 23 (yes, twenty-three!), meaning it would have needed that many times the sample to detect a true Medicaid effect with 80% probability.
• You can run the numbers for other measures using this online tool; the degree of underpowering will vary. Below is a screenshot of the inputs for the GH analysis. Follow the steps in the PDF for the rest. (Hint: multiply the sample sizes from the online tool by 14.8.)
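
For readers who want to see the mechanics, here is a rough sketch of how an IV first stage inflates the required sample size. This is not the exact calculation in the PDF (the 14.8 multiplier there also reflects the unequal group sizes), and the effect and SD values below are illustrative placeholders, not the study's numbers. The intuition: in a simple Wald/2SLS setup, the IV standard error is roughly the intent-to-treat standard error divided by the first-stage effect, so the required N is inflated by one over the first stage squared.

```python
from statistics import NormalDist

def rct_n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Classic two-sample normal approximation: n per arm to detect `effect`."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    return 2 * ((za + zb) * sd / effect) ** 2

def iv_n_per_arm(effect, sd, first_stage, alpha=0.05, power=0.80):
    """Sketch of the IV adjustment: the 2SLS SE is roughly the ITT SE divided
    by the first-stage effect (here, the 0.241 difference in Medicaid take-up
    between lottery winners and losers), so n is inflated by 1/first_stage**2."""
    return rct_n_per_arm(effect, sd, alpha, power) / first_stage ** 2

# Placeholder effect size and SD, for illustration only.
n_rct = rct_n_per_arm(effect=0.009, sd=0.22)
n_iv = iv_n_per_arm(effect=0.009, sd=0.22, first_stage=0.241)
print(round(n_iv / n_rct, 1))  # inflation factor 1/0.241**2, about 17.2
```

Note that this simple version over- or under-states the PDF's multiplier depending on design details; it is meant only to show why a first stage of 0.241 is so costly for power.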

Now, for the uber-geeks, the content of the PDF differs from my prior version of a few days ago in three ways:

1. It properly accounts for the fact that we were assuming all vectors were zero mean. That didn’t affect the result, but it does affect how you should simulate the first stage (which we’ve done for you for the Medicaid study in the document).
2. It references Wooldridge, who obtained the same result. (So, we’re right!)
3. It includes a complete example from the Medicaid study. However, don’t overlook the fact that this generalizes. Truth be told, I didn’t do all this to comment on the Medicaid study. I need this for my own work.

I should point out that the finding that the variance of the effect estimate in an RCT scales with the inverse of Np(1-p) is beautiful. It doesn't scale with just 1/N because the estimate is the mean of a difference. When p goes to zero, there are no treated subjects; when p goes to 1, there are no controls. Either way, the variance of the estimated difference has to go to infinity. And, indeed, it does. This is comforting intuition.
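
That scaling can be verified in one line of algebra: with Np treated and N(1-p) controls sharing a common outcome variance sigma², the variance of the difference in means is sigma²/(Np) + sigma²/(N(1-p)), which collapses to sigma²/(Np(1-p)). A quick sketch:

```python
def diff_in_means_variance(sigma2, N, p):
    """Variance of (treated mean - control mean) with N*p treated subjects
    and N*(1-p) controls, each outcome having variance sigma2."""
    return sigma2 / (N * p) + sigma2 / (N * (1 - p))

sigma2, N = 1.0, 1000
for p in (0.5, 0.9, 0.99, 0.999):
    v = diff_in_means_variance(sigma2, N, p)
    # The two-term form equals sigma2 / (N * p * (1 - p)) exactly:
    assert abs(v - sigma2 / (N * p * (1 - p))) < 1e-12
    print(p, v)  # variance grows without bound as p -> 0 or 1
```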

Finally, I’m grateful for the awesome feedback I’ve received from readers. Once again, the TIE community has hit this one out of the park. Thank you.

@afrakt

• Dear Austin,

Thank you very much for such an informative post! I’m interested in learning more about IV methods, so this is a valuable resource, and I will be studying it quite closely. I am also grateful for the (implicit) endorsement of the Wooldridge text, and will be ordering a copy from Amazon for my library.

I’m afraid that your conclusions are underwhelming. The problem is not that your results are untrue, but rather that they are trivial. To calculate the power of the study, you substitute in the empirical values observed in the study. But this is well known to be uninformative — you will ALWAYS obtain a very low estimate of power by using the numbers from a non-significant result. Basically, a result is statistically non-significant when the noise in the data is large compared to the effect, but to obtain good power we need the opposite. So I didn’t need to follow all your work to know that you found the study to be “underpowered”; all I needed to know was that you were using estimates from a statistically non-significant result as the basis for your computations.

But suppose I accept the argument that the study was underpowered. The natural conclusion is that the sample size was insufficient to achieve significance because the effect of Medicaid coverage was very small. Personally, I think the whole issue of statistical significance for such a large study is a red herring; instead you should focus on the point estimates, which are our best estimates of the effect of Medicaid coverage. The study showed a reduction of less than 1 percentage point in elevated glycated hemoglobin in the population — are you willing to argue that that is clinically meaningful?

• I think the previous commenter makes a good point.

It would be useful if everyone could just completely ignore statistical significance and power for a few days and instead we could just talk about the point estimates themselves. Do we think they are small or large? Relative to expectations? Relative to other research? Relative to the cost of providing the benefits? I just have very little idea how health policy researchers would answer these questions.

Sometimes I think that the whole hypothesis testing framework actually does more harm than good.

• Did you read Aaron Carroll’s posts about just that on this blog recently? If not, look for them. They were just a day or two ago.

• Hi Austin,

Theodore W. is correct: by saying that, keeping the percentages the same, you would need a sample size 23 times as large, you are only indicating how statistically insignificant the result is.

To determine whether the study has enough power, it would be much better to ask: How big would the Medicaid effect have to be to get a statistically significant result? I just skimmed through the PDF, so I may well be wrong, but let me guess (wildly) that the effect would have to be sqrt(22.9) times as large. In other words, the point estimate of the Medicaid effect would have to be -4.45%. Just a bit unrealistic for a condition which has only a 5.1% incidence in the non-Medicaid population.

Since not even the most fervent proponents of Medicaid would expect it to bring the incidence of high GH down by 90%, this clearly indicates that the study is vastly underpowered.
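
The back-of-envelope reasoning above can be checked directly. Since a standard error shrinks with the square root of N, a sample 22.9 times larger buys the same significance as an effect sqrt(22.9) times larger; working backward from the commenter's -4.45 figure implies a point estimate near -0.93 percentage points (inferred from the comment's numbers, not taken from the study text):

```python
from math import sqrt

inflation = 22.9          # commenter's estimate of how underpowered the study is
scale = sqrt(inflation)   # SE shrinks with sqrt(N), so detectable effect scales this way
required_effect = 4.45    # percentage points, per the comment above
implied_estimate = required_effect / scale
baseline = 5.1            # incidence of high GH in the non-Medicaid group, in percent

print(round(scale, 2))                       # about 4.79
print(round(implied_estimate, 2))            # about 0.93 percentage points
print(round(required_effect / baseline, 2))  # about 0.87, i.e. close to a 90% reduction
```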

• I understand. Can you all either run your own numbers or be patient? I am getting to what you want to know. Stay tuned.

• I have a question: would it be reasonable for the study to report the “minimum detectable effect size” implied by 80% power and the actual sample size? That is, instead of solving for the sample size needed for a given effect size and power, solve for the effect size given the sample size and power. This might be useful in cases like this one, where the investigators do not have full control over the sample size, perhaps similar to the comment Aaron made about why a power analysis was not conducted for the device study.
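
The question above has a standard answer in the power-analysis literature: given the standard error a design actually achieved, the minimum detectable effect at level alpha with the target power is (z for 1-alpha/2 plus z for power) times the SE, about 2.8 times the SE for alpha = 0.05 and 80% power. A minimal sketch, with a placeholder SE rather than the study's (plug in the IV-adjusted SE from the PDF to get the study-specific MDE):

```python
from statistics import NormalDist

def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true effect detectable with the given power at level alpha,
    given the standard error the design actually achieved."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * se

# Placeholder SE of 1.0, for illustration only: the result is the multiplier itself.
print(round(minimum_detectable_effect(se=1.0), 2))  # about 2.8
```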