• If you’re not into instrumental variables (IV) econometrics and/or power calculations don’t bother reading this post. I’m not even going to try to make it widely accessible. But if you are an econ/biostats type, I have a question for you. I want to know if you have seen anything like the following in any paper or book. I am looking for a supporting reference, if it’s out there.

One thing that came up in the power calculation discussions I’ve been hosting here lately is that instrumenting for the treatment indicator reduces power. It’s convenient to do a power calculation ignoring that fact, pretending as if you are running a randomized controlled trial. But if you’re really running an IV, how much more sample do you need to achieve adequate power? Or, put another way, how much does the IV sap your power?

Every time I’ve seen this question raised, the next thing I see is that it’s too complicated to figure out. Except that it turns out that, in the linear case at least, it really isn’t. My colleague Steve Pizer did the math and got a nifty little result. To convey it, it’s simplest to consider a two-stage least squares (2SLS) setup with no controls, like this:

X = Zγ + ν   (first stage)

Y = Xβ + ε   (second stage)

where X is the vector of treatment indicators, Z is the vector of instruments, Y is the vector of outcomes, and all the usual assumptions for 2SLS apply. (To consider a case for which there are additional control variables, first regress the treatment, instrument, and outcome variables on them and compute the residuals. Then use the above on these residual versions. The result below still works out, with one change.)
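
To make the setup concrete, here is a minimal Python sketch of 2SLS with one instrument, no intercept, and no controls, run on simulated data. Everything specific in it is an illustrative assumption rather than part of Steve's derivation: the function names, the parameter values (beta = 2, gamma = 0.5), the use of continuous rather than binary variables, and the shared confounder `u` that makes naive OLS biased.

```python
import random


def two_stage_least_squares(z, x, y):
    """2SLS with a single instrument, no intercept, no controls."""
    # First stage: regress X on Z to get gamma_hat and the fitted values.
    gamma_hat = sum(zi * xi for zi, xi in zip(z, x)) / sum(zi * zi for zi in z)
    x_hat = [gamma_hat * zi for zi in z]
    # Second stage: regress Y on the first-stage fitted values.
    beta_hat = sum(xh * yi for xh, yi in zip(x_hat, y)) / sum(xh * xh for xh in x_hat)
    return gamma_hat, beta_hat


def ols_no_intercept(x, y):
    """Naive OLS of Y on X, ignoring endogeneity."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def simulate(n=20000, beta=2.0, gamma=0.5, seed=0):
    """Hypothetical data: X = Z*gamma + nu, Y = X*beta + eps, with nu and eps correlated."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        u = rng.gauss(0, 1)                    # unobserved confounder (assumption)
        xi = gamma * zi + u                    # first stage, nu = u
        yi = beta * xi + u + rng.gauss(0, 1)   # second stage, eps = u + noise
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
gamma_hat, beta_iv = two_stage_least_squares(z, x, y)
beta_ols = ols_no_intercept(x, y)
print(f"first-stage gamma_hat = {gamma_hat:.3f}")
print(f"2SLS beta_hat = {beta_iv:.3f}, naive OLS beta_hat = {beta_ols:.3f}")
```

In this simulated design the 2SLS estimate should land near the true beta while the naive OLS estimate is pulled upward by the confounder.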

Assume you have done a power calculation that suggests you need N observations in the treatment group* to obtain a sufficiently powered estimate of the effect of treatment X on outcome Y, pretending it’s a randomized trial (no IV). Steve showed that the IV setup requires N/R² observations, where R² is the “R-squared” of the first-stage shown above. That is, the less predictive power the first stage has (the lower its R²) the more observations you need, which is intuitive. Also, if the instrument is the treatment indicator (a limiting case), R² is obviously 1, and you get back the result that you need N observations for sufficient power. Finally, if your instrument has no predictive power, R² is 0, and you need infinite observations, which is sensible. (In the case for which you did this in the residual space to handle additional controls, the number of observations required is X’X/R². It just turns out that X’X = N when X is a vector of treatment indicators.)
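
The N/R² rule is easy to turn into a small helper. A standard way to see where the factor comes from (not necessarily the route Steve took): under homoskedastic errors the 2SLS variance is σ²/(X̂’X̂), and in the no-intercept setup above X̂’X̂ = R²·X’X, so the variance is the no-IV variance inflated by 1/R². The Python sketch below just applies that scaling; the function name and the example value N = 1000 are illustrative assumptions.

```python
import math


def iv_required_n(n_rct, r2_first_stage):
    """Scale an RCT-style sample size by 1/R^2 of the first stage.

    n_rct: treatment-group size from a power calculation that ignores the IV.
    r2_first_stage: R-squared of the first-stage regression, between 0 and 1.
    """
    if not 0.0 <= r2_first_stage <= 1.0:
        raise ValueError("first-stage R-squared must lie in [0, 1]")
    if r2_first_stage == 0.0:
        # An instrument with no predictive power needs infinite observations.
        return math.inf
    return n_rct / r2_first_stage


# Hypothetical RCT-style requirement of 1000 treated observations.
for r2 in (1.0, 0.5, 0.25, 0.0):
    print(f"R^2 = {r2:4.2f} -> required N = {iv_required_n(1000, r2)}")
```

The two limiting cases in the text fall out directly: R² = 1 returns N unchanged, and R² = 0 returns infinity.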

This is such a simple, appealing result that someone else must have written it down in some book or paper. My question for you is, who and where?

Steve’s derivation is below. I can’t be bothered to type up all the equations because it’s a pain. I apologize for his handwriting, though he may not.

* Go ahead and assume N observations in the control group too, though I think this all works just fine if the power calculation is done such that the control group size is some specified proportion of the treatment group size, like rN for some scalar r > 0. @afrakt

• This is why I come here

• I believe the Sargan test is derived in a similar way. The Sargan test statistic is N*R^2.

• Forgot the citation.
Sargan, John D. (1958) “The Estimation of Economic Relationships Using Instrumental Variables”, Econometrica 26, 393-415

• Looked in the sections on weak instruments in Mostly Harmless Econometrics and the online Imbens/Wooldridge notes and nothing jumped out at me immediately. It may be buried in the Imbens/Wooldridge notes on weak and many instruments here http://www.nber.org/WNE/lect_13_weakmany_iv.pdf

Also, these random lecture notes I found with a quick Google search of “2SLS variance” seem to contain a similar proof http://www.soderbom.net/lec2n_final.pdf

• I haven’t checked Steve’s math, but assuming it’s right, this means that the Oregon study requires about 15 times as many observations for the same power as an RCT with perfect treatment compliance. (For a linear probability model where a single RHS dummy variable with mean 0.5 increases the probability of the outcome by 0.25, the R^2 is 1/15.) You can check this with the following Stata code:

set obs 8
gen x = 0 in 1/4
replace x = 1 in 5/8
gen y = 1 in 1/2
replace y = 0 in 3/5
replace y = 1 in 6/8
reg y x
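
For readers without Stata, the same eight-observation toy data can be checked with a short Python sketch. It computes the slope and R² of the regression of y on x (with an intercept) by hand; the helper name `slope_and_r2` is an illustrative assumption.

```python
# Same toy data as the Stata snippet: x = 0 for obs 1-4, x = 1 for obs 5-8.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1, 1, 0, 0, 0, 1, 1, 1]


def slope_and_r2(x, y):
    """Slope and R-squared of a simple regression of y on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # With one regressor, R-squared equals the squared correlation.
    return sxy / sxx, sxy ** 2 / (sxx * syy)


slope, r2 = slope_and_r2(x, y)
print(f"slope = {slope}, R^2 = {r2:.4f}, 1/R^2 = {1 / r2:.1f}")
```

The slope is the 0.25 increase in the probability of the outcome, and 1/R² is the sample-size multiplier implied by the N/R² rule.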

• Great post. Thanks, Austin!

• I haven’t looked through the calculations, but here’s a quick and dirty adjustment for the simplest case: Suppose you have a binary IV Z (indicating offered Medicaid) and a binary indicator of treatment received X (indicating enrolled in Medicaid), and an outcome Y. Say we’re testing for an additive effect of X on Y, let’s call it c = (mean with treatment)-(mean without treatment), and that nobody in the Z=0 arm has access to treatment (so X=0 everywhere Z=0). Now suppose that a proportion p of the active arm are “dropouts” with Z=1 but X=0 (e.g. those who were offered but didn’t enroll in Medicaid). Then the intent to treat effect (Z on Y) is (1-p)c, because the mean response in the active arm becomes (1-p)*(mean with treatment) + p*(mean without treatment) while in the control group it is simply (mean without treatment). Because we’re assuming the IV is valid, testing for a null ITT effect is equivalent to testing for a null treatment effect. Therefore the sample size calculated by assuming a perfect IV (or perfect compliance in an RCT) should be scaled up by a factor of 1/(1-p)^2 to account for the effect moderation. From previous posts I see the rate of uptake in the Oregon study is about 25% (so p is about 0.75), which would in fact give you a required 16x increase in sample size.
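
The arithmetic in this adjustment can be sketched in a few lines of Python. The function names are illustrative assumptions, and the sketch only encodes the simple one-sided noncompliance case described above (nobody in the Z=0 arm gets treated).

```python
def itt_effect(c, p):
    """Intent-to-treat effect when a proportion p of the offered arm never enrolls."""
    return (1 - p) * c


def sample_size_inflation(p):
    """Factor by which a perfect-compliance sample size is scaled up: 1/(1-p)^2."""
    if not 0 <= p < 1:
        raise ValueError("dropout proportion p must lie in [0, 1)")
    return 1 / (1 - p) ** 2


# Uptake of about 25% means a dropout proportion p of about 0.75.
p = 0.75
print(f"ITT effect per unit treatment effect: {itt_effect(1.0, p)}")
print(f"sample size inflation factor: {sample_size_inflation(p)}")
```

Because required sample size scales with the inverse square of the detectable effect, shrinking the effect by (1-p) is what produces the 1/(1-p)^2 factor.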