• Instrumental Variables vs. Randomized Trial

    I’ve made the claim that good observational studies of a medical therapy can be as informative as a randomized clinical trial (RCT). By a “good” observational study I mean one that handles the non-random selection of individuals into treatment appropriately, which often means using instrumental variables (IV). (Already lost? Read this.)

    One way to demonstrate that IV studies are comparably informative to RCTs is to show that results obtained either way are similar. Unfortunately, there are not many examples of health care treatments studied via both RCT and IV methods because use of the latter is rare in the field. Nevertheless, there are a few examples. Steve Pizer wrote about one in his tutorial paper on IV technique.

    A clearer focus on comparing methods was provided by Stukel et al. (2007), who used four different methods to assess the effects of cardiac catheterization on elderly patients hospitalized for acute myocardial infarction. … The investigators compared results from randomized trials to estimates from models using multivariate risk adjustment, propensity score risk adjustment, propensity score matching, and instrumental variables estimation featuring the regional cardiac catheterization rate as the identifying instrument. … Multivariable risk adjustment, propensity score risk adjustment, and propensity score matching all produced estimated reductions in mortality risk between 46 and 49 percentage points. Instrumental variables estimates were starkly different at 16 percentage points and compared more favorably to estimates from clinical trials, which ranged from 8 to 21 points. …

    So, the IV estimates were in the middle of the range found by RCTs. Meanwhile, estimates based on methods that can't control for unobservable factors affecting both selection and outcome (risk adjustment, propensity score techniques) fell well outside the RCT range. That's precisely what one would expect if one understands IV and why it is necessary.
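    To see the mechanism, here's a minimal simulation (my own illustration; the setup and all numbers are made up, not drawn from any of the studies discussed). When sicker patients are less likely to be treated and the analyst can't observe severity, a naive treated-vs-untreated comparison overstates the treatment's benefit, while an IV estimate using a valid instrument (here, a hypothetical regional practice-style indicator) recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical setup: u is patient severity, unobserved by the analyst.
# Sicker patients (high u) are less likely to be treated and more
# likely to have a bad outcome.
u = rng.normal(size=n)                       # unobserved severity
z = rng.binomial(1, 0.5, size=n)             # instrument: high- vs. low-intensity region
d = (0.8 * z - u + rng.normal(size=n) > 0).astype(float)   # treatment
y = -0.15 * d + 0.2 * u + rng.normal(scale=0.1, size=n)    # true effect: -0.15

# Naive comparison of treated vs. untreated: biased, because the
# treated are healthier in ways no observed covariate captures.
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald/IV estimate: the instrument's effect on outcomes scaled by its
# effect on treatment. Valid because z shifts treatment but is
# unrelated to severity u.
itt = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
iv = itt / first_stage

print(f"naive: {naive:.3f}, IV: {iv:.3f}, truth: -0.150")
```

The naive estimate comes out far more negative than the true -0.15, echoing the pattern above: adjustment-only methods overstated the mortality reduction, while IV landed near the trial results.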

    Stukel et al. (2007) comment on an earlier IV study of cardiac catheterization by McClellan, McNeil, and Newhouse (1994) that found a lower reduction in mortality risk using differential distances to alternative types of hospitals as instruments. Results of the two studies are not reported in the same metric so they are not immediately comparable. However, there is sufficient information to make at least an approximate conversion (hint: see the asterisk footnote of Table 5 of Stukel et al. that provides a formula to approximately convert between an absolute mortality difference and a relative mortality rate). Doing so reveals that McClellan, McNeil, and Newhouse report an 8.5% reduction in mortality risk, nearly half that of Stukel et al., though still within the 8-21% range of RCTs.
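    The back-of-envelope conversion is just division: an absolute mortality difference in percentage points, divided by the comparison group's mortality rate, approximates the relative reduction. The numbers below are hypothetical, purely for illustration (consult the Table 5 footnote of Stukel et al. for their exact formula):

```python
def relative_reduction(abs_diff_pp, baseline_mortality_pct):
    """Approximate relative mortality reduction (%) implied by an
    absolute mortality difference (percentage points) against a given
    baseline mortality rate (%). Hypothetical illustration only."""
    return 100.0 * abs_diff_pp / baseline_mortality_pct

# Hypothetical numbers: a 3-point absolute drop against a 20% baseline
# mortality rate is a 15% relative reduction.
print(relative_reduction(3.0, 20.0))  # 15.0
```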

    Stukel et al. attribute the difference in results between the two IV studies to differences in the degree to which instruments predict treatment, suggesting that the earlier study’s results may be biased downward due to weak instruments. McClellan, McNeil, and Newhouse note that the mortality reduction they find is “achieved during the first day of hospitalization and therefore appears attributable to treatments other than the procedures.” (See also Newhouse and McClellan 1998.)
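    The weak-instrument concern is checkable. The standard diagnostic is the first-stage F-statistic for the instrument; a common rule of thumb is that F should exceed 10. A minimal sketch, with made-up data (a single continuous "treatment intensity" and a binary instrument, none of it from the papers above):

```python
import numpy as np

def first_stage_F(d, z):
    """F-statistic for a single instrument z in a regression of
    treatment d on a constant and z (equals the squared t-stat of
    the slope)."""
    z_c = z - z.mean()
    beta = (z_c @ d) / (z_c @ z_c)             # OLS slope
    resid = d - d.mean() - beta * z_c
    sigma2 = (resid @ resid) / (len(d) - 2)    # residual variance
    se = np.sqrt(sigma2 / (z_c @ z_c))         # slope standard error
    return (beta / se) ** 2

rng = np.random.default_rng(1)
n = 5_000
z = rng.binomial(1, 0.5, size=n).astype(float)
noise = rng.normal(scale=0.5, size=n)

strong = 0.30 * z + noise   # instrument moves treatment a lot
weak = 0.01 * z + noise     # instrument barely moves treatment

F_strong = first_stage_F(strong, z)   # large
F_weak = first_stage_F(weak, z)       # small
print(f"strong instrument F = {F_strong:.1f}, weak instrument F = {F_weak:.2f}")
```

When the first stage is this weak, the IV estimate is noisy and biased toward the naive estimate, which is why reporting first-stage strength matters.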

    IV and RCT results compare favorably in studies of the effects of smoking by pregnant women on their child’s birth weight. Evans and Ringel (1999) use cigarette taxes as an instrument for smoking and find that birth weight is lower by 353-594 grams, depending on model specification. Results from an RCT on prenatal care that included a smoking cessation component puts the figure at 400 grams. Results for indicators of low (< 2,500 grams) and very low (< 1,500 grams) birth weight are also similar between the IV- and RCT-based studies.

    More thorough analyses of randomized vs. observational design results can be found outside of health services research. For example, Cook, Shadish, and Wong (2008) compare randomized and observational results from twelve job training and education program evaluations.

    Of the 12 recent within-study comparisons reviewed here from 10 different research projects … eight of the comparisons produced observational study results that are reasonably close to those of their yoked experiment, and two obtained a close correspondence in some analyses but not others. Only two studies claimed different findings in the experiment and observational study, each involving a particularly weak observational study. Taken as a whole, then, the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within study comparison literature.

    Of the observational studies that did produce results comparable to their experimental counterparts, one involved IV and three exploited quasi-randomness akin to that upon which IV relies (regression discontinuity). The unavoidable conclusion is that observational studies for which sources of exogenous randomness can be identified produce results comparable to those that might be obtained from a randomized controlled experiment.


    Cook, T.D., Shadish, W.R., Wong, V.C. (2008). "Three Conditions under Which Experiments and Observational Studies Produce Comparable Causal Estimates: New Findings from Within-Study Comparisons." Journal of Policy Analysis and Management 27(4): 724-750.

    Evans, W.N., Ringel, J.S. (1999). "Can higher cigarette taxes improve birth outcomes?" Journal of Public Economics 72: 135-154.

    McClellan, M., McNeil, B.J., Newhouse, J.P. (1994). "Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables." JAMA 272: 859-866.

    Newhouse, J.P., McClellan, M. (1998). "Econometrics in outcomes research: the use of instrumental variables." Annual Review of Public Health 19: 17-34.

    Stukel, T.A., Fisher, E.S., Wennberg, D.E., Alter, D.A., Gottlieb, D.J., Vermeulen, M.J. (2007). "Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods." JAMA 297(3): 278-285. doi:10.1001/jama.297.3.278

    Comments closed
    • How about instances where propensity models and RCTs were similar but IVs were not? Or instances of discordance between IV and RCTs? Could the above just be a skewed sample showing positive correlation?

      • @Brad F: These weren't cherry-picked. They're all the comparisons between IV and RCTs in health services that I'm aware of (I consulted some other experts, folks who should know). One could do a wider study going outside health services. I included a bit of that, but one has to stop somewhere.

        I'm open-minded about this. I expect some IV studies would not produce valid results. But I believe in the theory of IV (anyone who gets the math should as well; it's pretty airtight), so the problem in such cases is the lack of valid instruments. Since one can test instruments in many (though not all) cases, the most important thing to ask about an IV study is: what are those test results? Some authors don't report them, which is an insult to science.