• Better than Harmless Econometrics

    Josh Angrist and Jörn-Steffen Pischke sent me a copy of their modestly titled book Mostly Harmless Econometrics: An Empiricist’s Companion (let’s call it MHE for short). If the title sounds slightly familiar then you’re probably a Douglas Adams fan (he wrote a Mostly Harmless book too), and you’d be correct in assuming that MHE is not your ordinary econometrics text.

    Angrist and Pischke claim their style has a “certain lack of gravitas.” With an emphasis on the practical and intuitive use of the most common, widely applicable, and relatively simple econometric models, they provide a far less intimidating tour than most texts do of techniques for evaluating social experiments, whether artificially or naturally randomized. Nevertheless, this book has math, more than I cared to study closely on first read, particularly in later chapters covering more advanced material.

    Yet, the writing style is far less stodgy than typical academic texts. The fun begins in Chapter 1 (Questions about Questions), in which Angrist and Pischke write,

    Research questions that cannot be answered by any experiment are FUQs: fundamentally unidentified questions. What exactly does a FUQ look like? …

    Suppose we are interested in whether children do better in school by virtue of having started school [at age 7 instead of 6]. … To be concrete, let’s look at test scores in first grade.

    The problem with this question … is that the group that started school at age 7 is older. And older kids tend to do better on tests, a pure maturation effect. … The problem here is that for students, start age equals current age minus time in school. … [T]he effect of start age on elementary school test scores is impossible to interpret even in a randomized trial, and therefore, in a word, FUQed.

    Putting aside the FUQed, Angrist and Pischke explain the essentials of causal analysis for observational studies, beginning with a gentle introduction to the selection problem and regression in Chapter 2 (The Experimental Ideal). One can gain tremendous insight with little heavy lifting by reading that brief, 12-page chapter alone.

    The real guts of the subject are presented in Chapter 3 (Making Regression Make Sense) and Chapter 4 (Instrumental Variables in Action). Slightly more advanced material is found in the final four chapters, covering fixed effects, differences-in-differences, regression discontinuity, quantile regression, and standard error estimation. I skimmed those final chapters only closely enough to know what’s there, for future reference. My main interest was in improving my understanding of IV basics, for which close reading beyond Chapter 4 is not necessary.

    MHE is not only an econometrics reference and tutorial, it’s also a guide to a subset of the observational study literature that applies sound technique. Every method is motivated and illuminated by reference to or examples from published work. That’s particularly valuable to the publishing practitioner who needs to demonstrate adherence to proven methodology by reference to prior studies.

    Thus, MHE is better than “mostly harmless,” and I recommend it highly, particularly to those who evaluate social programs or clinical trials, or who otherwise wish to estimate causal effects from experimental or observational data. Yet I can think of a few small ways MHE could be enhanced. My least important suggestion is an index of stylized facts. There are a handful of main points that the practitioner should carry around in his head, knowing he can look up the details when necessary. These might include, for example: propensity scores control only for observable differences between treatment and control groups (pp. 86-87); independence of the instrument from potential outcomes is a different idea than an exclusion restriction (p. 155; this, by the way, is a mind-bender and took me some time to appreciate); an outcome should not be included as one of the regressors (pp. 64-68); and nonlinear models are very rarely necessary and very often lead to trouble (p. 190).

    One problem with nonlinear models is that they generate biased results with two-stage prediction substitution, a fact Angrist and Pischke discuss in Chapter 4. What they don’t mention, though it deserves mention, is that one can obtain unbiased estimates of causal effects with nonlinear models using two-stage residual inclusion (2SRI), which is surprisingly simple and easy to implement (Terza, Basu and Rathouz, 2008). This is only important in the small subset of circumstances in which linear models won’t do. One such circumstance arises in my work, in which models are put to use for policy simulations. In that case, linear approximations that don’t reproduce crucial nonlinear features of a distribution can be a problem, if only in presentation (which is important).
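
    To make the mechanics of residual inclusion concrete, here is a minimal sketch on simulated data. It is my own illustration, not code from MHE or from Terza, Basu and Rathouz. All variable names and coefficients are assumptions, and the second stage is deliberately linear so that it runs with the standard library alone. In this linear case residual inclusion and prediction substitution give the same coefficient; the two approaches part ways only when the second stage is a nonlinear model, which is the case the paragraph above is about.

```python
import random

def ols(columns, y):
    """Least-squares coefficients of y on the given columns (normal equations)."""
    k = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(2)  # fixed seed so the sketch is reproducible
n = 5000
ones = [1.0] * n
z = [rng.gauss(0, 1) for _ in range(n)]   # instrument (hypothetical)
u = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
x = [0.8 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]  # endogenous treatment
y = [1.0 + 2.0 * xi + 1.5 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]  # assumed true effect: 2.0

# Stage 1: regress treatment on the instrument; keep fitted values and residuals.
g0, g1 = ols([ones, z], x)
x_hat = [g0 + g1 * zi for zi in z]
resid = [xi - xh for xi, xh in zip(x, x_hat)]

# Prediction substitution: replace x with its first-stage prediction.
beta_2sps = ols([ones, x_hat], y)[1]
# Residual inclusion: keep the actual x and add the first-stage residual.
beta_2sri = ols([ones, x, resid], y)[1]
# Naive regression that ignores the confounder.
beta_naive = ols([ones, x], y)[1]

print(round(beta_naive, 2), round(beta_2sps, 2), round(beta_2sri, 2))
```

    In a nonlinear application the last regression would be replaced by the nonlinear outcome model of interest, still with the actual treatment and the first-stage residual as regressors.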

    I’ll conclude by noting a large issue lurking in the background to which Angrist and Pischke only allude. That’s theory (by which I mean anything outside the data). What’s it for? Can one really conclude causality from data alone? The answer is “no,” but the reason is subtle. The topic almost arises twice, once in a discussion of how to decide whether a control variable is or is not an outcome variable. When one can’t use time to determine what can be the cause of what, then “clear reasoning about causal channels requires explicit assumptions about what happened first, or the assertion that none of the control variables are themselves caused by the regressor of interest.” (p. 68) That’s theory, folks.

    Later, on page 156 the authors write, “There is nothing in IV formulas to explain why [treatment] affects [outcomes]; for that, you need a theory.” OK then! Theory has a role. In fact, its role is larger than implied by these quotes. I assert that one can’t begin to understand if or when selection on observables or unobservables (or endogeneity in general) might occur without theory. Put another way, the model one chooses to estimate and the manner in which one does so come in part from theory, a point stressed by Andrew Gelman in his review of MHE (a review worth reading, by the way).

    In many cases, that theory is our own intuition, not some formal mathematical model. We know something about the world, about what can affect what, that we bring to the data. Without those extra-data notions, we wouldn’t even know what to study or how, let alone how to interpret what we find. I think this is something applied economists should appreciate. The data can reveal the size of causal effects, but only after we have decided what can cause what. Without such ideas, finding potentially valid instruments would be next to impossible. If you don’t believe me, next time you approach your analysis, ask a colleague to rename all your variables v1, v2, v3, etc. (and not provide you with a crosswalk to their actual names). Good luck!

    Later: See also the Mostly Harmless blog.

    References

    Terza JV, Basu A, Rathouz PJ. Two-stage Residual Inclusion Estimation: Addressing Endogeneity in Health Econometric Modeling. J Health Economics. 2008;27:531-43.

  • Instrumental Variables vs. Randomized Trial

    I’ve made the claim that good observational studies of a medical therapy can be as informative as a randomized clinical trial (RCT). By a “good” observational study I mean one that handles the non-random selection of individuals into treatment appropriately, which often means using instrumental variables (IV). (Already lost? Read this.)

    One way to demonstrate that IV studies are as informative as RCTs is to show that results obtained either way are similar. Unfortunately, there are not many examples of health care treatments studied via both RCT and IV methods because use of the latter is rare in the field. Nevertheless, there are a few examples. Steve Pizer wrote about one in his tutorial paper on IV technique.

    A clearer focus on comparing methods was provided by Stukel et al. (2007), who used four different methods to assess the effects of cardiac catheterization on elderly patients hospitalized for acute myocardial infarction. … The investigators compared results from randomized trials to estimates from models using multivariate risk adjustment, propensity score risk adjustment, propensity score matching, and instrumental variables estimation featuring the regional cardiac catheterization rate as the identifying instrument. … Multivariable risk adjustment, propensity score risk adjustment, and propensity score matching all produced estimated reductions in mortality risk between 46 and 49 percentage points. Instrumental variables estimates were starkly different at 16 percentage points and compared more favorably to estimates from clinical trials, which ranged from 8 to 21 points. …

    So, the IV estimates were in the middle of the range found by RCTs. Meanwhile, estimates based on methods that can’t control for unobservable factors that affect selection and outcome (risk adjustment, propensity score techniques) produced results well outside the RCT range. That’s precisely what one would expect if one understands IV and why it is necessary.

    Stukel et al. (2007) comment on an earlier IV study of cardiac catheterization by McClellan, McNeil, and Newhouse (1994) that found a lower reduction in mortality risk using differential distances to alternative types of hospitals as instruments. Results of the two studies are not reported in the same metric so they are not immediately comparable. However, there is sufficient information to make at least an approximate conversion (hint: see the asterisk footnote of Table 5 of Stukel et al. that provides a formula to approximately convert between an absolute mortality difference and a relative mortality rate). Doing so reveals that McClellan, McNeil, and Newhouse report an 8.5% reduction in mortality risk, roughly half that of Stukel et al., though still within the 8-21% range of RCTs.

    Stukel et al. attribute the difference in results between the two IV studies to differences in the degree to which instruments predict treatment, suggesting that the earlier study’s results may be biased downward due to weak instruments. McClellan, McNeil, and Newhouse note that the mortality reduction they find is “achieved during the first day of hospitalization and therefore appears attributable to treatments other than the procedures.” (See also Newhouse and McClellan 1998.)

    IV and RCT results compare favorably in studies of the effects of smoking by pregnant women on their child’s birth weight. Evans and Ringel (1999) use cigarette taxes as an instrument for smoking and find that birth weight is lower by 353-594 grams, depending on model specification. Results from an RCT on prenatal care that included a smoking cessation component put the figure at 400 grams. Results for indicators of low (< 2,500 grams) and very low (< 1,500 grams) birth weight are also similar between the IV- and RCT-based studies.

    More thorough analyses of randomized vs. observational design results are found outside of health services research. For example, Cook, Shadish, and Wong (2008) compare randomized versus observational results from twelve job training and education program evaluations.

    Of the 12 recent within-study comparisons reviewed here from 10 different research projects … eight of the comparisons produced observational study results that are reasonably close to those of their yoked experiment, and two obtained a close correspondence in some analyses but not others. Only two studies claimed different findings in the experiment and observational study, each involving a particularly weak observational study. Taken as a whole, then, the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within study comparison literature.

    Of the observational studies that did produce results comparable to experimental counterparts, one involved IV and three exploit quasi-randomness akin to that upon which IV relies (regression-discontinuity). The unavoidable conclusion is that observational studies for which sources of exogenous randomness can be identified produce results comparable to those that might be obtained from a randomized controlled experiment.

    References

    Thomas D. Cook, William R. Shadish, Vivian C. Wong. (2008). Three Conditions under Which Experiments and Observational Studies Produce Comparable Causal Estimates: New Findings from Within-Study Comparisons. Journal of Policy Analysis and Management, Volume 27, Issue 4 (p 724-750).

    William N. Evans, Jeanne S. Ringel. Can higher cigarette taxes improve birth outcomes? Journal of Public Economics 72 (1999) 135–154.

    McClellan M, McNeil BJ, Newhouse JP. Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? analysis using instrumental variables. JAMA. 1994;272:859-866.

    Newhouse JP, McClellan M. Econometrics in outcomes research: the use of instrumental variables. Annu Rev Public Health. 1998;19:17-34.

    Stukel, T.A., Fisher, E.S., Wennberg, D.E., Alter, D.A., Gottlieb, D.J., Vermeulen, M.J.: Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA 297(3), 278–285 (2007). doi:10.1001/jama.297.3.278

  • The Evidence on Salt

    Thanks go to BradF for some pointers to the evidence and debate over salt’s effect on health. He directed me to Marion Nestle of the blog Food Politics who links to a dozen or so articles and commentaries on the issue. Disclosure: I have not read those articles and commentaries.

    I have read the 5 Feb. 2009 NY Times opinion piece in which Michael Alderman argues that a randomized controlled trial is necessary to determine salt’s effect on health. (Alderman is also the author of a 3 Feb. 2010 JAMA commentary that makes the same points and cites the academic literature.) In his NY Times column, Alderman writes,

    The best available evidence on how salt consumption affects our health comes from observational studies. … Nine such studies … have had mixed results. In four of them, reduced dietary salt was associated with an increased incidence of death and disability from heart attacks and strokes. In one that focused on obese people, more salt was associated with increased cardiovascular mortality. And in the remaining four, no association between salt and health was seen.

    People who advocate curtailing salt consumption typically prefer to discuss two other observational studies from Finland and Japan, where salt consumption is generally higher than in the United States. In both of these, more salt was associated with more cardiovascular problems.

    But observational studies do not demonstrate causality. …

    Nevertheless, the research on salt intake can help identify questions to address in randomized clinical trials, the most rigorous kind of medical research.

    Since I haven’t read it, I’m not going to comment on the evidence pertaining to a salt-health connection. I will make two points about Alderman’s view. First, I agree that observational studies are helpful in identifying hypotheses and issues to be explored in future studies, including randomized trials. So, even when observational studies don’t provide conclusive evidence, they can make scientific contributions.

    I disagree with Alderman’s statement that observational studies don’t demonstrate causality. That’s what most people seem to think, but it is far too broad. As I’ve been writing for weeks, research designs that exploit randomness, whether purposeful (as in a randomized trial) or not (as in natural experiments), permit causal inference. Observational studies that exploit random factors that affect treatment but not potential outcomes do demonstrate causality. They differ from randomized trials only in degree.

    One possible advantage of cities imposing reduced salt regulations on restaurants is that it provides the type of randomness one can exploit to infer the causal effect of salt on health. I’m not necessarily advocating such regulations. Nor am I saying there should not be a randomized trial on salt. I’m just saying that we may be able to learn a few things even without a randomized trial. In particular, I’m saying, screaming, don’t give up on observational studies. Only some of them are uninformative about causal effects. Properly designed ones can reveal more than Alderman, and others, may think.

  • Limitations of Randomized Trials

    A 1998 paper by Joe Newhouse and Mark McClellan in the Annual Review of Public Health is insightful on the limitations of randomized controlled trials (RCTs) of clinical interventions. They note that a trial can only occur in the

    window of time where there is enough belief that a treatment is efficacious so that it is considered ethical to randomize patients to the treatment group but not sufficient belief in the efficacy of the treatment that it would be considered unethical to withhold the treatment.

    Even when a trial can be conducted, its results may lack generality because it is conducted in a setting (major teaching hospital, say) that differs from the norm. That is,

    the trial may demonstrate that a procedure is efficacious (i.e. obtains desired results under optimal conditions), but it will not necessarily show that it is effective (obtains desired results under typical or standard conditions). Or, somewhat related to this point, in the time since the trial was conducted physicians may have become better at performing a procedure, such that the results of the trial are no longer relevant to current practice. (Bold mine.)

    As if that weren’t enough, sometimes the population included in a trial differs from that which will be (or is) actually treated generally. Women, children, the elderly, and individuals with comorbidities are among the sub-populations historically excluded from certain types of trials. That is,

    the results of the trial may have internal validity (comparisons between the treatment and control groups are unbiased for the population being studied) but not external validity (results do not necessarily apply to other populations). (Bold mine.)

    For all these reasons, well-designed observational studies can enhance our understanding of treatment outcomes by examining results over a broader setting and population than might be available in an RCT. Moreover, observational studies are less expensive, can be conducted more quickly, and are applicable in cases for which an RCT would be unethical or impractical, due to problems of recruitment, for example. Thus, as Newhouse and McClellan put it, “[t]he results of a well-designed observational study are useful even if the results of a clinical trial are available.”

  • Instrumental Variable Corrected Randomized Trial

    Perhaps you’re of the mind that the only way to learn anything of value is by randomized controlled trial (RCT). I disagree with that position. For some things, RCTs are the right approach. But I think the topics of study that can’t benefit from some sound observational study, perhaps in preparation for a future RCT, are few. And then there are many topics and questions that can’t be studied by RCT due to methodological, ethical, practical, or financial considerations. So, there is plenty of room for observational studies and, given that, one ought to use the best available techniques, including instrumental variables (IV).

    However, even if one only wishes to do RCTs, IV can assist. Very often the perfect randomization contemplated by the researcher is compromised. Some assigned to the treatment group don’t comply. Some assigned to the control group receive treatment. When randomization breaks down things get messy, and it may not be clear what can be learned. Colleagues and I faced this very problem on a study of the Community Nursing Organization Demonstration years ago. We solved it by analyzing the randomized groups using the notion of “intent to treat” (ITT), essentially ignoring the contamination of treatment and control groups, arguing that it is akin to what would happen in the real world anyway. We supplemented the analysis by comparing those treated to a comparison group not involved in the study.

    But we could have done something else, and I wish we had. We could have considered the random assignment as an instrument for actual receipt of treatment. One has to admit, it is a very good instrument. It is highly correlated with actual receipt of treatment (since most subjects comply) and it is not related to outcomes except through treatment (which is the whole point of random assignment).

    The math and statistics of this approach are very straightforward. It’s all explained in a 2006 paper by Angrist in the Journal of Experimental Criminology (and elsewhere). I won’t go into the details. Suffice it to say, even if you love RCTs and only RCTs, sooner or later you’ll come across one for which randomization has failed. In that case, IV can assist. The method is credible, sensible, and sound. Moreover, it fully exploits the beautiful properties of the randomness with which the RCT was designed.

    What one obtains with such IV-corrected RCT analysis is an unbiased estimate of the causal effect of treatment on those whose treatment status was affected by randomization (called “compliers”). This estimate is known as the local average treatment effect (LATE). In the case of an RCT of a therapy that can’t be obtained outside the experiment and in which no individual in the control group received treatment (so that all individuals who received treatment did so due to randomization and would not have otherwise), one can obtain the IV-corrected treatment effect by dividing the ITT treatment effect by the probability of treatment assignment compliance. In this simplified (but common) setting, this calculation also provides the average treatment effect on the treated (ATET). It is clear from this example that the ATET differs from an ITT estimate when compliance with treatment assignment is not perfect: because the ITT effect is divided by a compliance probability less than one, the ATET is larger in magnitude than the ITT effect. Also, in this example, but not in general, the LATE is the ATET.
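
    The division described above is easy to write down. The sketch below is only an illustration: the function name, its arguments, and the numbers are hypothetical, and the optional third argument generalizes slightly to the case in which some control-group members obtain treatment, so that the divisor becomes the difference in take-up between the two arms.

```python
def late_from_itt(itt_effect: float,
                  share_treated_if_assigned: float,
                  share_treated_if_control: float = 0.0) -> float:
    """Rescale an intent-to-treat effect by the take-up gap between arms.

    With one-sided noncompliance (no control-group member is treated),
    the divisor is simply the compliance rate in the treatment arm.
    """
    take_up_gap = share_treated_if_assigned - share_treated_if_control
    if take_up_gap <= 0:
        raise ValueError("assignment must raise take-up for the ratio to be meaningful")
    return itt_effect / take_up_gap


# Hypothetical numbers: an ITT effect of 2.0 outcome units, 80% of those
# assigned to treatment actually receive it, nobody in the control arm does.
late = late_from_itt(itt_effect=2.0, share_treated_if_assigned=0.8)
print(late)
```

    With these made-up inputs the rescaled effect is about 2.5, larger than the ITT effect of 2.0 because only 80% of those assigned were actually treated.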

    The key point is that the IV-corrected estimate is as valid, meaningful, and useful as the ITT estimate. It’s just an answer to a different question. ITT examines the effect of the intervention on a population, including that due to lack of compliance. IV techniques provide an estimate of the effect of treatment on those who comply (LATE). In the case for which compliance is one-sided (no one in the control group received treatment and everyone who was treated was randomized as such), LATE = ATET.

  • IO vs Labor Economics

    The response by Nevo and Whinston to the forthcoming Journal of Economic Perspectives paper by Angrist and Pischke (about which I’ve been writing) is illuminating. It clearly delineates some key differences between industrial organization (IO) and labor economics. Some key passages:

    However, empirical analysis must deal not only with credible inference, but also with what might be called “generalization,” “extrapolation,” or “external validity”. … This is where structural analysis comes in. Structural analysis is not a substitute for credible inference. Quite to the contrary, in general, structural analysis and credible identification are complements. …

    Structural analysis gives us a way to relate observations of responses to changes in the past to predict the responses to different changes in the future.

    It does so in two basic steps. First, it matches observed past behavior with a theoretical model to recover fundamental parameters such as preferences and technology. Then, the theoretical model is used to predict the responses to possible environmental changes, including those that have never happened before, under the assumption that the parameters are unchanged. …

    Empirical work in industrial organization does differ in some striking ways from that in labor (and other fields that emphasize estimation of treatment effects). We have discussed extensively one important difference, the heavier reliance on structural modeling (and greater attention to issues this raises) in industrial organization, but this is not the only difference.

    Empirical papers in industrial organization are also less likely than are papers in labor to focus on pinning down a particular “number”–like an elasticity or a price effect. Many structural papers in industrial organization, for example, are focused on showing that an approach to answering a question is feasible.

    The paper is worth a full read.

  • Leamer’s EconTalk Interview

    Ed Leamer’s EconTalk conversation with Russ Roberts this week was among the more interesting episodes and well worth a listen. Much of it focused on issues raised in my summary of Angrist’s and Pischke’s response to Ed Leamer’s 1983 American Economic Review paper titled Let’s Take the Con Out of Econometrics. By way of quick review,

    Leamer and others in the early 1980s were distressed by the lack of testing of implications of assumptions in specification and functional form of econometric models. His proposed solution was to analyze the changes in results based on model variations (sensitivity analysis). Angrist and Pischke make a strong case that Leamer was correct in his diagnosis but not necessarily in his prescription. They argue that the “credibility revolution” experienced in empirical microeconomics since Leamer’s critique is due principally to a greater focus on research design not on sensitivity analysis.

    Angrist and Pischke argue that methodological innovations that exploit purposeful or natural randomness, including instrumental variables (IV) methods, are responsible for taking the con out of econometrics.

    That doesn’t mean sensitivity analysis doesn’t have a role. On EconTalk, Leamer made a good argument that it is still relevant, important, and rare. He notes that the model published in an economics paper is just one of the many the authors probably estimated. Did they report the only one that worked, or did many produce similar results? Is the result fragile or robust? These are important questions, and the reader cannot answer them with what is provided in a typical empirical economics paper. (See also Leamer’s written counterpoint to Angrist and Pischke and other responses found on MIT’s website.)

    Nevertheless, exploiting randomness does help make econometric models more credible because (1) it removes much of the ambiguity in making causal inferences and (2) it reduces the number of needed control variables. In the case of a randomized design in which subjects are randomly assigned to treatment or control groups without any tilt to selection, no controls are required. The simple difference in means is the estimate of the average treatment effect (or the “intent to treat” effect). There are no other econometric specifications to explore to estimate this quantity. Thus, the result is trivially robust to specification.
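
    As a small illustration of how little specification is involved, the sketch below simulates a coin-flip assignment and estimates the effect with nothing more than a difference in means. The effect size, sample size, and outcome distribution are all assumptions chosen for the example.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 1.5  # assumed treatment effect used to simulate outcomes


def simulate_subject():
    """Return (assigned_to_treatment, outcome) for one hypothetical subject."""
    baseline = rng.gauss(10.0, 2.0)   # differs across subjects
    treated = rng.random() < 0.5      # coin-flip assignment, no tilt
    outcome = baseline + (TRUE_EFFECT if treated else 0.0)
    return treated, outcome


subjects = [simulate_subject() for _ in range(20000)]
treated_outcomes = [y for d, y in subjects if d]
control_outcomes = [y for d, y in subjects if not d]

# With random assignment, no control variables are needed:
# the difference in group means is the estimate.
ate_estimate = mean(treated_outcomes) - mean(control_outcomes)
print(round(ate_estimate, 2))
```

    There is no list of candidate regressors to search over here, which is the sense in which the estimate is trivially robust to specification.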

    Similarly, in IV estimation, one can omit variables without loss of validity, though with some loss of precision, with the exception of any required for the conditional independence of the instrument and potential outcomes (the set of “conditioning variables”, which could be empty). This fact automatically increases the range of robustness of the results if they are significant without any but the conditioning variables included as controls. Of course it relies on the validity of instruments, which is asserted (with a good argument), tested (where possible), or both. This turns much of the robustness exercise into stipulation of assumptions, which is far more compact and easier to assess.

    I believe Leamer agrees with these points because when asked on EconTalk about whether exploiting randomness and IV help address robustness, he did not deny that they did. Instead he turned to the issue of generality. He pointed out that the results of an IV study don’t necessarily generalize to other settings. That’s true. But it’s true of any study, even randomized trials. It isn’t necessarily related to the robustness issue. Leamer makes different points in his written response to Angrist and Pischke, focusing on limitations of asymptotic results. He writes that Angrist and Pischke “persuasively argue [that] either purposefully randomized experiments or accidentally randomized ‘natural’ experiments can be extremely helpful, but [they] … overstate the potential benefits of the approach.”

    All of these points–Leamer’s and Angrist’s and Pischke’s–are valid and important. They all relate to the key underlying question: when does a fact reveal a larger truth? Empirical results, no matter how obtained, provide some facts about the data. Can one apply those facts more broadly? Where’s the boundary of validity in doing so? Nobody can answer such questions generally. One hopes (or one should hope) that the range of truth revealed by econometric facts is broad enough to be of use. If not, we’re all in trouble.

    Any attempt to increase the range of validity of econometric results should be applauded. Any assertion that all econometrics is not to be trusted is overly broad. Some econometrics may still include some degree of “con,” but with correct application of modern technique a substantial amount can be driven out.

  • Instrumental Variable History and Intuition

    Joshua Angrist has many papers explaining and using instrumental variables (IV). In a 2001 Journal of Economic Perspectives (JEP) article with Alan Krueger he digs into IV’s history and provides some intuition for the method (official and ungated versions).

    According to Angrist and Krueger, P.G. Wright (1928) deserves credit for the first use of IV in estimating supply and demand elasticities of flaxseed. But, “Wright’s econometric advance went unnoticed by the subsequent literature. Not until the 1940s were instrumental variables and related methods rediscovered and extended.” In 1953 Theil developed two-stage least squares, the most common way to implement IV estimation.

    The rest, as one might say, is history. Only it took a long time for IV approaches to gain serious traction in many areas of applied economics. They’ve been widely used in labor economics for at least two decades, became popular in industrial organization in the last 15 years or so, and have had very little use in health services research. The increased use, where it has occurred, corresponds to greater availability of data and computational resources.

    Data and computer power exist in health services research as they do in labor or IO. So why has diffusion been slow in that field? Three reasons: (1) Randomized trials are often possible and they “crowd out” other modes of inquiry. (2) Economists are sparse in the field. And (3), the intuition of IV has not been successfully communicated to non-economist practitioners.

    An intuitive application of IV is found in Angrist and Krueger’s 1991 paper on the effect of school attendance on earnings, which the authors review in their JEP article. One might hypothesize that earnings are causally related to years of schooling: more school translates into higher pay. But the possibility exists that there are unobservable factors that relate to both time in school and earnings, like motivation and innate skill. Therefore, a naive estimation of the effect of years of education on earnings would produce biased results.

    A feature of state law provides an opportunity to avoid such bias. Most states require students to begin school the calendar year they turn six. They also require students to stay in school until age 16. With a cutoff of December 31, children born in the final quarter of the year begin school in September at about age 5.75. Those born in the first quarter of the year begin school in September at about age 6.75.

    Some subset of the population will quit school at age 16. By a student’s 16th birthday, she has had a number of years of schooling related to her month (or quarter) of birth. Angrist and Krueger make the crucial observation that “[b]ecause an individual’s date of birth is probably unrelated to the person’s innate ability, motivation or family connections (ruling out astrological effects), date of birth should provide a valid instrument for [length of] schooling.” That is, individuals are, in part, randomized by birth date to length of schooling (most obviously those who quit at age 16, though likely others as well).

    The figure below, reproduced from Angrist and Krueger’s JEP article, illustrates the relationship between years of education and quarter of birth for the cohort born in the 1930s (1 = first quarter, 2 = second quarter, etc.). In addition to quarter of birth, year of birth is also a factor, one that is easy to control for since it is observable.

    A-K1

    The following figure reveals that those born in the early quarters of each year in the 1930s tend to earn less in 1980 than those born in later quarters. Using quarter of birth as an instrument for length of schooling (while controlling for other observable factors of relevance) permits one to obtain an unbiased estimate of the effect on earnings of duration of schooling.

    A-K2

    If one accepts that birth date is unrelated to earnings except through its effect on years of school, the use of birth date as an instrument for years of school is essentially equivalent to a randomized trial. What are the chances one can actually randomize students to years of schooling and then find them about 50 years later to measure their earnings? Not large. Hence, this example illustrates how to obtain results equivalent to those of a randomized trial in a circumstance in which one is unlikely to occur.
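    To make the intuition concrete, here is a minimal simulation sketch. It is not Angrist and Krueger's data or code. Every number in it (the 10 percent return to schooling, the size of the birth-timing effect, the role of "ability") is an assumption chosen for illustration, and the birth-timing effect is deliberately exaggerated so the pattern is easy to see. With a single binary instrument, the IV estimate reduces to a ratio of covariances.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

TRUE_RETURN = 0.10  # assumed causal effect of one year of schooling on log earnings

rng = random.Random(0)
born_late, schooling, log_wage = [], [], []
for _ in range(50_000):
    ability = rng.gauss(0, 1)            # unobserved by the researcher
    z = 1 if rng.random() < 0.5 else 0   # hypothetical instrument: 1 = born late in the year
    # Schooling depends on the instrument AND on unobserved ability.
    s = 12 + 1.0 * z + 1.0 * ability + rng.gauss(0, 1)
    # Earnings depend on schooling and ability, but not directly on z.
    w = 1.0 + TRUE_RETURN * s + 0.3 * ability + rng.gauss(0, 0.5)
    born_late.append(z)
    schooling.append(s)
    log_wage.append(w)

# Naive regression slope: contaminated by ability, which moves both variables.
ols_estimate = cov(schooling, log_wage) / cov(schooling, schooling)

# IV estimate: uses only the variation in schooling induced by birth timing.
iv_estimate = cov(born_late, log_wage) / cov(born_late, schooling)

print(f"true effect  : {TRUE_RETURN:.3f}")
print(f"naive OLS    : {ols_estimate:.3f}")
print(f"IV (by birth): {iv_estimate:.3f}")
```

    In this setup the naive slope overstates the return to schooling because ability raises both schooling and earnings, while the IV ratio should land near the assumed true value. The result holds only because the simulated instrument has no direct path to earnings, which is exactly the assumption one has to defend in a real study.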

    References

    Angrist, Joshua D. and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” Quarterly Journal of Economics. November, 106:4, pp. 979–1014.

    Theil, H. 1953. “Repeated Least Squares Applied to Complete Equation Systems.” The Hague: Central Planning Bureau.

    Wright, Phillip G. 1928. The Tariff on Animal and Vegetable Oils. New York: MacMillan.

  • Another Introduction to Instrumental Variables

      0 comments

    This is a follow-up to my earlier post on techniques for observational studies. Sarah Hamersma, University of Florida Assistant Professor of Economics, has written a very nice set of lecture notes introducing instrumental variables. Anyone who understands ordinary least squares regression ought to be able to follow it. If you read it, be sure to stick with it all the way through until the “Pitfalls” section at the end. The points made there are crucial.

    One thing not quite right is that Hamersma writes that there is no way to test for bad instruments, by which she means ones that fail to be exogenous (uncorrelated with u in her equation (3)). That’s not universally true. One can perform a test of “overidentifying restrictions” in cases for which there is more than one instrument. This test is described in Steve’s paper, though it can be found elsewhere.

  • Observational Studies of Comparative Effectiveness

      5 comments

    This post has been cited in the 13 May 2010 edition of Health Wonk Review and is co-authored by Steve Pizer and Austin Frakt. The ideas are based on Steve’s 2009 paper “An intuitive review of methods for observational studies of comparative effectiveness.”

    Making causal inferences in observational studies is more challenging than in randomized experiments. But econometric and statistical techniques have now improved to the point that a knowledgeable practitioner can draw causal conclusions from sound observational research. Though these techniques have already been employed in economics, they have not been widely applied or appreciated in health services research. Given their utility and ease of application, that should end.

    Randomized clinical trials support causal inferences with relatively uncomplicated statistical methods because they minimize the risk of selection bias. Perfect randomization ensures that treatment and comparison groups do not differ systematically in any dimensions except for receipt of treatment, so differences in outcomes can be cleanly attributed to treatment.

    Unfortunately, causal inferences in observational studies are much more difficult to make. The root of the problem is that patients (or their doctors) choose treatment themselves, using information that is often inaccessible to the researcher. Some factors that influence the treatment decision may be observable to the researcher and can be accounted for by including measures of them in the outcome model (e.g. risk adjustment models). Other unobservable factors, if correlated with both the choice of treatment and the outcome, will lead to biased estimates of the treatment effect unless specialized techniques are used. (Note, this post is couched in terms of a clinical study where the effect of treatment is of interest. That’s not critical, however. The techniques described are far more general than the language may suggest. Think of “treatment” as the main independent variable, categorical or continuous. All that’s important is that it and the dependent variable are both correlated with unobservable factors.)

    Two techniques for mitigating bias when treatment and outcome are both correlated with unobservables are propensity score matching and instrumental variables (IV) estimation. Propensity score matching is especially useful for small studies where extensive and customized data collection is feasible, minimizing the likelihood of important variables remaining unobserved. IV estimation is superior in cases where important variables are known to be unobserved, as is commonly the case in large studies of administrative data, provided that suitable instruments can be identified. A suitable instrument is a variable that has a strong influence on the likelihood of receiving treatment, but, apart from its effect on treatment, has no direct effect on the outcome.

    Local practice pattern variations make natural instruments and have been successfully applied in several recent health services studies. Stukel and colleagues (2007) assessed the effects of cardiac catheterization on elderly patients hospitalized for acute myocardial infarction. The methodological concern was that patients in poorer health were less likely to receive invasive care, potentially making the effects of treatment look better than they actually were. The investigators used the regional cardiac catheterization rate as the instrument in this study.

    Brookhart and Schneeweiss (2007) recommend practice patterns computed at the individual provider level as instruments for observational comparative effectiveness studies. This approach has been used to evaluate changes in risk of gastrointestinal complications and acute myocardial infarction associated with use of Cox-2 inhibitors compared to nonselective NSAIDs (Brookhart, et al. 2006, Schneeweiss, et al. 2006).

    Certain conditions must be met for IV estimation to produce valid estimates. First, the instrument(s) must be strongly enough associated with the selection of treatment (Bound, et al. 1995). Staiger and Stock (1997) have established statistical tests and threshold values for evaluating the strength of instruments. Second, the instruments should not be associated with the outcome except through their effect on treatment. This condition can only be tested if more than one instrument is available. In such cases overidentifying restrictions tests can be constructed to determine whether it is valid to exclude the instrument from the outcome model (Davidson and MacKinnon 1993). This is a reason to use both geographic practice patterns and individual provider-level practice patterns as instruments where possible.
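    Both checks can be sketched in a few lines. The example below is illustrative only: the data are simulated, the two instruments (labelled as a regional rate and a provider preference) and all coefficients are assumptions, and the helper functions are hypothetical rather than taken from any package. It computes the first-stage F statistic for the instruments and a Sargan-type overidentification statistic (sample size times the R-squared from regressing the two-stage least squares residuals on the instruments), first with two valid instruments and then with one instrument that affects the outcome directly.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Regress y on an intercept plus the given columns; return (coefs, residuals)."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    return beta, resid

def centered_ss(v):
    """Sum of squared deviations from the mean."""
    mean = sum(v) / len(v)
    return sum((x - mean) ** 2 for x in v)

def iv_diagnostics(direct_effect, n=4000, seed=1):
    """Simulate data, then return (first-stage F, 2SLS estimate, Sargan stat, p-value).

    direct_effect is the assumed direct effect of the second instrument on the
    outcome; any nonzero value makes that instrument invalid.
    """
    rng = random.Random(seed)
    z1, z2, x, y = [], [], [], []
    for _ in range(n):
        confounder = rng.gauss(0, 1)   # unobserved, e.g. illness severity
        a = rng.gauss(0, 1)            # hypothetical instrument 1: regional rate
        b = rng.gauss(0, 1)            # hypothetical instrument 2: provider preference
        t = 0.8 * a + 0.6 * b + confounder + rng.gauss(0, 1)
        z1.append(a)
        z2.append(b)
        x.append(t)
        y.append(1.0 * t + 2.0 * confounder + direct_effect * b + rng.gauss(0, 1))

    # First stage: treatment on instruments; F tests the two instruments jointly.
    _, v = ols([z1, z2], x)
    ssr = sum(e * e for e in v)
    f_stat = ((centered_ss(x) - ssr) / 2) / (ssr / (n - 3))
    x_hat = [xi - vi for xi, vi in zip(x, v)]

    # Second stage: outcome on fitted treatment gives the 2SLS coefficient.
    (alpha, beta), _ = ols([x_hat], y)
    u = [yi - alpha - beta * xi for xi, yi in zip(x, y)]  # residuals use actual x

    # Sargan-type statistic: n * R^2 from regressing residuals on the instruments.
    _, w = ols([z1, z2], u)
    sargan = max(0.0, n * (1 - sum(e * e for e in w) / centered_ss(u)))
    # Two instruments, one treatment: 1 degree of freedom, so the chi-square
    # tail probability can be written with the complementary error function.
    p_value = math.erfc(math.sqrt(sargan / 2))
    return f_stat, beta, sargan, p_value

f_ok, beta_ok, sargan_ok, p_ok = iv_diagnostics(direct_effect=0.0)
f_bad, beta_bad, sargan_bad, p_bad = iv_diagnostics(direct_effect=1.0)
print(f"valid instruments:  F={f_ok:.0f}  beta={beta_ok:.2f}  Sargan={sargan_ok:.2f}")
print(f"one bad instrument: F={f_bad:.0f}  beta={beta_bad:.2f}  Sargan={sargan_bad:.2f}")
```

    A common rule of thumb drawn from Staiger and Stock's work is to be wary of instruments whose first-stage F statistic falls below about 10. A large overidentification statistic signals that at least one instrument is suspect, though it cannot say which one, and the test has no power if all instruments are invalid in the same direction.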

    IV estimation and associated statistical tests can be performed easily with most statistical software packages (e.g. Stata) when outcomes are modeled by ordinary least squares regression. When outcomes are modeled by nonlinear functions like logistic regression or duration models then IV estimates are more difficult to obtain. However, Terza, Basu and Rathouz (2008) derived consistent two-step instrumental variables estimators for use in these situations.
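    The two-stage residual inclusion idea can be sketched as follows: fit the first stage as usual, then add the first-stage residual as an extra regressor in the nonlinear outcome model. This is a rough illustration on simulated data, not the authors' implementation. The data-generating values, the hand-rolled logistic fit, and all function names are assumptions, and the sketch does not adjust the second-stage standard errors for the estimated first stage.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logit(rows, y, iterations=25):
    """Logistic regression by Newton-Raphson; each row must start with 1.0."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, r)))
            weight = p * (1.0 - p)
            for i in range(k):
                grad[i] += r[i] * (yi - p)
                for j in range(k):
                    hess[i][j] += weight * r[i] * r[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta

# Simulated data: a confounder raises treatment intensity and lowers the
# probability of a good outcome; z shifts treatment but not the outcome.
rng = random.Random(2)
z, x, y = [], [], []
for _ in range(4000):
    confounder = rng.gauss(0, 1)
    zi = rng.gauss(0, 1)
    xi = zi + confounder + rng.gauss(0, 1)
    index = -0.5 + 1.0 * xi - 1.5 * confounder
    z.append(zi)
    x.append(xi)
    y.append(1 if rng.random() < sigmoid(index) else 0)

# Stage 1: linear regression of treatment on the instrument; keep residuals.
n = len(z)
mean_z, mean_x = sum(z) / n, sum(x) / n
slope = (sum((a - mean_z) * (b - mean_x) for a, b in zip(z, x))
         / sum((a - mean_z) ** 2 for a in z))
resid = [xi - (mean_x + slope * (zi - mean_z)) for zi, xi in zip(z, x)]

# Stage 2: logit of outcome on treatment, with and without the residual.
naive = fit_logit([[1.0, xi] for xi in x], y)
tsri = fit_logit([[1.0, xi, vi] for xi, vi in zip(x, resid)], y)

print(f"naive logit, treatment coefficient : {naive[1]:.2f}")
print(f"2SRI logit, treatment coefficient  : {tsri[1]:.2f}")
print(f"2SRI logit, residual coefficient   : {tsri[2]:.2f}")
```

    In this simulation the residual soaks up part of the unobserved confounder, so its coefficient is far from zero and the treatment coefficient moves toward the value used to generate the data. In applied work the standard errors would typically be obtained by bootstrapping both stages together.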

    In general, the statistical and computational barriers to IV estimation are low, and the chief challenge is conceptual. Finding good instruments can be difficult and take some creativity. When they are found, application of appropriate technique leads to valid causal inference. Thus, given the cost and challenges of randomized trials, IV estimation is a valuable tool for comparative effectiveness research.

    References

    Pizer SD. An intuitive review of methods for observational studies of comparative effectiveness. Health Serv Outcomes Res Method 2009; 9:54–68.

    Stukel TA, et al. Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA 2007; 297(3): 278-85.

    Brookhart MA, Schneeweiss S. Preference-based instrumental variable methods for the estimation of treatment effects: Assessing validity & interpreting results. International J Biostatistics 2007; 3(1): Article 14.

    Brookhart MA, Wang PS, Solomon DH, Schneeweiss S. Evaluating Short-Term Drug Effects Using a Physician-Specific Prescribing Preference as an Instrumental Variable. Epidemiology 2006; 17(3): 268-75.

    Schneeweiss S, et al. Simultaneous Assessment of Short-Term Gastrointestinal Benefits and Cardiovascular Risks of Selective Cyclooxygenase 2 Inhibitors and Nonselective Nonsteroidal Antiinflammatory Drugs: An Instrumental Variable Analysis. Arthritis Rheum 2006; 54(11): 3390–8.

    Bound J, et al. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J Am Statistical Assoc 1995; 90: 443-50.

    Staiger D, Stock J. Instrumental variables regression with weak instruments. Econometrica 1997; 65: 557-86.

    Davidson R, MacKinnon JG. Estimation and Inference in Econometrics. Oxford University Press, NY, 1993.

    Terza JV, Basu A, Rathouz PJ. Two-stage Residual Inclusion Estimation: Addressing Endogeneity in Health Econometric Modeling. J Health Economics 2008; 27: 531-43.
