In “Instruments, Randomization, and Learning about Development,” Angus Deaton pulls no punches. He’s just as brutal, blunt, and precise about the pitfalls and misuse of instrumental variables (IV) as he is about those of randomized controlled trials (RCTs).
I found insightful his emperor-has-no-clothes argument that the RCT is not deserving of its “gold standard” reputation, despite rhetoric to the contrary. I speculate RCTs have achieved their special status for several reasons:
- They are relatively conceptually simple, requiring less mathematical and statistical training than is required of many other methods. (Though, the basic explanation of them hides a lot of complexity, which leads to improper use and interpretation, as Deaton shows.)
- RCTs address the problem of confounding from unobservables (though this fact is not unique to RCTs), which, historically, has been a major impediment to causal inference in social sciences and in the advancement of medicine. (As Deaton explains, such confounding is not the only problem confronting empirical methods, and RCTs do not necessarily address the others better than nonexperimental methods.)
- RCTs lend themselves to a routinized enterprise of evidence-based change (e.g., in medicine) in a way that other strong methods for causal inference do not (or not yet). Equivalently simple approaches that could be easily routinized offer far weaker support for causal inference. It is plausible to me that promotion of RCTs as the methodologically strongest approach to causality has spared us from many more studies of associations that can’t come even close to RCTs’ validity for causal inference, imperfect though it may be. It’s possible association-type studies could do a lot of damage to human welfare. (Evidence-based, pre-RCT medicine was pretty sketchy, for example.) This, perhaps, is the strongest moral justification for claiming that RCTs are “the gold standard,” even if they do not merit that unique standing: a world in which that is less widely believed could be much worse.
- Perhaps because of the foregoing features of RCTs, they have been adopted as the method of choice by high-powered professionals and educators in medical science (among other areas). When one is taught and then repeats that RCTs are “the gold standard” and one is a member of a highly respected class, that view carries disproportionate weight, even if there is a very good argument that it is not necessarily the correct view (i.e., Deaton’s, among others). Another way to say this is that the goldenness of RCTs’ hue should be judged on the merits of each application; we should be careful not to attribute to RCTs a goldenness present in the tint of glasses we’ve been instructed to wear.
Let me be clear: Deaton is not claiming (nor am I) that some other method is better than RCTs. He is simply saying that no single method (RCTs, say) deserves preferential status, superior to all others for all subjects and all questions about them. I agree: there is no gold standard.
At the same time, applying some standards in judging methodology is necessary. How this ought to be done varies by context. Official bodies charged with guarding the safety of patients (e.g., the FDA or the USPSTF) are probably best served with some fairly hard-and-fast rules about how to judge evidence. Too much room for judgement can also leave too much room for well-financed charlatans to sneak some snake oil through the gate.
Academics and the merit review boards that judge their research proposals or the referees that comment on their manuscripts have more leeway. My view in this context is that a lot rides on the precise question one is interested in, the theoretical or conceptual model one (or the community of scholars) thinks applies to it, and the data available to address it, among other possible constraints. This is not a set-up for a clean grading system; there’s no substitute for expertise and opinions will vary. These are major limitations of accepting that there is, in general, no hierarchy of methodological quality.
Below are my highlights from Deaton’s paper, with my emphasis added. Each bullet is a direct quote.
- [Analysts] go immediately to the choice of instrument, over which a great deal of imagination and ingenuity is often exercised. Such ingenuity is often needed because it is difficult simultaneously to satisfy both of the standard criteria required for an instrument, that it be correlated with [treatment] and uncorrelated with [unobservables affecting outcomes]. [...] Without explicit prior consideration of the effect of the instrument choice on the parameter being estimated, such a procedure is effectively the opposite of standard statistical practice in which a parameter of interest is defined first, followed by an estimator that delivers that parameter. Instead, we have a procedure in which the choice of the instrument, which is guided by criteria designed for a situation in which there is no heterogeneity, is implicitly allowed to determine the parameter of interest. This goes beyond the old story of looking for an object where the light is strong enough to see; rather, we have at least some control over the light but choose to let it fall where it may and then proclaim that whatever it illuminates is what we were looking for all along.
- Angrist and Jörn-Steffen Pischke (2010) have recently claimed that the explosion of instrumental variables methods has led to greater “credibility” in applied econometrics. I am not entirely certain what credibility means, but it is surely undermined if the parameter being estimated is not what we want to know.
- Passing an overidentification test does not validate instrumentation. [Here's why.]
- The value of econometric methods cannot and should not be assessed by how closely they approximate randomized controlled trials. [...] Randomized controlled trials can have no special priority. Randomization is not a gold standard because “there is no gold standard.” Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as “hard” while other methods are “soft.” These rhetorical devices are just that; metaphor is not argument, nor does endless repetition make it so.
- One immediate consequence of this derivation is a fact that is often quoted by critics of RCTs, but often ignored by practitioners, at least in economics: RCTs are informative about the mean of the treatment effects but do not identify other features of the distribution. For example, the median of the difference is not the difference in medians, so an RCT is not, by itself, informative about the median treatment effect, something that could be of as much interest to policymakers as the mean treatment effect. It might also be useful to know the fraction of the population for which the treatment effect is positive, which once again is not identified from a trial. Put differently, the trial might reveal an average positive effect although nearly all of the population is hurt with a few receiving very large benefits, a situation that cannot be revealed by the RCT.
- How well do actual RCTs approximate the ideal? Are the assumptions generally met in practice? Is the narrowness of scope a price that brings real benefits or is the superiority of RCTs largely rhetorical? RCTs allow the investigator to induce variation that might not arise nonexperimentally, and this variation can reveal responses that could never have been found otherwise. Are these responses the relevant ones? As always, there is no substitute for examining each study in detail, and there is certainly nothing in the RCT methodology itself that grants immunity from problems of implementation.
- In effect, the selection or omitted variable bias that is a potential problem in nonexperimental studies comes back in a different form and, without an analysis of the two biases, it is impossible to conclude which estimate is better—a biased nonexperimental analysis might do better than a randomized controlled trial if enrollment into the trial is nonrepresentative.
- Running RCTs to find out whether a project works is often defended on the grounds that the experimental project is like the policy that it might support. But the “like” is typically argued by an appeal to similar circumstances, or a similar environment, arguments that depend entirely on observable variables. Yet controlling for observables is the key to the matching estimators that are one of the main competitors for RCTs and that are typically rejected by the advocates of RCTs on the grounds that RCTs control not only for the things that we observe but things that we cannot. As Cartwright notes, the validity of evidence-based policy depends on the weakest link in the chain of argument and evidence, so that by the time we seek to use the experimental results, the advantage of RCTs over matching or other econometric methods has evaporated. In the end, there is no substitute for careful evaluation of the chain of evidence and reasoning by people who have the experience and expertise in the field. The demand that experiments be theory-driven is, of course, no guarantee of success, though the lack of it is close to a guarantee of failure.
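To make concrete the quoted point that an RCT identifies the mean treatment effect but not the median effect or the share of people helped, here is a small sketch of my own (not from Deaton's paper). The ten units and their potential outcomes are invented for illustration. It lists both potential outcomes for every unit, which no real trial can observe; a trial with random assignment can recover only the separate distributions of treated and untreated outcomes.

```python
from statistics import fmean, median

# Hypothetical potential outcomes for 10 units (numbers invented for illustration).
# y0[i] is unit i's outcome without treatment; y1[i] is its outcome with treatment.
y0 = list(range(1, 11))                 # 1, 2, ..., 10
unit_effects = [30, 30] + [-1] * 8      # two units gain a lot, eight are slightly hurt
y1 = [base + eff for base, eff in zip(y0, unit_effects)]


def summarize(y0, y1):
    """Compare quantities a trial can recover with those it cannot."""
    effects = [b - a for a, b in zip(y0, y1)]
    return {
        # Recoverable from the two marginal distributions:
        "diff_in_means": fmean(y1) - fmean(y0),
        "diff_in_medians": median(y1) - median(y0),
        # Require the joint distribution of (y0, y1), never observed in a trial:
        "mean_effect": fmean(effects),      # equals diff_in_means by linearity
        "median_effect": median(effects),   # need not equal diff_in_medians
        "share_helped": sum(e > 0 for e in effects) / len(effects),
    }


result = summarize(y0, y1)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

In this made-up population the difference in means matches the mean effect and is positive, and even the difference in medians is positive, yet the median unit-level effect is negative and only 20 percent of units benefit. Nothing in the two marginal distributions alone reveals that.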
The paper is very readable, though I skipped (or lightly skimmed) a middle section that did not appear to have a high density of general advice, if any. There’s some math, but it’s simple and, in some places, important for understanding key points, including a few I quoted above. Only in one or two spots did I find the words insufficient to understand the meaning. Perhaps they were a bit too efficient. Find the paper, ungated, here.