The ideas in Uncontrolled, by Jim Manzi, are not only worth reading but worth contemplating deeply. The book has three parts which focus on (1) the history, theory, and philosophy of the scientific process, (2) the application of scientific and statistical methods to social science, (3) policy recommendations. In this post, I’m going to ignore (1), write only briefly about (3), and focus mostly on a few ideas in (2). My silence on other parts of the book implies neither that I endorse them nor that I think them unimportant.
The book is not pure randomized controlled trial (RCT) boosterism. Based on reading summaries by others, I expected the book to suggest we do more RCTs and only RCTs. David Brooks wrote, “What you really need to achieve sustained learning, Manzi argues, is controlled experiments.” Trevor Butterworth wrote, “The hero of Uncontrolled is the randomized controlled trial.”
But this is not exactly what the book is about. The great virtue of Uncontrolled is that it covers both strengths and limitations of experimental and nonexperimental study designs. For instance, Manzi summarizes some of Heckman’s critiques of RCTs, one of which boils down to threats to external validity. Manzi goes on to articulate how experimental and nonexperimental studies should work together in many areas of human pursuit, including policy evaluation. Some example passages:
But experiments [by a firm] must be integrated with other nonexperimental methods of analysis, as well as fully nonanalytical judgments, just as we saw for various scientific fields. [Page 142]
The randomized experiment is the scientific gold standard of certainty of predictive accuracy in business, just as it is in therapeutic medicine. […] [But] significant roles remain for other analytical methods. [Page 154]
About those roles:
[O]ne role for nonexperimental methods is as an alternative to an [RCT] when they are not practical. […] A second role for nonexperimental methods is for preliminary program evaluation. […] A third role for nonexperimental methods is to develop hypotheses that can be subsequently rigorously tested. [Pages 154-157]
Supporting more observational studies is not just turf protection. In light of the above, this passage felt counterproductive:
Analytical professionals resist using randomized experiments, because doing so renders previously valued skills less important. In the social sciences, for example, many of the exact features of [RCTs] that make them valuable—accuracy, simplicity, and repeatability—mean that they devalue the complex, high-IQ, and creative skills of mathematical modelers (at least for some purposes).
Maybe, though I know of no analytical professional who would not endorse the notion that RCTs, when and where possible, are the best way to make a wide range of, but not all, causal inferences. Moreover, very interesting methodological challenges arise from RCTs, in part because they’re rarely without some imperfections that can benefit from some analytical attention. I wish Manzi had included the other reason analytical professionals promote nonexperimental methods: they’re useful, as Manzi argued (see above).
Individual observational studies can mislead more than collections of them can. Nevertheless, individual observational studies can fail to accurately inform about causality. Manzi considers the work of Donohue and Levitt on abortion legalization’s causal effect on crime and the ensuing debate in the academic literature about their findings: they are not robust. Though each single study in this area can mislead (some suggest a causal effect, some don’t), the collection leads us closer to an answer: it does not appear very likely that legalized abortion reduces crime, or at least not much. If it did, the signal would be stronger and the evidence would more consistently show it. So, the science worked here, and without an RCT.
Another case study is that of smoking’s effect on lung cancer, which Manzi reviews. Here, many observational studies pointed, robustly, in the same direction. So, again, the science worked. (Note that an RCT is possible in neither the abortion case nor the smoking case. We can’t deliberately randomize people to smoking status any more than we can to availability of legal abortions.)
A key point* is that in neither of these cases, nor many others, could we know in advance what studies would robustly demonstrate, if anything. It’s only in hindsight that we can say that individual studies of the abortion-crime relationship can mislead while individual studies of the smoking-lung cancer relationship don’t. Doing the observational studies at the time was not a mistake in either case. What’s essential, of course, is that we did not one, but many, and in particular ones with methods that support causal inference with assumptions many find reasonable.
Intention-to-treat is not the only approach of value. Manzi is a big fan of the ITT principle and is critical of estimating treatment effects based on those who are randomized to and receive treatment (aka, “treatment compliers”).
[T]hose who are selected for treatment but refuse [it] or do not get [it] for some other reason could vary in some way from those who are selected and do receive it. For example, they might be more irresponsible, and therefore less likely to comply with treatment regimens for other unrelated conditions [that also affect outcomes].
Manzi made a similar point about the Oregon Medicaid study, suggesting that the subset of lottery winners who ultimately obtained Medicaid might be more prudent than those who won but did not follow through with subsequent enrollment requirements. Maybe the results are driven by such prudence rather than Medicaid itself. If so, they’d be biased estimates of effects of Medicaid.
There is a subtle point* worth clarifying here, and one I should have made in discussion of the Oregon Medicaid study: Because the investigators used lottery winning status as an instrumental variable (IV), it is not correct to interpret the results as driven by some factor other than treatment. By definition the lottery (randomization) cannot be correlated with outcomes except through its effect on Medicaid (treatment) status. A way to think about this is that the proportion of prudent people among lottery winners is the same, in expectation, as the proportion among those who did not win. Prudence is working on both sides and, so, cannot bias estimates of treatment effects. What one obtains in an analysis of this type is an IV estimate called a “local average treatment effect” (LATE). It’s the average treatment effect over the subset of the population whose Medicaid status is affected by the instrument, the “compliers.”
Now, it is correct to say that the compliers are different, perhaps more prudent, and that threatens generality of the findings. That’s why this is subtle. On one hand, one has a genuine treatment effect (the LATE). It’s not a “prudence” effect. It’s not a biased treatment effect. On the other hand, it’s not the effect of treatment on segments of the population not affected by the lottery, and they could be different. In other words, there are heterogeneous treatment effects, and the LATE estimate is just one of them (or an average of a subset of them).
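To make the distinction concrete, here is a minimal simulation sketch. It is not the Oregon study's analysis; every number in it (a 50% chance of winning, 40% of people being "prudent," a treatment effect of 2.0, prudence raising outcomes by 3.0) is an invented assumption, as are the function names. Only prudent winners enroll, so prudence both drives enrollment and affects outcomes directly. The sketch compares the ITT estimate, the IV (Wald) estimate of the LATE, and the naive comparison of winners who enrolled with winners who did not.

```python
import random


def simulate(n=200_000, effect=2.0, seed=0):
    """Simulate a lottery in which only 'prudent' winners take up treatment.

    All parameters are illustrative assumptions, not estimates from any study.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        won = rng.random() < 0.5        # randomized lottery (the instrument)
        prudent = rng.random() < 0.4    # unobserved trait, independent of the lottery
        treated = won and prudent       # only prudent winners enroll (the compliers)
        # Prudence raises the outcome on its own; treatment adds `effect`.
        y = 3.0 * prudent + effect * treated + rng.gauss(0, 1)
        rows.append((won, treated, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def estimates(rows):
    y_win = mean(y for won, _, y in rows if won)
    y_lose = mean(y for won, _, y in rows if not won)
    d_win = mean(t for won, t, _ in rows if won)       # take-up rate among winners
    d_lose = mean(t for won, t, _ in rows if not won)  # take-up rate among losers (zero here)

    itt = y_win - y_lose             # effect of winning the lottery
    late = itt / (d_win - d_lose)    # Wald / IV estimate: effect of treatment on compliers

    # Naive comparison among winners: enrolled versus not enrolled.
    # This one IS contaminated by prudence.
    naive = (mean(y for won, t, y in rows if won and t)
             - mean(y for won, t, y in rows if won and not t))
    return {"itt": itt, "late": late, "naive": naive}


if __name__ == "__main__":
    for name, value in estimates(simulate()).items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the IV estimate should land near the true treatment effect of 2.0, the ITT estimate near 0.8 (the effect scaled by the 40% take-up rate), and the naive comparison near 5.0, because it folds the prudence effect into the treatment effect. The IV estimate is not a prudence effect, but it speaks only to the compliers, who in this sketch are exactly the prudent.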
An ITT estimate is different, but that doesn’t make it more correct in general. It all depends on what question one is asking. Rubin offered some examples from the literature in which one specifically would not want an ITT estimate:
[I]n some settings […] one may want to estimate the effect of treatment versus control on those units who received treatment. Two prominent examples come to mind. The first involves the effect on earnings of serving in the military when drafted following a lottery and the attendant issue of whether society should compensate those who served for possible lost wages (Angrist, 1990). The second example involves the effect on health care costs of smoking cigarettes for those who chose to smoke because of misconduct of the tobacco industry (Rubin, 2000).
In both cases, an ITT estimate would dilute the very effect of import. It would not answer the specific question asked. This is related to the fact that one way of guaranteeing finding no program effect is by implementing a very large lottery relative to the number of treatment slots and then estimating the ITT effect. That’s a genuine limitation of ITT. There are others, as I noted in my ITT post. See also West and Thoemmes who wrote:
However, Frangakis and Rubin (1999) and Hirano et al. (2000) showed that the ITT estimate can be biased if both nonadherence and attrition occur, and West and Sagarin (2000) raised concerns about replicating the treatment adherence process in different investigations.
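The dilution point made above (a very large lottery relative to the number of treatment slots) comes down to simple arithmetic, sketched below. The function name and the numbers are hypothetical. The sketch assumes that no lottery losers obtain treatment, that every slot is filled by a winner, and that the effect on those treated is the same regardless of lottery size. Under those assumptions the ITT effect equals the effect on the treated multiplied by the share of winners who are actually treated.

```python
def itt_effect(effect_on_treated, n_winners, n_slots):
    """ITT effect when only `n_slots` of `n_winners` lottery winners get treated.

    Assumes no losers are treated and a constant effect on the treated
    (both are simplifying assumptions for illustration).
    """
    if n_winners <= 0:
        raise ValueError("n_winners must be positive")
    take_up = min(n_slots, n_winners) / n_winners  # share of winners treated
    return effect_on_treated * take_up


if __name__ == "__main__":
    # Same program, same 100 slots, same effect on those treated;
    # only the number of lottery winners changes.
    for winners in (100, 1_000, 10_000):
        print(winners, itt_effect(10.0, winners, 100))
```

As the number of winners grows with the slots held fixed, the ITT estimate shrinks toward zero even though the program's effect on those it serves is unchanged, and in a real sample it would eventually be indistinguishable from noise.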
My broader point, which is consistent with most of Manzi’s approach in Uncontrolled, is that rather than promote one estimation methodology over another in general, we should identify and disclose the strengths and limitations of each. Experiments are not uniformly superior to nonexperimental methods. ITT is not uniformly superior to LATE. We can usefully employ a variety of approaches, provided we’re clear on the boundaries of their applicability. Unfortunately, this is not a common view. The mantra that “RCTs are the gold standard” is a bit glib; it’s not as helpful a guide through the methodological thicket as some take it to be. (This is not a critique of the book, which, as I wrote, is not as RCT-centric as some seem to believe.)
Research won’t settle everything. Manzi concludes the book with many policy suggestions. The basic thrust of them is to support and conduct more research (including experiments) in social policy and to establish governmental institutions that would use the results of such work to inform policy change. At that level of generality, we are in agreement. I won’t get into details and quibbles here except to say that there is one thing that such an enterprise cannot do: establish which endpoints are important to study and upon which policy should turn.
For instance, should we expand or contract Medicaid programs that are shown by randomized experiments to improve mental health and financial protection but do not provide conclusive results on physical health outcomes? Should we expand or contract jobs training programs that increase employment by, say, 5% and income by 3%? What if the numbers were 50% and 30%? Should we measure health outcomes of jobs programs? Should we measure employment outcomes of health programs? What if e-cigarettes cause 75% less lung cancer than regular cigarettes, should they be regulated differently? What if marijuana use leads to more alcohol consumption, should it be legalized? Should all (or more) studies of marijuana include that endpoint? What about the effect of marijuana use on tobacco use? On risky sex? On educational attainment? On income? What endpoints are important for policy? (These are all hypothetical examples.)
My point* is that reasonable people could argue at length about what to measure, how to measure it, and what constitutes sufficient change in what measures to warrant broader policy intervention. Heck, we can argue about methods at great length: how and when to do power calculations, implications of attrition in RCTs, and validity of instruments, for example. Good observational research and experiments are important, but they’re just first steps. They don’t end debates. They don’t by themselves reveal how we ought to change policy. They only give us some partial indication of what might happen if we did so, and only where we choose to look.
Read the book and give it some thought.
* These are all points I’m making, not ones Manzi made in the book, unless I missed them.