• How much would you pay for a quality adjusted life year?

    It’s happened to me numerous times and, judging by my email, to my colleague Keith Humphreys recently. Sometimes you just need a reasonable figure and something to cite for a typical willingness to pay for a quality adjusted life year. For that reason, even before Keith’s email, I had intended to dump this abstract, from a recent literature review by Linda Ryen and Mikael Svensson, into a post:

    There has been a rapid increase in the use of cost-effectiveness analysis, with quality adjusted life years (QALYs) as an outcome measure, in evaluating both medical technologies and public health interventions. Alongside, there is a growing literature on the monetary value of a QALY based on estimates of the willingness to pay (WTP). This paper conducts a review of the literature on the WTP for a QALY. In total, 24 studies containing 383 unique estimates of the WTP for a QALY are identified. Trimmed mean and median estimates amount to 74,159 and 24,226 Euros (2010 price level), respectively. In regression analyses, the results indicate that the WTP for a QALY is significantly higher if the QALY gain comes from life extension rather than quality of life improvements. The results also show that the WTP for a QALY is dependent on the size of the QALY gain valued.

    In 2010, as today, one gets about 0.75 Euros per dollar. So the trimmed mean and median estimates for the WTP for a QALY are $98,879 and $32,301, respectively. Now you know and have something to cite.
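    For anyone who wants to adapt these figures, the conversion arithmetic is easy to script. A minimal sketch, using the approximate 0.75 Euros per dollar rate mentioned above:

```python
# Convert the review's Euro-denominated WTP-per-QALY estimates to dollars,
# using the approximate 2010 exchange rate of 0.75 Euros per dollar.
eur_per_usd = 0.75

wtp_eur = {"trimmed_mean": 74_159, "median": 24_226}  # 2010 price level
wtp_usd = {name: round(value / eur_per_usd) for name, value in wtp_eur.items()}

print(wtp_usd)  # {'trimmed_mean': 98879, 'median': 32301}
```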


    Comments closed
  • Methods: There are bad IV studies, but IV is not bad

    In a post last week—which you should read if you don’t know what an “instrumental variable” (IV) is—I described the key assumption for a good IV: it’s uncorrelated with any unobservable factor that affects the outcome. “Unobservable” is a term of art. Practically speaking, it actually means anything not controlled for in the system of equations used to produce the IV estimate.

    Example: Consider a study of the effect of cardiac catheterization in heart attack patients on mortality in which the investigators employ the following IV: the difference in distance between (a) each patient’s residence and the nearest hospital that offers cardiac catheterization and (b) each patient’s residence and the nearest hospital of any type. (See McClellan, McNeil, and Newhouse, for example.) Is this differential distance a good instrument? Without controlling for other factors, probably not.

    It’s no doubt true that the closer patients live to a hospital offering catheterization relative to the distance to any hospital, the more likely they are to be catheterized. But hospitals that offer catheterization may exist in areas that are disproportionately populated by patients of a certain type (e.g., more urban areas where patients have a different racial mix than elsewhere) and, in particular, of a type that experiences different outcomes than others. To the extent the instrument is correlated with observable patient factors (like race) that affect outcomes, the IV can be saved. One just controls for those factors by including them as regressors in the IV model.

    The trouble is, researchers don’t always include the observable factors they should in their IV specifications. For instance, in the above example if race is correlated with differential distance and mortality, leaving it out will bias the estimate of the effect of catheterization on mortality. This is just akin to an RCT in which treatment/control assignment is partially determined by a relevant health factor that isn’t controlled for. Bad, bad, bad. (But, this can happen, even in an RCT designed and run by good people.)
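    A small simulation can make the omitted-variable problem concrete. Everything below is hypothetical (the covariate standing in for an observable patient factor, the coefficients, the instrument), but it sketches how an IV estimate goes wrong when an observable factor correlated with both instrument and outcome is left out, and how including it as a regressor fixes things:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical setup: x is an observable patient factor (think urban residence)
# that both shifts the instrument and independently affects the outcome.
x = rng.binomial(1, 0.5, n).astype(float)
z = 0.8 * x + rng.normal(size=n)                       # instrument correlated with x
d = (0.7 * z + rng.normal(size=n) > 0).astype(float)   # treatment driven partly by z
y = 1.0 * d - 2.0 * x + rng.normal(size=n)             # true treatment effect is 1.0

def tsls(y, d, z, controls=None):
    """Two-stage least squares; controls are included in both stages."""
    ones = np.ones_like(y)
    exog = ones[:, None] if controls is None else np.column_stack([ones, controls])
    Z = np.column_stack([exog, z])                     # instrument set
    d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]   # first stage: predict d
    X = np.column_stack([exog, d_hat])
    return np.linalg.lstsq(X, y, rcond=None)[0][-1]    # coefficient on treatment

biased = tsls(y, d, z)                # omits x: the instrument is invalid as used
adjusted = tsls(y, d, z, controls=x)  # conditions on x: instrument validity restored
```

    In this toy setup, `biased` lands far from the true effect of 1.0, while `adjusted` recovers it, because conditioning on x leaves only the part of the instrument that is unrelated to the outcome.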

    In a recent Annals of Internal Medicine paper, Garabedian and colleagues found substantial problems of this type in a large sample of IV studies. They examined 65 IV studies published before 2012 with mortality as an outcome and that used any of the four most common instrument types: facility distance (e.g., differential distance), or practice patterns at the regional, facility, or physician level. Then they scoured the literature to find evidence of confounding for these instruments and categorized the observable factors that, if not controlled for, could invalidate them. Finally, they tabulated the proportion of studies that controlled for various potential confounders. That table is below.

    IV confounding

    Some confounders, like patient comorbidities, are often included as controls. Others, like education or procedure volume, are much less commonly included. This is worrisome. Here are the authors’ final two sentences:

    [1] Any instrumental variable analysis that does not control for likely instrument–outcome confounders should be interpreted with caution.

    This is excellent advice. I agree.

    [2] Although no observational method can completely eliminate confounding, we recommend against treating instrumental variable analysis as a solution to the inherent biases in observational CER [comparative effectiveness research] studies.

    My emphasis is added, because this is devastating and, I think, inconsistent with the paper’s contribution. I cannot agree with it. Here’s why:

    • It’s one thing to find many IV studies are poorly done, but it’s quite another to suggest they should not play a role in addressing bias in observational CER. Indeed, there are many poor studies using any technique. There are bad propensity score studies. There are flawed RCTs. Does that make these inappropriate techniques for CER?
    • Let’s consider the alternatives: Propensity scores do not address confounding from unobservables, so they can’t be a complete solution to all CER problems. RCTs are impractical in many cases. And even when they’re not, it’s a good idea to do some preliminary observational studies using methods most likely to credibly estimate causal effects. It seems to me we need IVs precisely because of these limitations with other techniques. (And we need the other techniques as well.)
    • The authors identified precisely how to do more credible IV studies. Shouldn’t we use that knowledge to do IVs that can better address selection bias instead of concluding we cannot?
    • Not every confounder the authors identified need be included in every IV study. Just because a factor might confound doesn’t mean it does confound. This is important because some factors aren’t readily available in research data (e.g., race, income, or education can be missing from the data—in many cases they can be included at an area level, however). To reassure ourselves that no important factors are among the unobservables, one can perform some falsification tests, which are very briefly mentioned by the authors. This validity check deserves far more attention, and is beyond the scope of this post.

    The Garabedian paper is a tremendous contribution, an instant classic. I applaud the authors for their excellent and laborious work. Anybody who is serious about IV should read it and heed its advice … right up until the last sentence. It’s possible to do good IV. The paper shows how and then, oddly, walks away from it.


    Comments closed
  • Working within the system

    Via Nicholas Bagley:


    In case you missed it, I posted on the squid’s point on Thursday.


    Comments closed
  • The deadweight loss of taxation to fund social services

    Discuss the cost of social services long enough and someone’s bound to remind you that the taxes that fund them impose a “deadweight loss.”

    A “deadweight huh wut?” you might think. As Uwe Reinhardt explains in a must-read post, the deadweight loss of taxation is a type of cost on society that arises from one fact and one assumption. Fact: money collected through taxes can’t be used freely for other purposes by the individuals from whom it was collected. Assumption: money spent freely by individuals in markets is more efficiently* deployed than that spent by government. (I’m not contesting that assumption here, but it is an assumption.)

    The concept of deadweight loss is often explained to students of microeconomics in a diagram, such as the one in my book Microeconomics Made Simple, and reproduced below with minor modification:

    effect of a tax

    For simplicity, the figure considers a market with one good, sold at the market price PE before imposition of a tax. At that price, consumers would buy quantity QE. This is where aggregate supply (S) and demand (D) curves meet. When a tax is imposed (e.g., a sales tax on this good), people pay more, PD, and suppliers receive less, PS. The difference goes to the government to fund social services and other things like waste, fraud, and abuse (#humor). It’s fairly intuitive that people will buy less of something that costs more and suppliers will produce less when they’re paid less for each unit sold. Under the tax, just QT of the good is bought/sold, as shown.

    That’s all pretty obvious without a chart. What’s harder to quantify from this intuition is the deadweight loss, or how much society loses from the imposition of a tax. How do we quantify the value of the lost economic activity? It’s the area of the shaded triangle in the chart. If you recall that the area of a triangle is half its base times height, you can compute the deadweight loss for this chart as

    deadweight loss = ½ × (PD - PS) × (QE - QT).
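    Plugging hypothetical numbers into that formula (none of these values come from the chart; they’re just for illustration):

```python
# Hypothetical market values for illustrating the deadweight loss formula.
p_demand = 12.0        # PD: price consumers pay under the tax
p_supply = 8.0         # PS: price suppliers receive under the tax
q_equilibrium = 100.0  # QE: quantity traded with no tax
q_taxed = 80.0         # QT: quantity traded under the tax

# Area of the shaded triangle: half its base times its height.
deadweight_loss = 0.5 * (p_demand - p_supply) * (q_equilibrium - q_taxed)
print(deadweight_loss)  # 40.0
```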

    That’s fine, as far as it goes, but is this a cost we should attribute to the in-kind provision of social services funded by taxation? How would you respond to a claim that it is? Save yourself the trouble and read Uwe’s excellent post. You’re sure to learn something.

    * This is itself a bit of jargon. Economic efficiency is not a simple concept. As a rough approximation, consider “more efficient” to mean “makes people happier and better off,” in some sense.


    Comments closed
  • Methods: A little bit about instrumental variables

    Though there are lots of sources to learn about instrumental variables (IV), in this post I’ll point to three papers I found particularly helpful.

    I’ve already written a tutorial post on IV, based on a paper by my colleague Steve Pizer. Two diagrams from that paper make clear that IV is a generalization of randomized controlled trials (RCTs). Conceptually, an RCT looks like this:


    Randomization (e.g., by the flip of a coin) ensures that the characteristics of patients in the treatment and comparison groups have equal expected values. The two groups are drawn from the same sample of recruits and the only factor that determines their group assignment is the coin flip, so, apart from the treatment itself, all other differences between the groups must by construction be random.

    An IV study could look like the diagram below. Notice that if you ignore the patient and provider characteristics boxes on the left and the lines that emanate from them and interpret the institutional factors box at the bottom as a coin flip, this looks exactly like an RCT.


    [In an IV study, a] large number of observed and unobserved factors [could] influence sorting into treatment and comparison groups. Many of these factors are also independently associated with differences in the outcome. These relationships are illustrated by the solid arrows connecting observed and unobserved patient and provider characteristics to sorting and the dashed arrows connecting these same characteristics directly to the outcome. The arrows directly to the outcome are dashed because these relationships are not the ones of primary interest to the investigator; in fact, these are potentially confounding relationships that could make it difficult or impossible to accurately measure the effect of treatment.

    What makes an IV study an IV study is the analytical exploitation of some “institutional factors” (e.g., laws, programs) or other measurable features of the world—called instrumental variables—that affect sorting into treatment and control groups, at least somewhat, and are, arguably,* not correlated with any unobservable patient or provider factors that also affect outcomes. That’s kind of a mind-bender, but notice that an RCT’s coin flip has these properties: it’s a measurable feature of the world, affects sorting, and is not correlated with any unobservable patient or provider factors. Other things in the world can, arguably,* act like a coin flip, at least for some patients: program eligibility that varies geographically (like that for Medicaid), for example.

    The algebraic expression of the foregoing difference between RCTs and IV studies by Katherine Harris and Dahlia Remler may also be informative to you (if you’re not math-averse). They consider patient i’s health outcome, yi, given by

    [1]     yi = β(hi)di + g(hi) + εi

    where hi denotes unobservable health status; di is a dichotomous variable that takes the value one if the patient is treated and zero otherwise; β(hi) + g(hi) is the expected health outcome if the patient receives the treatment; g(hi) is the expected health outcome if the patient does not receive the treatment; and εi represents the effect of other unobserved factors unrelated to health status. The effect of treatment for each individual is the difference between health outcomes in the treated and untreated state, β(hi). If treatment effects are homogenous, then β(hi) = β for everyone. If treatment effects are heterogeneous, then β(hi) is different for [at least some patients].

    Next, the probability that patient i receives treatment can be written

    [2]     P(di=1) = f(hi) + zi

    where f(hi) represents health status characteristics that determine treatment assignment, and zi represents factors uncorrelated with health status that have a nontrivial impact on the probability of receiving treatment.

    A potential problem in this setup is that treatment and outcome depend on health status hi, which is unobservable. If unobservably sicker people are treated and are also more likely to have a bad outcome (because they are sicker), that will bias our judgment of the effect of treatment. The way out is to find or manufacture a zi that determines treatment assignment for at least some patients in a way that is uncorrelated with unobservable health hi (as well as uncorrelated with the other unobserved factors that affect the outcome, εi).

    In experimental settings, researchers strive to eliminate the effect of health status on the treatment assignment process shown in Equation 2 by randomly generating (perhaps in the form of a coin-flip) values of zi such that they are uncorrelated with health status and then assigning subjects to treatment and control groups on the basis of its value. [...]

    In some nonexperimental settings, it may be possible to identify one or more naturally occurring zi, [IVs] that influence treatment status and are otherwise uncorrelated with health status. When this is the case, it is possible to estimate a parameter that represents the average effect of treatment among the subgroup of patients in the sample for whom the IV determines treatment assignment.
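    As a sketch of the logic in equations [1] and [2] (all coefficients below are made up): unobserved health drives both treatment and outcome, so a naive treated-versus-untreated comparison is biased, while a coin-flip-like instrument recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

h = rng.normal(size=n)             # unobservable health status (hi)
z = rng.binomial(1, 0.5, n)        # instrument: a coin-flip-like factor (zi)

# Sicker patients (higher h) are more likely to be treated; z nudges treatment too.
p_treat = 1.0 / (1.0 + np.exp(-(1.0 * h + 1.5 * (z - 0.5))))
d = rng.binomial(1, p_treat)

beta = 2.0                                     # true treatment effect
y = beta * d - 1.0 * h + rng.normal(size=n)    # sicker patients do worse

# Naive comparison confounds treatment with health status.
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald/IV estimate: effect of z on y, scaled by effect of z on d.
iv = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
```

    Running this, `naive` is pulled well below the true effect of 2.0 while `iv` lands close to it, which is the whole point of finding a valid zi.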

    Harris and Remler go on to discuss more fully (with diagrams!) the subgroup of patients to which an IV estimate (aka, the local average treatment effect or LATE) applies when treatment effects are heterogeneous. With Monte Carlo simulations, they show that LATE estimates can differ considerably from the average treatment effect (ATE) one would obtain if one could estimate it for the entire population. Their explanation is beautiful and well worth reading, but too long for this post.
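    In that spirit, a minimal toy version of such a simulation (my own invented numbers, far simpler than Harris and Remler’s) shows LATE and ATE diverging when compliers’ treatment effect differs from everyone else’s:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000

# Latent compliance types: always-takers, never-takers, and compliers.
group = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.binomial(1, 0.5, n)  # instrument; only compliers' treatment responds to it
d = np.where(group == "always", 1, np.where(group == "never", 0, z))

# Heterogeneous effects: compliers benefit less than the other types would.
beta_i = np.where(group == "complier", 1.0, 3.0)
y = beta_i * d + rng.normal(size=n)

ate = beta_i.mean()  # average effect over the whole population (about 2.0 here)
late = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
# late comes out near 1.0: the IV recovers the compliers' effect, not the ATE
```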

    I’ll conclude with a shout out to one more worthwhile IV tutorial paper: the MDRC Working Paper “Using Instrumental Variables Analysis to Learn More from Social Policy Experiments,” by Lisa Gennetian, Johannes Bos, and Pamela Morris. As with the Pizer and Harris/Remler papers, it’s worth reading in full.

    * “Arguably” because one needs to provide an argument for the validity of an instrumental variable. This is a mix of art and science, well beyond the scope of this post. I will come back to this in the future.

    Comments closed
  • Why you might work to improve the very thing you’d rather blow up

    Interviewed by Terry Gross a couple of weeks ago, veterinarian Vint Virga said something about reforming zoos that I thought had relevance for reforming, well, anything.

    Virga said he didn’t care for zoos. I got the impression he’d rather they did not exist because, in his view, they don’t improve the welfare of animals, in general. At one time, he contemplated not working for zoos. But he does. Why? Because zoos aren’t going away any faster if he doesn’t work for them. As such, he feels that he would do more harm to animal welfare by not doing so. The animal care techniques he discovers and implements improve the conditions of animals within zoos, but not to the extent their conditions would improve if zoos didn’t exist at all.

    If you can’t beat ‘em, join ‘em.

    Whether or how to engage with an institution (like zoos) or policy area (like Medicaid or CIA interrogation techniques) that one may find broadly distasteful or fundamentally, structurally flawed is something many of us face. In the hopes of speeding its demise, one could protest and boycott, withholding expertise that might lead to marginal improvements. A related variant is proposing radically different, yet, to one’s point of view, far better alternatives. This makes sense to the extent one’s protest or radical proposals will really change things. One has to ask oneself: Honestly, will they? That depends on who one is, perhaps, as well as a host of unknowable factors—the alignment of the political and cultural stars. What if the most likely outcome is no effect, maintenance of the status quo? Oops.

    Another approach is to engage in an effort to make small, more attainable improvements. This is distinct from supporting the enterprise or policy in general, but it could have the effect of making it work better, solidifying its legitimacy. Is that a distinction with or without a difference? If one can indeed make small improvements, it’s better than the status quo, but far worse than a (possibly unattainable) counterfactual world in which the institution or policy is killed outright or replaced with something very different. It’s going for a local optimum rather than the global one that’s out of reach.

    Research should and can be scientific. Policy change could be evidence-informed. But whether and precisely how to engage, how far to push, and on what, is more of an art. It confronts politics and culture, even religion and ethics. The judgement of our community is relevant. What will your friends, family, and colleagues think? Have you sufficiently demonstrated your distaste for policy X, institution Y, and fealty to vision Z to remain in their good graces?

    Do we attempt the more achievable goal of making the habitat a bit more comfortable? Or do we go for the far less likely outcome of tearing down zoos’ walls? In contemplating such questions, I doubt there’s anything that can guide us any more reliably than our own conscience.


    Comments closed
  • The cost-effectiveness of treatment for opioid dependency

    The following originally appeared on The Upshot (copyright 2014, The New York Times Company).

    Once championed as the answer to chronic pain, opioid medications and painkillers have become a large and costly problem in the United States. Fatal overdoses have quadrupled in the last 15 years, and opioids now cause more deaths than any other drug, over 16,000 in 2010. Prescription opioid abuse is also costly, sapping productivity and increasing health care and criminal justice costs to the tune of $55.7 billion in 2007, for example.

    Addressing this problem would cost money, too, but evidence suggests it would pay for itself.

    Much of the problematic use of opioids like Vicodin and OxyContin originates with a prescription to treat pain. Prescriptions can be of tremendous help to certain patients. But doctors write a lot more of them for opioids in the United States than they should, enough for every American adult to have a bottle of opiate painkillers each year, according to the U.S. Centers for Disease Control and Prevention.

    One approach to the problem, therefore, is to attempt to reduce the number of prescriptions written by targeting use to patients for whom the medications are most appropriate. In July 2012, for example, Blue Cross Blue Shield of Massachusetts began requiring prior authorization for more than a 30-day supply of opioid medication within a two-month period. After 18 months, the insurer estimated that it had cut the volume of pills prescribed by 6.6 million.

    Except for Missouri, all states have or plan to soon have drug databases in place that can be used to track prescribers of opioid painkillers and those who use them. Not all states require doctors to check the database before prescribing the medications, though — a provision that has been demonstrated to reduce overuse. For example, in 2012 the states of New York and Tennessee began requiring doctors to check their states’ drug monitoring databases before prescribing opiate painkillers. Both states saw substantial drops in patients who received duplicate prescriptions of drugs from different doctors, one pathway for overuse and overdose.

    Prevention and better targeting are sensible, but they do little to assist patients already dependent on opioids. Those patients need help, and more than just a short stint in a detox facility. The modern view of opioid dependency is that it’s akin to a chronic disease, like diabetes or hypertension, which requires maintenance therapy: long-term treatment with a craving-relieving substitution agent like methadone or buprenorphine, sometimes combined with other medications.

    Such treatment isn’t free, of course, but many studies have shown it’s worth the cost. Just considering health care costs alone, studies have shown it to be cost-effective, with substantial offsets in reduced spending on other types of health care. For example, methadone treatment limits the spread of H.I.V. by reducing the use and sharing of needles. It also reduces hospitalization and emergency department use, such as treatment of trauma, perhaps from accidents. Considering the broader, social costs of opioid dependence as well — those due to lost productivity and crime — the case that maintenance therapy for opioid dependence is cost saving is even stronger.

    Recently, the Comparative Effectiveness Public Advisory Council conducted an economic analysis of expanding opioid treatment in New England states. The council found that as access to treatment increased, total costs of treatment also grew, but savings to society increased even more rapidly. The result is that greater treatment actually saves society money. For instance, New England states could save $1.3 billion by expanding treatment of opioid-dependent persons by 25 percent.

    cepac opioid

    If maintenance therapy is such a good deal, why don’t we more readily provide it? One answer is that, though treatment works, its benefits are diffuse. A great deal of the cost of treatment would be borne by insurers and public health programs. But a great deal of the savings would be captured by society at large (through a reduction in crime, for example). As my colleague Keith Humphreys and co-authors wrote, “If, for example, one is held responsible to keep a hospital budget in balance, spending scarce funds on [substance use disorder] treatment does not become more attractive just because it saves money for the prison system.”

    Another reason maintenance therapy for opioid dependency is underprovided is that it is still misunderstood. Culturally, there’s a temptation to view dependency as a result of poor lifestyle choices, not as a chronic disease, and to view maintenance treatment as merely substituting one addiction for another. (This is akin to viewing chronic insulin use as a mere substitute for chronic diabetes.) And, to be fair, there are real issues, such as the potential for misuse and diversion, associated with maintenance drug therapy for opioid dependency. Those deserve some attention, but not out of proportion to the problems they pose.

    It’s clear that treatment for opioid dependency is underprovided for a variety of reasons, and that this, in turn, helps promote the growth in the problems dependency causes. But it’s also clear that those dependent on opioids aren’t the only victims. Because of the social costs the problem causes, many others are as well.


    Comments closed
  • AcademyHealth: Everything you wanted to know about Medicare Advantage

    My new post on the AcademyHealth blog summarizes some of the points made in an excellent paper on Medicare Advantage by Joseph Newhouse and Tomas McGuire. Go read the post!



    Comments closed
  • A response from Jim Manzi

    Responding to my post on his book, Jim Manzi wrote me with some clarifications. His points are fair and reasonable and he authorized posting the following quote.

    I basically agree with the vast majority of what you wrote, and will resist the urge to write 8,000 words on the incredible nuances of every detail of my thinking.

    The one overarching comment is that a huge theme of the book wasn’t how great experiments are, but rather the depth of our ignorance about the effects of our non-coercive interventions into human society. The plea of the book was to recognize this when making decisions.

    My advocacy of ITT as the default position for evaluating a trial is not because this answers the question we most want answered (it very often does not, for the reasons you describe), but because at least we can have some confidence about internal validity. That is, we can know the answer to some question, as opposed to potentially fooling ourselves about the answer to an even more important question.

    Similarly, when it comes to non-experimental methods, as you note I advocate using them. But I tried to make the point that even a collection of such analyses doesn’t only compete with other analytical methods or “make the decision blindly,” but also with the alternative of allowing local operational staff to make decisions with wide discretion. Over time, and considered from a high level perspective, this is a process of unstructured trial-and-error, which I see as the base method for making progress in knowledge (or at least, as I put it, implicit knowledge) of human society.

    Finally, one narrow point is that I think (and tried to describe at length in the book) why the causal mechanism in smoking – lung cancer is qualitatively different than that of social interventions, and therefore why the Hill approach [relying on many nonexperimental studies when RCTs are impossible] does not generalize well from medicine to sociology.

    I think it’s difficult to say that because the Levitt abortion-crime regression isn’t robust, therefore we can conclude that abortion didn’t cause some material reduction in crime. A key argument in the book is that the regression (or more generally, pattern-finding) method is insufficient to tease out causal effects of interventions. As I said in the book, I think that the rational conclusion, based only on the various analyses published on the subject, isn’t “no material effect,” but rather “don’t know.”

    This goes back to the applicability of the Hill method to social phenomena. It’s why I think the research that compares non-experimental estimates of intervention effects to what is subsequently measured in RCTs and shows these methods don’t reliably predict the true effect is so important. And the kinds of interventions that can be subjected to RCTs are generally simpler than the kinds of things like “legalize abortion in several American states” that cannot be. So, if anything the very interventions that are analyzed non-experimentally should be harder to evaluate than the kinds of interventions that are subject to testing.

    What I think would be very practically useful would be to have a large enough sample of paired non-experimental and RCT analyses of the same intervention, so that we could have rules of thumb for where the non-experimental approaches provide reasonable estimates of causal effect. I researched this for the book, and while a number of studies have been done (I footnoted them), there is nothing like the breadth of coverage to allow such an analysis as far as I can see.

    Thanks Jim, you get the final word!


    Comments closed
  • *Uncontrolled*

    The ideas in Uncontrolled, by Jim Manzi, are not only worth reading but worth contemplating deeply. The book has three parts which focus on (1) the history, theory, and philosophy of the scientific process, (2) the application of scientific and statistical methods to social science, (3) policy recommendations. In this post, I’m going to ignore (1), write only briefly about (3), and focus mostly on a few ideas in (2). This does not imply I endorse or think unimportant parts of the book about which I don’t specifically comment.

    The book is not pure randomized controlled trial (RCT) boosterism. Based on reading summaries by others, I expected the book to suggest we do more RCTs and only RCTs. David Brooks wrote, “What you really need to achieve sustained learning, Manzi argues, is controlled experiments.” Trevor Butterworth wrote, “The hero of Uncontrolled is the randomized controlled trial.”

    But this is not exactly what the book is about. The great virtue of Uncontrolled is that it covers both strengths and limitations of experimental and nonexperimental study designs. For instance, Manzi summarizes some of Heckman’s critiques of RCTs, one of which boils down to threats to external validity. Manzi goes on to articulate how experimental and nonexperimental studies should work together in many areas of human pursuit, including policy evaluation. Some example passages:

    But experiments [by a firm] must be integrated with other nonexperimental methods of analysis, as well as fully nonanalytical judgments, just as we saw for various scientific fields. [Page 142]


    The randomized experiment is the scientific gold standard of certainty of predictive accuracy in business, just as it is in therapeutic medicine. [...] [But] significant roles remain for other analytical methods. [Page 154]

    About those roles:

    [O]ne role for nonexperimental methods is as an alternative to an [RCT] when they are not practical. [...] A second role for nonexperimental methods is for preliminary program evaluation. [...] A third role for nonexperimental methods is to develop hypotheses that can be subsequently rigorously tested. [Pages 154-157]

    Supporting more observational studies is not just turf protection. In light of the above, this passage felt counterproductive:

    Analytical professionals resist using randomized experiments, because doing so renders previously valued skills less important. In the social sciences, for example, many of the exact features of [RCTs] that make them valuable—accuracy, simplicity, and repeatability—mean that they devalue the complex, high-IQ, and creative skills of mathematical modelers (at least for some purposes).

    Maybe, though I know of no analytical professional who would not endorse the notion that RCTs, when and where possible, are the best way to make a wide range of, but not all, causal inferences. Moreover, very interesting methodological challenges arise from RCTs, in part because they’re rarely without some imperfections that can benefit from some analytical attention. I wish Manzi had included the other reason analytical professionals promote nonexperimental methods: they’re useful, as Manzi argued (see above).

    Individual observational studies can mislead; collections of them are far less likely to. Manzi considers the work of Donohue and Levitt on abortion legalization's causal effect on crime and the ensuing debate in the academic literature about their findings: they are not robust. Though each single study in this area can mislead (some suggest a causal effect, some don't), the collection leads us closer to an answer: it does not appear very likely that legalized abortion reduces crime, or at least not by much. If it did, the signal would be stronger and the evidence would show it more consistently. So the science worked here, and without an RCT.

    Another case study is that of smoking's effect on lung cancer, which Manzi reviews. Here, many observational studies pointed, robustly, in the same direction. So, again, the science worked. (Note that neither in the case of abortion nor in that of smoking is an RCT possible. We can't deliberately randomize people to smoking status any more than we can to the availability of legal abortion.)

    A key point* is that in neither of these cases, nor many others, could we know in advance what studies would robustly demonstrate, if anything. It's only in hindsight that we can say individual studies of the abortion-crime relationship can mislead while individual studies of the smoking-lung cancer relationship don't. Doing the observational studies at the time was not a mistake in either case. What's essential, of course, is that we did not one study, but many, and in particular ones with methods that support causal inference under assumptions many find reasonable.

    Intention-to-treat is not the only approach of value. Manzi is a big fan of the ITT principle and is critical of estimating treatment effects based only on those who are randomized to treatment and actually receive it (aka, the "treatment compliers").

    [T]hose who are selected for treatment but refuse [it] or do not get [it] for some other reason could vary in some way from those who are selected and do receive it. For example, they might be more irresponsible, and therefore less likely to comply with treatment regimens for other unrelated conditions [that also affect outcomes].

    Manzi made a similar point about the Oregon Medicaid study, suggesting that the subset of lottery winners who ultimately obtained Medicaid might be more prudent than those who won but did not follow through with subsequent enrollment requirements. Maybe the results are driven by such prudence rather than Medicaid itself. If so, they’d be biased estimates of effects of Medicaid.

    There is a subtle point* worth clarifying here, and one I should have made in discussion of the Oregon Medicaid study: Because the investigators used lottery winning status as an instrumental variable (IV), it is not correct to interpret the results as driven by some factor other than treatment. By definition the lottery (randomization) cannot be correlated with outcomes except through its effect on Medicaid (treatment) status. A way to think about this is that, thanks to randomization, the proportion of prudent people among lottery winners is the same (in expectation) as among lottery losers. Prudence is present on both sides and, so, cannot bias estimates of treatment effects. What one obtains in an analysis of this type is an IV estimate called a "local average treatment effect" (LATE). It's the average treatment effect over the subset of the population whose Medicaid status is affected by the instrument, the "compliers."

    Now, it is correct to say that the compliers are different, perhaps more prudent, and that threatens the generality of the findings. That's why this is subtle. On one hand, one has a genuine treatment effect (the LATE). It's not a "prudence" effect. It's not a biased treatment effect. On the other hand, it's not the effect of treatment on segments of the population unaffected by the lottery, and they could be different. In other words, there are heterogeneous treatment effects, and the LATE estimate is just one of them (or an average of a subset of them).
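
    To see why randomization protects the LATE from a confound like prudence, here's a small simulation. It's my illustration, not Manzi's and not the Oregon investigators' model; all the numbers are made up. Prudent people have better outcomes regardless of Medicaid, and only prudent lottery winners follow through and enroll, yet the Wald/IV estimate recovers the true treatment effect among compliers rather than a prudence-contaminated one:

    ```python
    import random

    random.seed(0)
    n = 200_000
    TRUE_EFFECT = 2.0  # assumed effect of treatment among compliers (hypothetical)

    data = []
    for _ in range(n):
        prudent = random.random() < 0.5
        win = random.random() < 0.5   # the lottery: a randomized instrument
        treated = win and prudent     # only prudent winners follow through
        # prudence raises outcomes by 1.0 with or without treatment (a confound)
        y = (1.0 if prudent else 0.0) \
            + (TRUE_EFFECT if treated else 0.0) \
            + random.gauss(0, 1)
        data.append((win, treated, y))

    def mean(xs):
        return sum(xs) / len(xs)

    y_win = mean([y for w, t, y in data if w])
    y_lose = mean([y for w, t, y in data if not w])
    share_treated_win = mean([t for w, t, y in data if w])
    share_treated_lose = mean([t for w, t, y in data if not w])

    itt = y_win - y_lose  # diluted: only half of winners are treated
    late = itt / (share_treated_win - share_treated_lose)  # Wald/IV estimate

    print(f"ITT  ~ {itt:.2f}")
    print(f"LATE ~ {late:.2f} (true complier effect: {TRUE_EFFECT})")
    ```

    The share of prudent people is the same among winners and losers, so prudence cancels in the numerator; dividing by the difference in treatment shares undoes the dilution, leaving an estimate near the assumed complier effect of 2.0 rather than anything inflated by prudence.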

    An ITT estimate is different, but that doesn't make it more correct in general. It all depends on what question one is asking. Rubin offered some examples from the literature of cases in which one would specifically not want an ITT estimate:

    [I]n some settings [...] one may want to estimate the effect of treatment versus control on those units who received treatment. Two prominent examples come to mind. The first involves the effect on earnings of serving in the military when drafted following a lottery and the attendant issue of whether society should compensate those who served for possible lost wages (Angrist, 1990). The second example involves the effect on health care costs of smoking cigarettes for those who chose to smoke because of misconduct of the tobacco industry (Rubin, 2000).

    In both cases, an ITT estimate would dilute the very effect of interest. It would not answer the specific question asked. Relatedly, one way to guarantee finding no program effect is to implement a very large lottery relative to the number of treatment slots and then estimate the ITT effect. That's a genuine limitation of ITT. There are others, as I noted in my ITT post. See also West and Thoemmes, who wrote:

    However, Frangakis and Rubin (1999) and Hirano et al. (2000) showed that the ITT estimate can be biased if both nonadherence and attrition occur, and West and Sagarin (2000) raised concerns about replicating the treatment adherence process in different investigations.
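
    The dilution point above can be made concrete with a bit of arithmetic (my sketch, not from the book): under one-sided noncompliance, the ITT effect is just the LATE scaled by the take-up rate, so a lottery with vastly more winners than treatment slots mechanically drives the ITT estimate toward zero even when the treatment works:

    ```python
    # Under one-sided noncompliance (no one in the control arm gets treatment),
    # ITT = take_up * LATE, where take_up is the share of winners actually treated.
    TRUE_LATE = 2.0  # assumed effect of treatment on the treated (hypothetical)

    for take_up in (0.9, 0.1, 0.01):  # slots scarce relative to lottery winners
        itt = take_up * TRUE_LATE     # ITT shrinks toward zero with take-up
        recovered = itt / take_up     # rescaling by take-up recovers the LATE
        print(f"take-up {take_up:>4}: ITT = {itt:.3f}, LATE = {recovered:.1f}")
    ```

    At 1% take-up the ITT effect is a mere 0.02 even though the treatment effect on the treated is 2.0, which is why a sufficiently oversubscribed lottery can make any program look ineffective under ITT.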

    My broader point, which is consistent with most of Manzi’s approach in Uncontrolled, is that rather than promote one estimation methodology over another in general, we should identify and disclose the strengths and limitations of each. Experiments are not uniformly superior to nonexperimental methods. ITT is not uniformly superior to LATE. We can usefully employ a variety of approaches, provided we’re clear on the boundaries of their applicability. Unfortunately, this is not a common view. The mantra that “RCTs are the gold standard” is a bit glib; it’s not as helpful a guide through the methodological thicket as some take it to be. (This is not a critique of the book, which, as I wrote, is not as RCT-centric as some seem to believe.)

    Research won’t settle everything. Manzi concludes the book with many policy suggestions. The basic thrust of them is to support and conduct more research (including experiments) in social policy and to establish governmental institutions that would use the results of such work to inform policy change. At that level of generality, we are in agreement. I won’t get into details and quibbles here except to say that there is one thing that such an enterprise cannot do: establish which endpoints are important to study and upon which policy should turn.

    For instance, should we expand or contract Medicaid programs that are shown by randomized experiments to improve mental health and financial protection but do not provide conclusive results on physical health outcomes? Should we expand or contract job training programs that increase employment by, say, 5% and income by 3%? What if the numbers were 50% and 30%? Should we measure health outcomes of job training programs? Should we measure employment outcomes of health programs? If e-cigarettes caused 75% less lung cancer than regular cigarettes, should they be regulated differently? If marijuana use led to more alcohol consumption, should it be legalized? Should all (or more) studies of marijuana include that endpoint? What about the effect of marijuana use on tobacco use? On risky sex? On educational attainment? On income? What endpoints are important for policy? (These are all hypothetical examples.)

    My point* is that reasonable people could argue at length about what to measure, how to measure it, and what constitutes sufficient change in what measures to warrant broader policy intervention. Heck, we can argue about methods at great length: how and when to do power calculations, implications of attrition in RCTs, and validity of instruments, for example. Good observational research and experiments are important, but they’re just first steps. They don’t end debates. They don’t by themselves reveal how we ought to change policy. They only give us some partial indication of what might happen if we did so, and only where we choose to look.

    Read the book and give it some thought.

    * These are all points I’m making, not ones Manzi made in the book, unless I missed them.

