• Methods: There is no gold standard

    In “Instruments, Randomization, and Learning about Development,” Angus Deaton pulls no punches. He’s just as brutal, blunt, and precise about the pitfalls and misuse of instrumental variables (IV) as he is about those of randomized controlled trials (RCTs).

    I found insightful his emperor-has-no-clothes argument that the RCT is not deserving of its “gold standard” reputation, despite rhetoric to the contrary. I speculate RCTs have achieved their special status for several reasons:

    1. They are relatively conceptually simple, requiring less mathematical and statistical training than is required of many other methods. (Though the basic explanation of them hides a lot of complexity, which leads to improper use and interpretation, as Deaton shows.)
    2. RCTs address the problem of confounding from unobservables (though this fact is not unique to RCTs), which, historically, has been a major impediment to causal inference in social sciences and in the advancement of medicine. (As Deaton explains, such confounding is not the only problem confronting empirical methods, and RCTs do not necessarily address the others better than nonexperimental methods.)
    3. RCTs lend themselves to a routinized enterprise of  evidence-based change (e.g., in medicine) in a way that other strong methods for causal inference do not (or not yet). Equivalently simple approaches that could be easily routinized offer far weaker support for causal inference. It is plausible to me that promotion of RCTs as the methodologically strongest approach to causality has spared us from many more studies of associations that can’t come even close to RCTs’ validity for causal inference, imperfect though it may be. It’s possible association-type studies could do a lot of damage to human welfare. (Evidence-based, pre-RCT medicine was pretty sketchy, for example.) This, perhaps, is the strongest moral justification for claiming that RCTs are “the gold standard,” even if they do not merit that unique standing: a world in which that is less widely believed could be much worse.
    4. Perhaps because of the foregoing features of RCTs, they have been adopted as the method of choice by high-powered professionals and educators in medical science (among other areas). When one is taught and then repeats that RCTs are “the gold standard” and one is a member of a highly respected class, that view carries disproportionate weight, even if there is a very good argument that it is not necessarily the correct view (i.e., Deaton’s, among others). Another way to say this is that the goldenness of RCTs’ hue should be judged on the merits of each application; we should be careful not to attribute to RCTs a goldenness present in the tint of glasses we’ve been instructed to wear.

    Let me be clear, Deaton is not claiming (nor am I) that some other method is better than RCTs. He is simply saying that there does not exist one method (RCTs, say) that deserves preferential status, superior to all others for all subjects and all questions about them. I agree: there is no gold standard.

    At the same time, applying some standards  in judging methodology is necessary. How this ought to be done varies by context. Official bodies charged with guarding the safety of patients (e.g., the FDA or the USPSTF) are probably best served with some fairly hard-and-fast rules about how to judge evidence. Too much room for judgement can also leave too much room for well-financed charlatans to sneak some snake oil through the gate.

    Academics and the merit review boards that judge their research proposals or the referees that comment on their manuscripts have more leeway. My view in this context is that a lot rides on the precise question one is interested in, the theoretical or conceptual model one (or the community of scholars) thinks applies to it, and the data available to address it, among other possible constraints. This is not a set-up for a clean grading system; there’s no substitute for expertise and opinions will vary. These are major limitations of accepting that there is, in general, no hierarchy of methodological quality.

    Below are my highlights from Deaton’s paper, with my emphasis added. Each bullet is a direct quote.

    On IV

    • [Analysts] go immediately to the choice of instrument [], over which a great deal of imagination and ingenuity is often exercised. Such ingenuity is often needed because it is difficult simultaneously to satisfy both of the standard criteria required for an instrument, that it be correlated with [treatment] and uncorrelated with [unobservables affecting outcomes]. [...] Without explicit prior consideration of the effect of the instrument choice on the parameter being estimated, such a procedure is effectively the opposite of standard statistical practice in which a parameter of interest is defined first, followed by an estimator that delivers that parameter. Instead, we have a procedure in which the choice of the instrument, which is guided by criteria designed for a situation in which there is no heterogeneity, is implicitly allowed to determine the parameter of interest. This goes beyond the old story of looking for an object where the light is strong enough to see; rather, we have at least some control over the light but choose to let it fall where it may and then proclaim that whatever it illuminates is what we were looking for all along.
    • Angrist and Jorn Steffen Pischke (2010) have recently claimed that the explosion of instrumental variables methods [] has led to greater “credibility” in applied econometrics. I am not entirely certain what credibility means, but it is surely undermined if the parameter being estimated is not what we want to know.
    • Passing an overidentification test does not validate instrumentation. [Here's why.]

    On RCTs

    • The value of econometric methods cannot and should not be assessed by how closely they approximate randomized controlled trials. [...] Randomized controlled trials can have no special priority. Randomization is not a gold standard because “there is no gold standard” []. Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as “hard” while other methods are “soft.” These rhetorical devices are just that; metaphor is not argument, nor does endless repetition make it so.
    • One immediate consequence of this derivation is a fact that is often quoted by critics of RCTs, but often ignored by practitioners, at least in economics: RCTs are informative about the mean of the treatment effects [] but do not identify other features of the distribution. For example, the median of the difference is not the difference in medians, so an RCT is not, by itself, informative about the median treatment effect, something that could be of as much interest to policymakers as the mean treatment effect. It might also be useful to know the fraction of the population for which the treatment effect is positive, which once again is not identified from a trial. Put differently, the trial might reveal an average positive effect although nearly all of the population is hurt with a few receiving very large benefits, a situation that cannot be revealed by the RCT.
    • How well do actual RCTs approximate the ideal? Are the assumptions generally met in practice? Is the narrowness of scope a price that brings real benefits or is the superiority of RCTs largely rhetorical? RCTs allow the investigator to induce variation that might not arise nonexperimentally, and this variation can reveal responses that could never have been found otherwise. Are these responses the relevant ones? As always, there is no substitute for examining each study in detail, and there is certainly nothing in the RCT methodology itself that grants immunity from problems of implementation.
    • In effect, the selection or omitted variable bias that is a potential problem in nonexperimental studies comes back in a different form and, without an analysis of the two biases, it is impossible to conclude which estimate is better—a biased nonexperimental analysis might do better than a randomized controlled trial if enrollment into the trial is nonrepresentative. 
    • Running RCTs to find out whether a project works is often defended on the grounds that the experimental project is like the policy that it might support. But the “like” is typically argued by an appeal to similar circumstances, or a similar environment, arguments that depend entirely on observable variables. Yet controlling for observables is the key to the matching estimators that are one of the main competitors for RCTs and that are typically rejected by the advocates of RCTs on the grounds that RCTs control not only for the things that we observe but things that we cannot. As Cartwright notes, the validity of evidence-based policy depends on the weakest link in the chain of argument and evidence, so that by the time we seek to use the experimental results, the advantage of RCTs over matching or other econometric methods has evaporated. In the end, there is no substitute for careful evaluation of the chain of evidence and reasoning by people who have the experience and expertise in the field. The demand that experiments be theory-driven is, of course, no guarantee of success, though the lack of it is close to a guarantee of failure. 
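    The second bullet’s point about means and medians is easy to see in a simulation. Below is a minimal sketch (mine, not Deaton’s): nearly everyone is slightly harmed by treatment while a small minority gains a lot, so the RCT’s difference in means is positive even though the median treatment effect is negative. The difference in medians is identified, but it is not the median of the individual treatment effects.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Individual treatment effects: 90% of people are slightly hurt (-1),
# 10% get a very large benefit (+20). The mean effect is positive (+1.1),
# the median effect is negative (-1).
effects = np.where(rng.random(n) < 0.10, 20.0, -1.0)

y0 = rng.normal(50, 10, n)              # untreated potential outcome
y1 = y0 + effects                       # treated potential outcome

# An idealized RCT: randomize half to treatment, observe one outcome per person.
treated = rng.random(n) < 0.5
y_obs = np.where(treated, y1, y0)

diff_in_means = y_obs[treated].mean() - y_obs[~treated].mean()
diff_in_medians = np.median(y_obs[treated]) - np.median(y_obs[~treated])

print(f"True mean effect:          {effects.mean():+.2f}")
print(f"True median effect:        {np.median(effects):+.2f}")
print(f"RCT difference in means:   {diff_in_means:+.2f}")     # recovers the mean
print(f"RCT difference in medians: {diff_in_medians:+.2f}")   # not the median effect
```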

    The paper is very readable, though I skipped (or lightly skimmed) a middle section that did not appear to have a high density of general advice, if any. There’s some math, but it’s simple and, in some places, important for understanding key points, including a few I quoted above. Only in one or two spots did I find the words insufficient to understand the meaning. Perhaps they were a bit too efficient. Find the paper, ungated, here.

    @afrakt

  • Methods: There are bad IV studies, but IV is not bad

    In a post last week—which you should read if you don’t know what an “instrumental variable” (IV) is—I described the key assumption for a good IV: it’s uncorrelated with any unobservable factor that affects the outcome. “Unobservable” is a term of art. Practically speaking, it actually means anything not controlled for in the system of equations used to produce the IV estimate.

    Example: Consider a study of the effect of cardiac catheterization in heart attack patients on mortality in which the investigators employ the following IV: the difference in distance between (a) each patient’s residence and the nearest hospital that offers cardiac catheterization and (b) each patient’s residence and the nearest hospital of any type. (See McClellan, McNeil, and Newhouse, for example.) Is this differential distance a good instrument? Without controlling for other factors, probably not.

    It’s no doubt true that the closer patients live to a hospital offering catheterization relative to the distance to any hospital, the more likely they are to be catheterized. But hospitals that offer catheterization may exist in areas that are disproportionately populated by patients of a certain type (e.g., more urban areas where patients have a different racial mix than elsewhere) and, in particular, of a type that experiences different outcomes than others. To the extent the instrument is correlated with observable patient factors (like race) that affect outcomes, the IV can be saved. One just controls for those factors by including them as regressors in the IV model.

    The trouble is, researchers don’t always include the observable factors they should in their IV specifications. For instance, in the above example if race is correlated with differential distance and mortality, leaving it out will bias the estimate of the effect of catheterization on mortality. This is akin to an RCT in which treatment/control assignment is partially determined by a relevant health factor that isn’t controlled for. Bad, bad, bad. (But, this can happen, even in an RCT designed and run by good people.)
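    To make the point concrete, here’s a small simulated sketch (my own; the numbers and variable names are made up, not from McClellan et al.): an instrument like differential distance is correlated with an observable patient factor that also affects mortality. Leaving that factor out of the IV model badly biases the estimate; including it as a control recovers the true effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200_000
true_effect = -0.10     # hypothetical: catheterization lowers a mortality index

# x: an *observable* patient factor (think race/urbanicity) correlated with
# the instrument and independently related to mortality.
x = rng.binomial(1, 0.4, n).astype(float)

# z: "differential distance" instrument, correlated with x by construction.
z = rng.normal(0, 1, n) - 0.8 * x

# Treatment (catheterization) depends on the instrument and on x.
d = (0.5 - 0.3 * z + 0.2 * x + rng.normal(0, 1, n) > 0).astype(float)

# Outcome depends on treatment and on x, but not directly on z.
y = 0.3 + true_effect * d + 0.15 * x + rng.normal(0, 0.2, n)

def two_sls(y, d, z, controls=None):
    """Hand-rolled 2SLS: regress d on (z, controls), then y on (d_hat, controls)."""
    W = np.ones((len(y), 1)) if controls is None else sm.add_constant(controls)
    d_hat = sm.OLS(d, np.column_stack([W, z])).fit().fittedvalues
    return sm.OLS(y, np.column_stack([W, d_hat])).fit().params[-1]

print(f"IV, x omitted:  {two_sls(y, d, z):+.3f}")               # badly biased
print(f"IV, x included: {two_sls(y, d, z, controls=x):+.3f}")   # close to the truth
print(f"True effect:    {true_effect:+.3f}")
```

    (The second-stage standard errors from hand-rolled 2SLS need correction; the point here is only about where the coefficient lands.)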

    In a recent Annals of Internal Medicine paper, Garabedian and colleagues found substantial problems of this type in a large sample of IV studies. They examined 65 IV studies published before 2012 with mortality as an outcome and that used any of the four most common instrument types: facility distance (e.g., differential distance), or practice patterns at the regional, facility, or physician level. Then they scoured the literature to find evidence of confounding for these instruments and categorized the observable factors that, if not controlled for, could invalidate them. Finally, they tabulated the proportion of studies that controlled for various potential confounders. That table is below.

    [Table from Garabedian et al.: proportion of studies controlling for each category of potential instrument–outcome confounder]

    Some confounders, like patient comorbidities, are often included as controls. Others, like education or procedure volume, are much less commonly included. This is worrisome. Here are the authors’ final two sentences:

    [1] Any instrumental variable analysis that does not control for likely instrument–outcome confounders should be interpreted with caution.

    This is excellent advice. I agree.

    [2] Although no observational method can completely eliminate confounding, we recommend against treating instrumental variable analysis as a solution to the inherent biases in observational CER [comparative effectiveness research] studies.

    My emphasis is added, because this is devastating and, I think, inconsistent with the paper’s contribution. I cannot agree with it. Here’s why:

    • It’s one thing to find that many specific IV studies are poorly done, but it’s quite another to suggest IV, in general, should not play a role in addressing bias in observational CER. Indeed, there are many poor studies using any technique. There are bad propensity score studies. There are flawed RCTs. Does that make these inappropriate techniques for CER?
    • Let’s consider the alternatives: Propensity scores do not address confounding from unobservables, so they can’t be a complete solution to all CER problems. RCTs are impractical in many cases. And even when they’re not, it’s a good idea to do some preliminary observational studies using methods most likely to credibly estimate causal effects. It seems to me we need IVs precisely because of these limitations with other techniques. (And we need the other techniques as well.)
    • The authors identified precisely how to do more credible IV studies. Shouldn’t we use that knowledge to do IVs that can better address selection bias instead of concluding we cannot?
    • Not every confounder the authors identified need be included in every IV study. Just because a factor might confound doesn’t mean it does confound. This is important because some factors aren’t readily available in research data (e.g., race, income, or education can be missing from the data—in many cases they can be included at an area level, however). To reassure ourselves that no important factors are among the unobservables, one can perform some falsification tests, which are very briefly mentioned by the authors. This validity check deserves far more attention, and is beyond the scope of this post.
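    Falsification tests really are a topic for another post, but a toy sketch conveys the flavor of one common version: in a subgroup whose treatment cannot plausibly be affected by the instrument, the instrument should have no association with the outcome. A clear association is a red flag that the instrument reaches the outcome through some other channel. (This is my own illustrative simulation, not the Garabedian et al. procedure.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50_000
z = rng.normal(size=n)                       # candidate instrument
never_candidates = rng.random(n) < 0.2       # subgroup that can't receive treatment

# Built-in violation for illustration: z is correlated with an omitted factor
# that affects the outcome on its own.
omitted = 0.5 * z + rng.normal(size=n)
y = 0.1 * omitted + rng.normal(size=n)

res = sm.OLS(y[never_candidates],
             sm.add_constant(z[never_candidates])).fit()
print(f"Coef on z in the falsification sample: {res.params[1]:+.3f} "
      f"(p = {res.pvalues[1]:.3g})")          # clearly nonzero: a red flag
```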

    The Garabedian paper is a tremendous contribution, an instant classic. I applaud the authors for their excellent and laborious work. Anybody who is serious about IV should read it and heed its advice … right up until the last sentence. It’s possible to do good IV. The paper shows how and then, oddly, walks away from it.

    @afrakt

  • Methods: A little bit about instrumental variables

    Though there are lots of sources to learn about instrumental variables (IV), in this post I’ll point to three papers I found particularly helpful.

    I’ve already written a tutorial post on IV, based on a paper by my colleague Steve Pizer. Two diagrams from that paper make clear that IV is a generalization of randomized controlled trials (RCTs). Conceptually, an RCT looks like this:

    [Diagram from Pizer’s paper: the conceptual structure of an RCT]

    Randomization (e.g., by the flip of a coin) ensures that the characteristics of patients in the treatment and comparison groups have equal expected values. The two groups are drawn from the same sample of recruits and the only factor that determines their group assignment is the coin flip, so, apart from the treatment itself, all other differences between the groups must by construction be random.
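    A quick simulated check makes the “equal expected values” point concrete (a sketch of my own, with made-up covariates): under a coin-flip assignment, standardized differences in pre-treatment characteristics between arms hover around zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
age = rng.normal(65, 10, n)
female = rng.binomial(1, 0.55, n)
comorbidity = rng.poisson(2.0, n)

coin = rng.binomial(1, 0.5, n).astype(bool)   # the randomizer

for name, x in [("age", age), ("female", female), ("comorbidity", comorbidity)]:
    std_diff = (x[coin].mean() - x[~coin].mean()) / x.std()
    print(f"{name:12s} standardized difference: {std_diff:+.3f}")  # all near zero
```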

    An IV study could look like the diagram below. Notice that if you ignore the patient and provider characteristics boxes on the left and the lines that emanate from them and interpret the institutional factors box at the bottom as a coin flip, this looks exactly like an RCT.

    [Diagram from Pizer’s paper: the conceptual structure of an observational (IV) study]

    [In an IV study, a] large number of observed and unobserved factors [could] influence sorting into treatment and comparison groups. Many of these factors are also independently associated with differences in the outcome. These relationships are illustrated by the solid arrows connecting observed and unobserved patient and provider characteristics to sorting and the dashed arrows connecting these same characteristics directly to the outcome. The arrows directly to the outcome are dashed because these relationships are not the ones of primary interest to the investigator; in fact, these are potentially confounding relationships that could make it difficult or impossible to accurately measure the effect of treatment.

    What makes an IV study an IV study is the analytical exploitation of some “institutional factors” (e.g., laws, programs) or other measurable features of the world—called instrumental variables—that affect sorting into treatment and control groups, at least somewhat, and are, arguably,* not correlated with any unobservable patient or provider factors that also affect outcomes. That’s kind of a mind-bender, but notice that an RCT’s coin flip has these properties: it’s a measurable feature of the world, affects sorting, and is not correlated with any unobservable patient or provider factors. Other things in the world can, arguably,* act like a coin flip, at least for some patients: program eligibility that varies geographically (like that for Medicaid), for example.

    The algebraic expression of the foregoing difference between RCTs and IV studies by Katherine Harris and Dahlia Remler may also be informative to you (if you’re not math-averse). They consider patient i’s health outcome, yi, given by

    [1]     yi = β(hi)di + g(hi) + εi

    where hi denotes unobservable health status; di is a dichotomous variable that takes the value one if the patient is treated and zero otherwise; β(hi) + g(hi) is the expected health outcome if the patient receives the treatment; g(hi) is the expected health outcome if the patient does not receive the treatment; and εi represents the effect of other unobserved factors unrelated to health status. The effect of treatment for each individual is the difference between health outcomes in the treated and untreated state, β(hi). If treatment effects are homogenous, then β(hi) = β for everyone. If treatment effects are heterogeneous, then β(hi) is different for [at least some patients].

    Next, the probability that patient i receives treatment can be written

    [2]     P(di=1) = f(hi) + zi

    where f(hi) represents health status characteristics that determine treatment assignment, and zi represents factors uncorrelated with health status that have a nontrivial impact on the probability of receiving treatment.

    A potential problem in this setup is that treatment and outcome depend on health status hi, which is unobservable. If unobservably sicker people are treated and are also more likely to have a bad outcome (because they are sicker), that will bias our judgment of the effect of treatment. The way out is to find or manufacture a zi that determines treatment assignment for at least some patients in a way that is uncorrelated with unobservable health hi (as well as uncorrelated with the other unobservable factors that affect the outcome, εi).
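    Here’s a minimal simulation of equations [1] and [2] with a homogeneous treatment effect (my own sketch, not Harris and Remler’s code). A naive comparison of treated and untreated patients is contaminated by the unobserved hi, while the IV (Wald) estimate, which uses only the variation induced by zi, recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
beta = 2.0                                   # true (homogeneous) treatment effect

h = rng.normal(size=n)                       # unobservable health status
z = rng.binomial(1, 0.5, n)                  # a coin-flip-like instrument
# Sicker patients (low h) are more likely to be treated; z also shifts treatment.
d = (0.4 * z - 0.8 * h + rng.normal(size=n) > 0).astype(float)
y = beta * d + 1.5 * h + rng.normal(size=n)  # g(h) taken to be linear in h

# Naive comparison: biased, because treated patients are unobservably sicker.
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald/IV estimate: reduced-form effect of z on y over the effect of z on d.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print(f"True effect: {beta:+.2f}")
print(f"Naive diff:  {naive:+.2f}   (contaminated by h)")
print(f"IV (Wald):   {wald:+.2f}")
```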

    In experimental settings, researchers strive to eliminate the effect of health status on the treatment assignment process shown in Equation 2 by randomly generating (perhaps in the form of a coin-flip) values of zi such that they are uncorrelated with health status and then assigning subjects to treatment and control groups on the basis of its value. [...]

    In some nonexperimental settings, it may be possible to identify one or more naturally occurring  zi, [IVs] that influence treatment status and are otherwise uncorrelated with health status. When this is the case, it is possible to estimate a parameter that represents the average effect of treatment among the subgroup of patients in the sample for whom the IV determines treatment assignment.

    Harris and Remler go on to discuss more fully (with diagrams!) the subgroup of patients to which an IV estimate (aka, the local average treatment effect or LATE) applies when treatment effects are heterogeneous. With Monte Carlo simulations, they show that LATE estimates can differ considerably from the average treatment effect (ATE) one would obtain if one could estimate it for the entire population. Their explanation is beautiful and well worth reading, but too long for this post.
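    In the same spirit as their simulations (though this is my own sketch, not theirs), here is a toy example in which the instrument only moves treatment for a “marginal” subgroup whose treatment effect is smaller than average. The IV/Wald estimate recovers that subgroup’s effect, the LATE, which sits well below the population ATE.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# Three latent groups: always-treated, never-treated, and "marginal" patients
# whose treatment is decided by the instrument (e.g., differential distance).
group = rng.choice(["always", "never", "marginal"], size=n, p=[0.3, 0.5, 0.2])
z = rng.binomial(1, 0.5, n)
d = np.where(group == "always", 1, np.where(group == "never", 0, z))

# Heterogeneous effects: marginal patients benefit less than always-treated.
effect = np.where(group == "always", 3.0, np.where(group == "marginal", 1.0, 2.0))
y = effect * d + rng.normal(size=n)

ate = effect.mean()                       # average effect over everyone (~2.1)
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print(f"ATE:            {ate:.2f}")
print(f"IV/Wald (LATE): {wald:.2f}")      # ~1.0, the effect among the marginal group
```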

    I’ll conclude with a shout out to one more worthwhile IV tutorial paper: the MDRC Working Paper “Using Instrumental Variables Analysis to Learn More from Social Policy Experiments,” by Lisa Gennetian, Johannes Bos, and Pamela Morris. As with the Pizer and Harris/Remler papers, it’s worth reading in full.

    * “Arguably” because one needs to provide an argument for the validity of an instrumental variable. This is a mix of art and science, well beyond the scope of this post. I will come back to this in the future.

  • Biased OLS vs. contaminated IV?

    If you’re in the observational study business, this, by Anirban Basu and Kwun Chan, looks potentially useful:

    In the outcomes research and comparative effectiveness research literature, there are strong cautionary tales on the use of instrumental variables (IVs) that may influence the newly initiated to shun this premier tool for causal inference without properly weighing their advantages. It has been recommended that IV methods should be avoided if the instrument is not econometrically perfect. The fact that IVs can produce better results than naïve regression, even in nonideal circumstances, remains underappreciated. In this paper, we propose a diagnostic criterion and related software that can be used by an applied researcher to determine the plausible superiority of IV over an ordinary least squares (OLS) estimator, which does not address the endogeneity of a covariate in question. Given a reasonable lower bound for the bias arising out of an OLS estimator, the researcher can use our proposed diagnostic tool to confirm whether the IV at hand can produce a better estimate (i.e., with lower mean square error) of the true effect parameter than the OLS, without knowing the true level of contamination in the IV.

    @afrakt

  • Bias and the Oregon Medicaid study

    There’s been some chatter about how the Oregon Medicaid study is or might be biased. That’s worth a post!

    There’s a precise way in which the study is not biased. By design it estimated the effect of Medicaid on those who won the lottery and enrolled, relative to those who lost the lottery and did not. This estimate is unbiased for the contrast between precisely these two groups, but not necessarily for others. In econometric jargon, this is known as the “local average treatment effect” (LATE). The “treatment effect” part of “LATE” is clear, but what’s this “local average” business?

    Sigh. I hate this terminology. It’s supposed to evoke the idea that the instrument (the lottery in this case) doesn’t have a “global” effect on study participants, causing all randomized to Medicaid (lottery winners) to be on and all those randomized to control (lottery losers) to not be. It has a more modest, “localized” effect. The other jargon used for this is that the LATE estimate is an estimate of the effect of treatment on “compliers.” That’s a more meaningful term to me. The compliers are those that do what randomization “tells” them to do, they enroll in Medicaid if randomized to do so and they don’t if not.

    Of course, you can’t expect full compliance in this study (or many other RCTs) because some lottery winners turned out to be ineligible for Medicaid by the time they were permitted to enroll. Some had too high income. Some moved out of state. Some may have found other sources of coverage. (You had to have income below 100% FPL, live in state, and have been uninsured for 6 months to be permitted to enroll.) Also, enrollment wasn’t mandatory. So, if you just decided it wasn’t worth the trouble or didn’t receive or notice the letter inviting enrollment, you might have missed the window (45 days is all they gave you).

    On the flip side, nobody was preventing lottery losers from enrolling on Medicaid if they became eligible in another way. The study pertained only to the expansion of Medicaid beyond the statutory requirements. If people ended up in one of the eligible categories (aged, blind, disabled, pregnant) they could get on Medicaid.

    So, there was considerable “crossover” (lottery losers enrolling in Medicaid, lottery winners not) or “contamination” or “noncompliance,” all jargon for the same thing. This was not a perfect RCT. Few are.

    What to do? The investigators did two things. First, they considered an “intent-to-treat” (ITT) approach, comparing lottery winners to losers no matter whether they enrolled in Medicaid or not. These results are in their first year paper. I’ve forgotten what they say specifically, though in general they’re much smaller effects than the LATE results. The concern with ITT is that all this crossover biases the results toward zero. There isn’t as much contrast between study arms due to noncompliance.

    Next, the investigators provided LATE estimates, about which I wrote above. These are unbiased for contrast among compliers. In this study, they’re about four times the size of the ITT estimates by virtue of the mathematics (“instrumental variables“) of LATE. But they need not be the same as one would find in the absence of noncompliance. There may be bias in that sense. Why?
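    The “mathematics of instrumental variables” here is just the Wald calculation: divide the ITT effect by the difference in Medicaid enrollment rates between lottery winners and losers (the first stage). A sketch with purely hypothetical numbers, chosen only so that the scaling comes out to about four, as in the study:

```python
# Hypothetical numbers, picked only to make the arithmetic come out to ~4x;
# they are not the actual Oregon first-stage estimates.
p_enroll_winners = 0.435   # share of lottery winners on Medicaid
p_enroll_losers  = 0.185   # share of lottery losers on Medicaid (other routes in)
itt_effect       = 0.010   # intent-to-treat effect on some outcome

first_stage = p_enroll_winners - p_enroll_losers   # 0.25 (compliance difference)
late = itt_effect / first_stage                    # Wald/IV estimate

print(f"First stage: {first_stage:.2f}")
print(f"LATE = ITT / first stage = {late:.3f}  (four times the ITT here)")
```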

    • Hypothesis 1: Those who took the trouble to enroll in Medicaid were sicker than those who didn’t. After all, why enroll if you don’t need it? Remember, even some lottery losers (18.5% of them) enrolled in Medicaid. The LATE estimate removes the effect of them since they are noncompliers. Also, some lottery winners didn’t enroll (most of them didn’t) and the LATE estimate removes their effect too. What’s left under this hypothesis is a comparison of relatively sicker people who did enroll in Medicaid with relatively healthier people who didn’t. The investigators actually found some evidence to suggest that Medicaid enrollees are sicker. Many other studies find that Medicaid enrollees are sicker to the point that some studies find an association of Medicaid with increased mortality. Under hypothesis 1, results are biased downward relative to what they would be under full compliance. Medicaid looks less effective than it might otherwise be. 
    • Hypothesis 2: Those who are more organized, better planners, with higher cognitive function and literacy (including health) skills enroll. It takes some awareness and planning to enroll, so there is some face validity to this argument. I’m aware of no evidence to support it though. (Got any?) Under this hypothesis Medicaid enrollees would do a better job of getting and staying healthy even apart from whatever Medicaid does for them. This would bias results toward showing a larger Medicaid effect than would be true in general (under full compliance).

    There may be other hypothetical sources of bias. The point I’d make about all of them is that we don’t know whether any of these biases actually exist and, if they do, how big an effect they have. It’s all speculation. Still, LATE is an unbiased (and causal) estimate of the effect of Medicaid on compliers. It does filter out some who want to be on Medicaid and can’t enroll (lost lottery, no other route) and filters out some who enroll but weren’t invited (lost lottery but became eligible another way). Some of these noncompliers could be unusually sick. Some noncompliers could be unusually organized and aware. LATE filters some of them out.

    Some might wonder about another type of estimate one could do, the effect of “treatment on the treated.” Here one just compares Medicaid enrollees to non-enrollees, ignoring the lottery draw. Unfortunately, this just exacerbates whatever bias might exist. There is no random assignment at play here. There’s no filtering for selection at all. You get an association, not a causal estimate. This is the problem with many studies of Medicaid and insurance. Randomness is key. The lottery should be exploited in some fashion (either ITT or LATE).

    Lastly, notice how complicated RCT interpretation is? Yes, it’s the gold standard, but it still has issues. Using an IV approach for a LATE estimate is, in my view, about the best you can do. But there may be bias when considering generalizing the findings outside the “local” effect of the instrument (lottery or random assignment). These concerns arise with any IV study. In this sense, IV and RCT are much closer cousins than one tends to think. Disparage one and you disparage the other.

    Not all that’s gold glitters, but it is still valuable.

    @afrakt

  • More instrumental variables studies of cancer treatment

    The study I wrote about earlier this week by Hadley et al. is just one of many to apply instrumental variables (IV) to analysis of cancer treatment (prostate in that case). Zeliadt and colleagues do so as well (also for prostate cancer) and cite several others. Both the Hadley and Zeliadt studies exploit practice pattern variation, specifically differences in prior year(s) rates of treatment across areas, to define IVs.

    For you to buy the results, you have to believe that lagged treatment rates strongly predict actual treatment (this can be shown) and, crucially, are not otherwise correlated with outcomes, controlling for observable factors (this mostly requires faith). I would not believe the IVs valid if there were clear, accepted standards about whether and what treatment is best. If that were so, then treatment rates could be correlated with quality, broadly defined. Higher quality care might be expected in areas that follow the accepted standard more closely. Better outcomes could be due to broadly better care, not just to the particular treatment choice.

    However, in prostate cancer, there is no standard about what treatment is best. I accept the IVs as valid in this case.

    Among the other cancer treatment IV studies I found, some of which Zeliadt cites, several also exploit practice pattern variations:

    • Yu-Lao et al.: Again, prostate cancer, and, notably, appearing in JAMA. Yes, JAMA published an IV study based on practice pattern variation. More on why I am excited about that below.
    • Brooks et al.: Breast cancer
    • Earle et al.: Lung cancer

    I cannot say whether practice patterns make for valid IVs for breast and lung cancer at the time the Brooks and Earle studies were published. I’d have to think about it, and I have not. I merely note that exploiting practice pattern variation for IV studies is not novel, though it is not widely accepted either, particularly in medical journals. I think it should be, though only for cases for which a good argument about validity can be made, as I believe it can be for prostate cancer and, I am sure, some other conditions.

    Of course I would prefer to see more randomized controlled trials (RCTs) on all the areas of medicine in need of additional evidence. But those areas are, collectively, a massive territory. We neither have the time nor have we demonstrated a willingness to spend the money required to conduct RCTs in all areas. We have to prioritize. For cases for which IV studies are likely to be reasonably valid, we ought to apply the technique, not necessarily instead of an RCT — though with resource constraints, such an argument could be made — but certainly in advance of one.

    IV studies are cheaper, faster, and offer other advantages. They don’t require enrollment of patients. They can exploit the large, secondary data sets already in existence (Medicare, Medicaid, VA, commercial payers, hospital systems, and the like). As such, they permit stratification by key patient demographics that RCTs are often underpowered to support. Even when an RCT is warranted, a good IV study done in advance can help to refine questions and guide hypotheses.

    Given the vast need for evidence that overwhelms our capacity to provide it via RCTs, there isn’t a good argument for not doing IV studies in cases for which they are justifiably valid. However, part of the package of scaling up an IV research agenda is publishing the findings in top journals — not just health economics journals, but also top medical journals like JAMA. This will require more clinical reviewers of manuscripts to gain comfort with the IV approach (start here). It will also require medical journals to solicit reviews by those who can vouch for instruments’ validity or point out when they are unlikely to be so.

    It’s hard and expensive to create purposeful randomness, as is required in an RCT. Yet, there is so much natural randomness around. We should be exploiting it. Good quasi-randomness is a terrible thing to waste.

    @afrakt

  • What drives choice of prostate cancer treatment?

    I thought I had blogged on this paper before, but I can’t find a prior post. So, here are some quotes and brief comments on Zeliadt, S. B., Ramsey, S. D., Penson, D. F., Hall, I. J., Ekwueme, D. U., Stroud, L., & Lee, J. W. (2006). Why do men choose one treatment over another? Cancer, 106(9), 1865-1874.

    In the largest study we reviewed, which involved 1000 patients, approximately 42% of patients defined an effective treatment as one that extended expected survival or delayed disease progression, whereas 45% indicated that effectiveness meant preservation of quality of life (QOL).5 This is in contrast to physicians, 90% of whom defined effectiveness as extending expected survival. In another study, fewer than 20% of patients ranked either “effect of treatment on length of life” or “chances of dying of cancer” as 1 of the 4 most important factors in making a decision.26 In 1 study of health state preferences, 2 of 5 men were unconditionally willing to risk side effects for any potential gain in life expectancy.64 These studies suggest that there is substantial variation in the significance that patients place on cancer eradication, and that treatment efficacy means more than “control” of the tumor for many patients.

    Concerns regarding cancer eradication appear to correlate directly with aggressiveness of therapy, with radical prostatectomy being the choice preferred by the majority of patients who focus on cancer control.

    So, concerns about cancer relate to treatment choice. To the extent that those attitudinal factors also relate to outcomes (e.g., through their relationship with care for other conditions), they are good candidates for unobserved factors that, in part, explain the difference between results of instrumental variables (or RCT) and other observational study techniques.

    Side effects like incontinence and impotence are frequently cited concerns, as reported in the paper. However,

    To our knowledge, there is limited information available regarding how men balance side effects in making their treatment decision. For example, although preservation of sexual function was rated as very important by 90% of men age younger than 60 years, and 79% of men age 75 years and older, in a separate question only 3% of these same men indicated that “having few side effects” was the most important consideration in initiating therapy.5 Fear of side effects was also stated by only 3% of men in a study in North Carolina, in which the majority of patients were black.8 Srirangam et al.45 reported that although 55% of spouses reported that side effects were important, only 6% indicated that side effects were deciding factors. One study comparing surgery and brachytherapy reported that 25% of patients chose between these 2 options based on the side effect profile.9 In addition, although Holmboe and Concato10 found that 49% of patients were concerned with incontinence and 38% were concerned with impotence, only 13% reported weighing the risks and benefits of treatment. These studies demonstrate the apparent disconnect between patients’ stated importance of side effects and the role that they actually play in reaching the final treatment decision.

    Ultimately, it’s what patients actually do, not what they say, that matters. Therefore, side effects may be less of a relevant factor in treatment decisions than is commonly believed. Put another way, that something is a concern doesn’t imply that it changes one’s decision. That does not take away from the fact that concerns are psychologically important.

    Less frequently discussed are concerns about other potential complications.

    Fear of surgical complications was emphasized by some men who selected watchful waiting.7 A different study found that complications due to surgery were of concern to 12% of patients when considering surgery.3 A belief that radiation is harmful rather than therapeutic was offered by some men who selected surgery.44 When considering radiation therapy, 21% of men indicated concern about skin burns.3 Long recovery times were cited by 17% of patients.10 For a small percentage of men, issues such as fear of surgery or radiation appeared to be the primary factor in their decision regarding treatment.

    One reason complications and side effects may play a relatively small role in treatment decisions is that physicians are playing a large role in influencing those decisions.

    The role of the physician recommendation has received considerable attention in prostate cancer decision making due to the widely recognized preferences held by each physician specialty. As might be expected, opinions regarding the optimal treatment for localized prostate cancer vary among urologists, radiation oncologists, oncologists, and general practitioners. Urologists nearly universally indicate that surgery is the optimal treatment strategy, and radiation oncologists similarly indicate that radiation therapy is optimal.78

    To the extent that treatment choice is driven by physicians and is otherwise unrelated to outcomes, it suggests an opportunity for a valid instrument. This is what Hadley, et al. exploited.

    The paper continues with an exploration of the role of family members, race, socioeconomic, and psychological factors in treatment choice. Some of these (e.g., family relationships, socioeconomic factors, psychological factors) are likely to be incompletely observed and are, therefore, additional possible reasons why instrumental variables and RCT results differ from naive, observational studies. The key is that they may also be related to outcomes. It’s not too hard to imagine they could be.

    @afrakt

  • Methods for comparative effectiveness of prostate cancer treatments

    A research notebook entry on an important paper follows. I’ve left out quite a bit that is more tutorial. So, the paper is more accessible than it may seem.

    All quotes from Hadley, J., Yabroff, K. R., Barrett, M. J., Penson, D. F., Saigal, C. S., & Potosky, A. L. (2010). Comparative effectiveness of prostate cancer treatments: evaluating statistical adjustments for confounding in observational data. Journal of the National Cancer Institute, 102(23), 1780-1793:

    We selected 14,302 early-stage prostate cancer patients who were aged 66–74 years and had been treated with radical prostatectomy or conservative management from linked Surveillance, Epidemiology, and End Results–Medicare data from January 1, 1995, through December 31, 2003. Eligibility criteria were similar to those from a clinical trial used to benchmark our analyses. Survival was measured through December 31, 2007, by use of Cox proportional hazards models. We compared results from the benchmark trial with results from models with observational data by use of traditional multivariable survival analysis, propensity score adjustment, and instrumental variable analysis.

    This is an important exercise. Here’s why:

    The randomized controlled trial is considered the most valid methodology for assessing treatments’ efficacy. However, randomized controlled trials are costly, time consuming, and frequently not feasible because of ethical constraints. Moreover, some randomized controlled trial results have limited generalizability because of differences between randomized controlled trial study populations, who may be screened for eligibility on the basis of age and comorbidities, and community populations, who are likely to be much more heterogeneous with regard to health conditions and socioeconomic characteristics.

    To this, add that RCTs are often under-powered for stratification by key patient characteristics. This is where observational studies shine. Of course, biased selection is the principal concern with observational studies.

    Patient selection into specific treatments is an important consideration in all observational studies, but particularly for those in prostate cancer, because incidence is highest in the elderly who are also most likely to have multiple comorbidities.

    Observational study techniques are not equivalent in their ability to address the selection problem.

    Observational studies (1,11–13) have previously used traditional regression and propensity score methods to evaluate associations between specific prostate cancer treatments with survival. In these studies, the propensity score methods did not completely balance (ie, equalize) important patient characteristics such as tumor grade, size, and comorbidities across treatment groups. Furthermore, patients who received active treatment had better survival for noncancer causes of death than patients who received conservative management, indicating that unobserved differences between groups affected both treatment choice and survival.

    Instrumental variable analysis is a statistical technique that uses an exogenous variable (or variables), referred to as an “instrument,” that is hypothesized to affect treatment choice but not to be related to the health outcome (14–17). Variations in treatment that result from variations in the value of the instrument are considered to be analogous to variations that result from randomization and so address both observed and unobserved confounding. Instrumental variable analysis has been used with observational data to investigate clinical treatment effects among patients with breast cancer (18–20), lung cancer (21), or prostate cancer (5,22).

    The study findings support the use of instrumental variables.

    Propensity score adjustments resulted in similar patient characteristics across treatment groups, and survival was similar to that of traditional multivariable survival analyses. The instrumental variable approach, which theoretically equalizes both observed and unobserved patient characteristics across treatment groups, differed from multivariable and propensity score results but were consistent with findings from a subset of elderly patients with early-stage disease in the randomized trial.

    The authors’ preferred instrument captures practice pattern variation.

    We constructed the primary instrumental variable for treatment received by use of a two-step process. First, we used the entire dataset (n = 17,815) to estimate the probability of receiving conservative management as a function of patients’ clinical characteristics (tumor stage and grade, NCI comorbidity index, and Medicare reimbursements for medical care in the previous year), demographics (age, race, ethnicity, and marital status), year of diagnosis, and all possible interactions among these variables. Second, we calculated the difference between the actual proportion of patients receiving conservative management and the average predicted probability of receiving conservative management (generated from the logistic regression model) in each hospital referral region by year. Areas with relatively large positive differences between the actual and predicted proportions of patients receiving conservative management favor a conservative management treatment pattern, and areas with large negative differences between the actual and predicted proportions of patients receiving conservative management favor a radical prostatectomy treatment pattern. We then lagged this measure of the local area treatment pattern by 1 year and linked it to each patient in the analysis to enhance the instrument’s independence from patients’ current health and unobserved characteristics.
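    For concreteness, here is a rough sketch of that two-step construction in code (mine, with made-up column names; the authors’ actual specification includes all possible interactions and other details omitted here):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: df has one row per patient, with 'cm' = 1 if conservative
# management, clinical/demographic covariates, 'hrr' = hospital referral
# region, and 'year' = diagnosis year. All names are hypothetical.

def lagged_practice_pattern_iv(df):
    # Step 1: predicted probability of conservative management from patient
    # characteristics (interaction terms omitted here for brevity).
    model = smf.logit("cm ~ stage + grade + comorbidity + age + race + married + C(year)",
                      data=df).fit(disp=0)
    df = df.assign(p_hat=model.predict(df))

    # Step 2: area-year difference between actual and predicted shares.
    area_year = (df.groupby(["hrr", "year"])
                   .agg(actual=("cm", "mean"), predicted=("p_hat", "mean"))
                   .reset_index())
    area_year["treatment_pattern"] = area_year["actual"] - area_year["predicted"]

    # Lag the area measure by one year and merge it back to each patient,
    # so the instrument reflects the prior year's local practice pattern.
    lagged = area_year.assign(year=area_year["year"] + 1)[["hrr", "year", "treatment_pattern"]]
    return df.merge(lagged, on=["hrr", "year"], how="left")
```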

    Statistical methods:

    Treatment propensity (ie, the predicted probability of receiving conservative management) for the propensity score analysis and for constructing the lagged area treatment pattern for the instrumental variable analysis was estimated with logistic regression. The survival models were estimated with Cox proportional hazard models. Visual inspection of the parallelism of the Kaplan–Meier plots of the logarithms of the estimated cumulative survival models by treatment supported the proportionality assumption. The instrumental variable version of the Cox hazard model was estimated with the two-stage residual inclusion method (38), which has been shown to be appropriate for nonlinear outcome models. [...]

    [The instrument's] independence of the survival outcomes was confirmed by its lack of statistical significance as an independent variable in an alternative version (data not shown) of the Cox survival models.
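    Two-stage residual inclusion is simpler than it sounds: model treatment in a first stage, then include the first-stage residual alongside treatment in the outcome (here, Cox) model. A minimal sketch, with made-up column names and not the authors’ code, assuming the lifelines package for the Cox fit:

```python
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter

# Assumed columns in df: 'cm' (1 = conservative management), 'iv' (the lagged
# practice-pattern instrument), covariates 'age' and 'comorbidity', survival
# time 'time', and event indicator 'death'. All names are hypothetical.

def two_stage_residual_inclusion(df):
    # Stage 1: model treatment as a function of the instrument and covariates;
    # keep the residual, which stands in for unobserved confounding.
    first = smf.logit("cm ~ iv + age + comorbidity", data=df).fit(disp=0)
    df = df.assign(first_stage_resid=df["cm"] - first.predict(df))

    # Stage 2: Cox model including the treatment AND the first-stage residual.
    cph = CoxPHFitter()
    cph.fit(df[["time", "death", "cm", "age", "comorbidity", "first_stage_resid"]],
            duration_col="time", event_col="death")
    return cph

# Usage: cph = two_stage_residual_inclusion(df); cph.print_summary()
```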

    One acknowledged limitation, among many, is that PSA values were not available to the researchers. Another is that

    a complete statistical assessment of the Cox hazard model’s proportionality assumption indicated that the effects of some covariates may not be time invariant, especially in the analysis of all-cause mortality. Although a sensitivity analysis of the effects of allowing time-varying covariates did not alter the principal findings with regard to treatment effects, further analysis of time-varying effects may be warranted.

    All in all, a very nice paper. It’s worth a full read by observational researchers.

    @afrakt

  • Good quasi-randomness is hard to find

    What now seems like ages ago (but was only in early April), Joe Doyle and colleagues published an NBER paper finding, in their words, that “higher-cost hospitals have significantly lower one-year mortality rates compared to lower-cost hospitals.” The paper has already been discussed in the blogosphere. See Sarah Kliff’s first and second posts, Thom Walsh, and David Dranove, for example.

    Given all that commentary, which you are perfectly capable of reading along with the abstract, there’s little point in me describing or critiquing the results. Instead I’ll focus on some aspects of the methods.

    With acknowledgement of their imperfections, randomized trials are considered the gold standard for causal inference. One of the fundamental problems with studying the spending-outcomes relationship, however, is that we can’t randomize individuals to spending levels or even to hospitals. Instead, we must rely on the data we observe. If we’re clever, we can find something that is almost like randomizing patients to hospitals (or EDs), though. In this regard, Doyle and colleagues were extremely clever.

    We consider two complementary identification strategies to exploit variation in ambulance transports. The first uses the fact that in areas served by multiple ambulance companies, the company dispatched to the patient is effectively random due to rotational assignment or even direct competition between simultaneously dispatched competitors. Moreover, we demonstrate that ambulance companies serving the same small geographic area have preferences in the hospital to which they take patients. These facts suggest that the ambulance company dispatched to emergency patients may serve as a random assignment mechanism across local hospitals.

    Our second strategy localizes the “natural randomization” approach adopted by the Dartmouth researchers by exploiting contiguous areas on opposite sides of ambulance service area boundaries in the state of New York. In New York, each state-certified Emergency Medical Service (EMS) provider is assigned to a territory via a certificate of need process where they are allowed to be “first due” for response. Other areas may be entered when that area’s local provider is busy. We obtained the territories for each EMS provider from the New York State Department of Emergency Medical Services, and we couple these data with unique hospital discharge data that identifies each patient’s exact residential address. This combination allows us to compare those living on either side of an ambulance service area boundary. To the extent that these neighbors are similar to one another, the boundary can generate exogenous variation in the hospitals to which these patients are transported.

    If this doesn’t fill you with admiration you’re probably not an economist. In that case, trust me, they have found an exceptionally good source of quasi-randomness in patient assignment.

    Not long after I read this, I noticed this bit in a post by Jordan Rao about a recent paper by Emily Carrier, Marisa Dowling, and Robert Berenson:

    The paper, published in Health Affairs, found hospitals “wooing” EMS workers that service well-off neighborhoods, even sprucing up the rooms where the workers rest and fill out paperwork.

    This is a new phenomenon and, therefore, doesn’t detract from my admiration for Doyle et al.’s work, which focused on the early-to-mid 2000s. I raise the issue of hospitals trying to attract EMS workers from more affluent areas to suggest that in the future, an approach like Doyle et al.’s may have to address this type of thing. To the extent certain hospitals preferentially choose patients (e.g., more affluent ones) by influencing EMS workers, it is possible ambulance transports do not serve as a random assignment of patients to hospitals. What if higher spending hospitals are also the ones that play this game, attracting a more wealthy set of patients? If that were the case, it is likely that there are other unobservable characteristics of those patients that are correlated with outcomes. That would be a source of bias.

    This raises a more general point about the ambulance transport approach. It only addresses demand-side selection. The patients are (quasi-) randomly assigned to hospitals, in a way (potentially) not correlated with hospital spending. But that does not mean there aren’t unobservable (non-random) aspects of hospitals that are correlated with spending and outcomes. The quasi-randomness of ambulance transport does not address this supply-side selection.

    Good quasi-randomness is hard to find. Doyle et al. found some. Still, it doesn’t address every source of bias, nor should anyone expect as much from any study, even randomized experiments.

    @afrakt

  • Overidentification tests

    Last week, in Inquiry, my latest paper with Steve Pizer and Roger Feldman was published. An ungated, working paper version is also available. Note also that I wrote a bit about a portion of it in a prior post, though even that does not describe what the paper is about.  I’ll write more about the results in the paper in another post. If you can’t wait, click through for the abstract. For now, I want to focus on another technical detail, which is likely to interest all of five readers. You know who you are from the title of the post.

    Until fairly recently, my colleagues and I thought overidentification tests of instruments were worth doing. We no longer feel that way. Still, in order to be published, we have little choice but to do them when a reviewer demands them, even though we don’t think they’re very valuable.

    Though these are typically discussed as tests of excludability, they are, in fact, joint tests of excludability and homogeneity of treatment effects (Angrist 2010). Consequently, instruments that are excludable may be rejected due to local average treatment effects.

    Passing overid tests may convince some reviewers that one’s instruments are excludable from the second stage model, but it shouldn’t. Failing to pass doesn’t prove they are not. This is a rather weak case for their scientific value. Many papers in top economics journals using IV methods do not include overid tests. That’s just fine.
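    For the handful of readers who want the mechanics, here is what the usual Sargan overidentification test looks like on simulated data (a sketch of the textbook procedure, not code from our paper): run 2SLS, regress the 2SLS residuals on all the instruments, and compare n times the R-squared to a chi-squared distribution with degrees of freedom equal to the number of instruments minus the number of endogenous regressors. Even a clean pass says nothing about excludability when treatment effects are heterogeneous, which is the point above.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 100_000
u = rng.normal(size=n)                        # unobserved confounder
z1, z2 = rng.normal(size=n), rng.normal(size=n)
d = 0.5 * z1 + 0.5 * z2 + u + rng.normal(size=n)   # endogenous regressor
y = 2.0 * d - u + rng.normal(size=n)          # both instruments valid here

Z = sm.add_constant(np.column_stack([z1, z2]))
d_hat = sm.OLS(d, Z).fit().fittedvalues       # first stage
beta = sm.OLS(y, sm.add_constant(d_hat)).fit().params   # 2SLS coefficients

# The test uses residuals built from the *actual* d, not d_hat.
resid = y - sm.add_constant(d) @ beta
aux = sm.OLS(resid, Z).fit()
sargan = n * aux.rsquared                     # ~ chi2(1) if the overid restriction holds
print(f"Sargan statistic: {sargan:.2f}, p = {stats.chi2.sf(sargan, df=1):.3f}")
```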

    “Angrist 2010” is a personal communication with Josh Angrist.

    @afrakt

     
