A new study suggests that a common medication for type 2 diabetes causes more harm than previously thought, including increasing avoidable hospitalizations and mortality, relative to an alternative medication. The study’s methods are as interesting as its findings.
Thinking that those unable to access the full study might want to know a bit more about it, lead author Julia Prentice sent me a presentation that includes some details. You’ll find that here.
The study about which I wrote below is ungated at this link until Feb. 28, 2015.
The study, published in Value in Health, compared two of the most common types of second-line diabetes medications, sulfonylureas (SUs) and thiazolidinediones (TZDs), in a sample of patients already on the standard first-line treatment, metformin. If the study’s instrumental variable (IV) assumptions are accepted (about which, more below), the authors found that, relative to TZDs, SUs cause a 68% increase in risk of avoidable hospitalization and a 50% increase in risk of death. Results for experiencing a heart attack or stroke were not statistically significant.
The study was based on an analysis of merged Veterans Health Administration (VA) and Medicare data for over 80,000 VA patients followed for up to ten years. Full disclosure: it was led by my colleague Julia Prentice, and I work closely with another co-author, Steve Pizer.
As discussed in the paper, as well as the accompanying press release, choosing a second line treatment for type 2 diabetes has become increasingly complex. Existing studies fail to provide all the information clinicians and patients need to make informed choices. (This was also discussed at the recent New England Comparative Effectiveness Public Advisory Council (CEPAC) meeting on the subject). Existing randomized trial findings are either based on too brief follow-up or are underpowered to yield statistically significant mortality findings, for example. The Prentice study is based on a sample about 20 times the size of prior type 2 diabetes medication RCTs, offering sufficient power to study mortality and other low-frequency outcomes.
The study is based on an IV analysis in which prescribing patterns are used as a source of random variation. Importantly, for this purpose, VA patients are assigned to primary care physicians at random. The instrument was, for each patient, the proportion of second line prescriptions (SUs or TZDs) written for SUs by his provider over the year prior to the date on which the patient initiated SU or TZD. Supporting this approach, prescribing pattern has been applied as an instrument in prior work.
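To make the construction of the instrument concrete, here is a minimal sketch of how a provider's prior-year SU share could be computed for one patient. The record layout, the provider names, and the function `su_share` are hypothetical, invented for illustration; details such as exactly how the one-year window is bounded are my assumptions, not taken from the paper.

```python
from datetime import date, timedelta

# Hypothetical prescription records: (provider_id, prescription_date, drug_class)
prescriptions = [
    ("dr_a", date(2005, 1, 10), "SU"),
    ("dr_a", date(2005, 3, 2), "TZD"),
    ("dr_a", date(2005, 6, 15), "SU"),
    ("dr_a", date(2003, 6, 15), "TZD"),  # falls outside the one-year window
    ("dr_b", date(2005, 2, 1), "TZD"),
]


def su_share(provider_id, initiation_date, records, window_days=365):
    """Share of a provider's SU/TZD prescriptions that were SUs,
    counted over the window strictly before the patient's initiation date."""
    start = initiation_date - timedelta(days=window_days)
    classes = [
        drug
        for pid, written, drug in records
        if pid == provider_id
        and start <= written < initiation_date
        and drug in ("SU", "TZD")
    ]
    if not classes:
        return None  # no prescribing history in the window, so no instrument value
    return classes.count("SU") / len(classes)


# Instrument value for a patient of dr_a who initiated SU or TZD on 2005-09-01
iv_a = su_share("dr_a", date(2005, 9, 1), prescriptions)
iv_b = su_share("dr_b", date(2005, 9, 1), prescriptions)
```

Because only prescriptions dated before the initiation date are counted, the patient's own prescription never enters his or her instrument value in this sketch.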
But is it a good instrument? The key question is whether it is biased by correlation with any unobservable factor that also affects outcomes.
Prentice et al. offer strong evidence that it is not, with several falsification tests. First, they stratify demographic, comorbidity, and provider quality data by above and below median prescribing rates, showing they are balanced. This is the analog to a table 1 in an RCT, which provides evidence of validity of randomization — a type of falsification test.
Next, the authors examined two populations that did not receive the treatment under study but potentially should be subject to the same omitted variable bias — if there is any — as the primary sample: (1) a population on metformin but not on a second line treatment and (2) a population that had initiated metformin and then insulin without any other drug. For neither population was their instrument related to outcomes, indicating that the instrument only affects outcomes through its effect on treatment with SU or TZD, i.e., it is not correlated with an omitted variable that affects outcomes.
What’s particularly nice about these two falsification test populations is that they bracket the study population in disease severity. Population (1) is somewhat healthier, having not moved on to a second line treatment. Population (2) is somewhat sicker (something the authors confirmed), having moved on to insulin. It’s a rare IV study that includes such a thorough and convincing validation.
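The logic of this kind of falsification test can be sketched with simulated data. In a population that never receives SU or TZD, a valid instrument should show no relationship with the outcome, while an instrument contaminated by unobserved severity will. Everything below (variable names, effect sizes, the simple regression of outcome on instrument) is my own illustrative assumption, not the authors' specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000


def reduced_form_slope(instrument, outcome):
    """OLS slope from regressing the outcome on the instrument (with an intercept)."""
    X = np.column_stack([np.ones_like(instrument), instrument])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]


# Unobserved disease severity (hypothetical)
severity = rng.normal(size=n)

# A valid instrument: provider SU share unrelated to patient severity
z_valid = rng.uniform(size=n)

# An invalid instrument: SU-heavy providers tend to see sicker patients
z_invalid = np.clip(0.5 + 0.2 * severity + rng.normal(scale=0.1, size=n), 0, 1)

# Falsification population: nobody gets SU or TZD, so the outcome
# depends only on severity and noise, never on the treatment.
outcome = 0.5 * severity + rng.normal(size=n)

slope_valid = reduced_form_slope(z_valid, outcome)      # should be near zero
slope_invalid = reduced_form_slope(z_invalid, outcome)  # should be clearly nonzero
```

A slope near zero in the untreated population is what one hopes to see. A clearly nonzero slope says the instrument is picking up something other than treatment.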
What the authors don’t say explicitly, but I will, is that these methods are generalizable to other comparative effectiveness observational studies. Provided certain conditions are met, practice pattern variation can be used as an instrument, though one should always validate it with falsification tests whenever it’s applied. The IV + falsification test pair strikes me as a powerful and useful tool, though not one applicable to all problems, to be sure.
I’ll conclude with two sets of questions for knowledgeable readers:
Are the clinical findings from this study convincing? Should they influence clinical practice? Is an RCT required, if one of sufficient size and duration could be accomplished? Is that remotely feasible?
Do the methods applied in this study offer a scalable approach to the big data causal inference problem? That is, could more analysts be trained in the application of IV and falsification tests? If so, what’s the next step? If not, what’s the reliable, alternative approach in situations in which there is reasonable concern of omitted variable bias?
Comments are open for one week for responses to these two sets of questions.
I found insightful Angus Deaton’s emperor-has-no-clothes argument that the RCT is not deserving of its “gold standard” reputation, despite rhetoric to the contrary. I speculate RCTs have achieved their special status for several reasons:
They are relatively conceptually simple, requiring less mathematical and statistical training than is required of many other methods. (Though, the basic explanation of them hides a lot of complexity, which leads to improper use and interpretation, as Deaton shows.)
RCTs address the problem of confounding from unobservables (though this fact is not unique to RCTs), which, historically, has been a major impediment to causal inference in social sciences and in the advancement of medicine. (As Deaton explains, such confounding is not the only problem confronting empirical methods, and RCTs do not necessarily address the others better than nonexperimental methods.)
RCTs lend themselves to a routinized enterprise of evidence-based change (e.g., in medicine) in a way that other strong methods for causal inference do not (or not yet). Equivalently simple approaches that could be easily routinized offer far weaker support for causal inference. It is plausible to me that promotion of RCTs as the methodologically strongest approach to causality has spared us from many more studies of associations that can’t come even close to RCTs’ validity for causal inference, imperfect though it may be. It’s possible association-type studies could do a lot of damage to human welfare. (Evidence-based, pre-RCT medicine was pretty sketchy, for example.) This, perhaps, is the strongest moral justification for claiming that RCTs are “the gold standard,” even if they do not merit that unique standing: a world in which that is less widely believed could be much worse.
Perhaps because of the foregoing features of RCTs, they have been adopted as the method of choice by high-powered professionals and educators in medical science (among other areas). When one is taught and then repeats that RCTs are “the gold standard” and one is a member of a highly respected class, that view carries disproportionate weight, even if there is a very good argument that it is not necessarily the correct view (e.g., Deaton’s, among others). Another way to say this is that the goldenness of RCTs’ hue should be judged on the merits of each application; we should be careful not to attribute to RCTs a goldenness present in the tint of glasses we’ve been instructed to wear.
Let me be clear, Deaton is not claiming (nor am I) that some other method is better than RCTs. He is simply saying that there does not exist one method (RCTs, say) that deserves preferential status, superior to all others for all subjects and all questions about them. I agree: there is no gold standard.
At the same time, applying some standards in judging methodology is necessary. How this ought to be done varies by context. Official bodies charged with guarding the safety of patients (e.g., the FDA or the USPSTF) are probably best served with some fairly hard-and-fast rules about how to judge evidence. Too much room for judgement can also leave too much room for well-financed charlatans to sneak some snake oil through the gate.
Academics and the merit review boards that judge their research proposals or the referees that comment on their manuscripts have more leeway. My view in this context is that a lot rides on the precise question one is interested in, the theoretical or conceptual model one (or the community of scholars) thinks applies to it, and the data available to address it, among other possible constraints. This is not a set-up for a clean grading system; there’s no substitute for expertise and opinions will vary. These are major limitations of the acceptance that there is no hierarchy to quality of methodology, in general.
Below are my highlights from Deaton’s paper, with my emphasis added. Each bullet is a direct quote.
[Analysts] go immediately to the choice of instrument, over which a great deal of imagination and ingenuity is often exercised. Such ingenuity is often needed because it is difficult simultaneously to satisfy both of the standard criteria required for an instrument, that it be correlated with [treatment] and uncorrelated with [unobservables affecting outcomes]. […] Without explicit prior consideration of the effect of the instrument choice on the parameter being estimated, such a procedure is effectively the opposite of standard statistical practice in which a parameter of interest is defined first, followed by an estimator that delivers that parameter. Instead, we have a procedure in which the choice of the instrument, which is guided by criteria designed for a situation in which there is no heterogeneity, is implicitly allowed to determine the parameter of interest. This goes beyond the old story of looking for an object where the light is strong enough to see; rather, we have at least some control over the light but choose to let it fall where it may and then proclaim that whatever it illuminates is what we were looking for all along.
Angrist and Jörn-Steffen Pischke (2010) have recently claimed that the explosion of instrumental variables methods has led to greater “credibility” in applied econometrics. I am not entirely certain what credibility means, but it is surely undermined if the parameter being estimated is not what we want to know.
Passing an overidentification test does not validate instrumentation. [Here’s why.]
The value of econometric methods cannot and should not be assessed by how closely they approximate randomized controlled trials. […] Randomized controlled trials can have no special priority. Randomization is not a gold standard because “there is no gold standard.” Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as “hard” while other methods are “soft.” These rhetorical devices are just that; metaphor is not argument, nor does endless repetition make it so.
One immediate consequence of this derivation is a fact that is often quoted by critics of RCTs, but often ignored by practitioners, at least in economics: RCTs are informative about the mean of the treatment effects but do not identify other features of the distribution. For example, the median of the difference is not the difference in medians, so an RCT is not, by itself, informative about the median treatment effect, something that could be of as much interest to policymakers as the mean treatment effect. It might also be useful to know the fraction of the population for which the treatment effect is positive, which once again is not identified from a trial. Put differently, the trial might reveal an average positive effect although nearly all of the population is hurt with a few receiving very large benefits, a situation that cannot be revealed by the RCT.
How well do actual RCTs approximate the ideal? Are the assumptions generally met in practice? Is the narrowness of scope a price that brings real benefits or is the superiority of RCTs largely rhetorical? RCTs allow the investigator to induce variation that might not arise nonexperimentally, and this variation can reveal responses that could never have been found otherwise. Are these responses the relevant ones? As always, there is no substitute for examining each study in detail, and there is certainly nothing in the RCT methodology itself that grants immunity from problems of implementation.
In effect, the selection or omitted variable bias that is a potential problem in nonexperimental studies comes back in a different form and, without an analysis of the two biases, it is impossible to conclude which estimate is better—a biased nonexperimental analysis might do better than a randomized controlled trial if enrollment into the trial is nonrepresentative.
Running RCTs to find out whether a project works is often defended on the grounds that the experimental project is like the policy that it might support. But the “like” is typically argued by an appeal to similar circumstances, or a similar environment, arguments that depend entirely on observable variables. Yet controlling for observables is the key to the matching estimators that are one of the main competitors for RCTs and that are typically rejected by the advocates of RCTs on the grounds that RCTs control not only for the things that we observe but things that we cannot. As Cartwright notes, the validity of evidence-based policy depends on the weakest link in the chain of argument and evidence, so that by the time we seek to use the experimental results, the advantage of RCTs over matching or other econometric methods has evaporated. In the end, there is no substitute for careful evaluation of the chain of evidence and reasoning by people who have the experience and expertise in the field. The demand that experiments be theory-driven is, of course, no guarantee of success, though the lack of it is close to a guarantee of failure.
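Deaton's point above, that an RCT identifies the mean treatment effect but not the median effect or the share of people helped, is easy to see with a toy example. The five "patients" and their potential outcomes below are made up for illustration. In a real trial one never observes both potential outcomes for the same person, which is exactly why the person-level quantities are not identified.

```python
import numpy as np

# Hypothetical potential outcomes for five people
y0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # outcome if untreated
effect = np.array([11.0, -1.0, -1.0, -1.0, -1.0])  # individual treatment effects
y1 = y0 + effect                                 # outcome if treated

# What an (idealized) RCT recovers: the difference in means equals the mean effect
mean_effect = effect.mean()
diff_in_means = y1.mean() - y0.mean()

# What it does not recover: the median effect is not the difference in medians
median_effect = np.median(effect)
diff_in_medians = np.median(y1) - np.median(y0)

# Nor does it reveal the share of people who benefit
share_helped = (effect > 0).mean()
```

Here the average effect is positive even though four of five people are made worse off, and the difference in medians (zero) says nothing about the median individual effect (negative).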
The paper is very readable, though I skipped (or lightly skimmed) a middle section that did not appear to have a high density of general advice, if any. There’s some math, but it’s simple and, in some places, important for understanding key points, including a few I quoted above. Only in one or two spots did I find the words insufficient to understand the meaning. Perhaps they were a bit too efficient. Find the paper, ungated, here.
In a post last week—which you should read if you don’t know what an “instrumental variable” (IV) is—I described the key assumption for a good IV: it’s uncorrelated with any unobservable factor that affects the outcome. “Unobservable” is a term of art. Practically speaking, it actually means anything not controlled for in the system of equations used to produce the IV estimate.
Example: Consider a study of the effect of cardiac catheterization in heart attack patients on mortality in which the investigators employ the following IV: the difference in distance between (a) each patient’s residence and the nearest hospital that offers cardiac catheterization and (b) each patient’s residence and the nearest hospital of any type. (See McClellan, McNeil, and Newhouse, for example.) Is this differential distance a good instrument? Without controlling for other factors, probably not.
It’s no doubt true that the closer patients live to a hospital offering catheterization relative to the distance to any hospital, the more likely they are to be catheterized. But hospitals that offer catheterization may exist in areas that are disproportionately populated by patients of a certain type (e.g., more urban areas where patients have a different racial mix than elsewhere) and, in particular, of a type that experiences different outcomes than others. To the extent the instrument is correlated with observable patient factors (like race) that affect outcomes, the IV can be saved. One just controls for those factors by including them as regressors in the IV model.
The trouble is, researchers don’t always include the observable factors they should in their IV specifications. For instance, in the above example if race is correlated with differential distance and mortality, leaving it out will bias the estimate of the effect of catheterization on mortality. This is akin to an RCT in which treatment/control assignment is partially determined by a relevant health factor that isn’t controlled for. Bad, bad, bad. (But, this can happen, even in an RCT designed and run by good people.)
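A small simulation may help show why leaving out an observable confounder matters. Below, a hypothetical observable factor `x` shifts both the instrument and the outcome. Two-stage least squares (2SLS) without `x` is badly off, while including `x` as a regressor recovers the true effect. The data-generating process and the helper `two_sls` are illustrative assumptions of mine, not drawn from any of the studies discussed.

```python
import numpy as np


def two_sls(y, d, z, controls=None):
    """Two-stage least squares estimate of the coefficient on treatment d,
    using instrument z and optional exogenous controls."""
    n = len(y)
    exog = [np.ones(n)] if controls is None else [np.ones(n), controls]
    W = np.column_stack(exog)           # intercept plus any controls
    Z = np.column_stack([W, z])         # first-stage regressors
    X = np.column_stack([W, d])         # second-stage regressors
    # First stage: project the second-stage regressors on the instrument set
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress the outcome on the projected regressors
    coef = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    return coef[-1]                     # coefficient on d


rng = np.random.default_rng(42)
n = 100_000
true_effect = 1.0

x = rng.integers(0, 2, size=n).astype(float)  # observable factor (hypothetical)
u = rng.normal(size=n)                        # unobservable affecting treatment and outcome
z = 1.0 * x + rng.normal(size=n)              # instrument, correlated with x
d = (z + u + rng.normal(size=n) > 0.5).astype(float)      # treatment received
y = true_effect * d + 2.0 * x + u + rng.normal(size=n)    # outcome

iv_without_x = two_sls(y, d, z)               # omits the observable confounder
iv_with_x = two_sls(y, d, z, controls=x)      # controls for it
```

In this setup the instrument is fine once `x` is held fixed, so the fix is as simple as the text says: include the factor as a regressor.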
In a recent Annals of Internal Medicine paper, Garabedian and colleagues found substantial problems of this type in a large sample of IV studies. They examined 65 IV studies published before 2012 with mortality as an outcome and that used any of the four most common instrument types: facility distance (e.g., differential distance), or practice patterns at the regional, facility, or physician level. Then they scoured the literature to find evidence of confounding for these instruments and categorized the observable factors that, if not controlled for, could invalidate them. Finally, they tabulated the proportion of studies that controlled for various potential confounders. That table is below.
Some confounders, like patient comorbidities, are often included as controls. Others, like education or procedure volume, are much less commonly included. This is worrisome. Here are the authors’ final two sentences:
 Any instrumental variable analysis that does not control for likely instrument–outcome confounders should be interpreted with caution.
This is excellent advice. I agree.
 Although no observational method can completely eliminate confounding, we recommend against treating instrumental variable analysis as a solution to the inherent biases in observational CER [comparative effectiveness research] studies.
My emphasis is added, because this is devastating and, I think, inconsistent with the paper’s contribution. I cannot agree with it. Here’s why:
It’s one thing to find that many specific IV studies are poorly done, but it’s quite another to suggest IV, in general, should not play a role in addressing bias in observational CER. Indeed, there are many poor studies using any technique. There are bad propensity score studies. There are flawed RCTs. Does that make these inappropriate techniques for CER?
Let’s consider the alternatives: Propensity scores do not address confounding from unobservables, so they can’t be a complete solution to all CER problems. RCTs are impractical in many cases. And even when they’re not, it’s a good idea to do some preliminary observational studies using methods most likely to credibly estimate causal effects. It seems to me we need IVs precisely because of these limitations with other techniques. (And we need the other techniques as well.)
The authors identified precisely how to do more credible IV studies. Shouldn’t we use that knowledge to do IVs that can better address selection bias instead of concluding we cannot?
Not every confounder the authors identified need be included in every IV study. Just because a factor might confound doesn’t mean it does confound. This is important because some factors aren’t readily available in research data (e.g., race, income, or education can be missing from the data—in many cases they can be included at an area level, however). To reassure ourselves that no important factors are among the unobservables, one can perform some falsification tests, which are very briefly mentioned by the authors. This validity check deserves far more attention, and is beyond the scope of this post.
The Garabedian paper is a tremendous contribution, an instant classic. I applaud the authors for their excellent and laborious work. Anybody who is serious about IV should read it and heed its advice … right up until the last sentence. It’s possible to do good IV. The paper shows how and then, oddly, walks away from it.
Though there are lots of sources to learn about instrumental variables (IV), in this post I’ll point to three papers I found particularly helpful.
I’ve already written a tutorial post on IV, based on a paper by my colleague Steve Pizer. Two diagrams from that paper make clear that IV is a generalization of randomized controlled trials (RCTs). Conceptually, an RCT looks like this:
Randomization (e.g., by the flip of a coin) ensures that the characteristics of patients in the treatment and comparison groups have equal expected values. The two groups are drawn from the same sample of recruits and the only factor that determines their group assignment is the coin flip, so, apart from the treatment itself, all other differences between the groups must by construction be random.
An IV study could look like the diagram below. Notice that if you ignore the patient and provider characteristics boxes on the left and the lines that emanate from them and interpret the institutional factors box at the bottom as a coin flip, this looks exactly like an RCT.
[In an IV study, a] large number of observed and unobserved factors [could] influence sorting into treatment and comparison groups. Many of these factors are also independently associated with differences in the outcome. These relationships are illustrated by the solid arrows connecting observed and unobserved patient and provider characteristics to sorting and the dashed arrows connecting these same characteristics directly to the outcome. The arrows directly to the outcome are dashed because these relationships are not the ones of primary interest to the investigator; in fact, these are potentially confounding relationships that could make it difficult or impossible to accurately measure the effect of treatment.
What makes an IV study an IV study is the analytical exploitation of some “institutional factors” (e.g., laws, programs) or other measurable features of the world—called instrumental variables—that affect sorting into treatment and control groups, at least somewhat, and are, arguably,* not correlated with any unobservable patient or provider factors that also affect outcomes. That’s kind of a mind-bender, but notice that an RCT’s coin flip has these properties: it’s a measurable feature of the world, affects sorting, and is not correlated with any unobservable patient or provider factors. Other things in the world can, arguably,* act like a coin flip, at least for some patients: program eligibility that varies geographically (like that for Medicaid), for example.
The algebraic expression of the foregoing difference between RCTs and IV studies by Katherine Harris and Dahlia Remler may also be informative to you (if you’re not math-averse). They consider patient i’s health outcome, y_i, given by
 y_i = β(h_i) d_i + g(h_i) + ε_i
where h_i denotes unobservable health status; d_i is a dichotomous variable that takes the value one if the patient is treated and zero otherwise; β(h_i) + g(h_i) is the expected health outcome if the patient receives the treatment; g(h_i) is the expected health outcome if the patient does not receive the treatment; and ε_i represents the effect of other unobserved factors unrelated to health status. The effect of treatment for each individual is the difference between health outcomes in the treated and untreated state, β(h_i). If treatment effects are homogenous, then β(h_i) = β for everyone. If treatment effects are heterogeneous, then β(h_i) is different for [at least some patients].
Next, the probability that patient i receives treatment can be written
 P(d_i = 1) = f(h_i) + z_i
where f(h_i) represents health status characteristics that determine treatment assignment, and z_i represents factors uncorrelated with health status that have a nontrivial impact on the probability of receiving treatment.
A potential problem in this setup is that treatment and outcome depend on health status h_i, which is unobservable. If unobservably sicker people are treated and are also more likely to have a bad outcome (because they are sicker), that will bias our judgment of the effect of treatment. The way out is to find or manufacture a z_i that determines treatment assignment for at least some patients in a way that is uncorrelated with unobservable health h_i (as well as uncorrelated with other unobservable factors that affect outcomes, ε_i).
In experimental settings, researchers strive to eliminate the effect of health status on the treatment assignment process shown in Equation 2 by randomly generating (perhaps in the form of a coin-flip) values of z_i such that they are uncorrelated with health status and then assigning subjects to treatment and control groups on the basis of its value. […]
In some nonexperimental settings, it may be possible to identify one or more naturally occurring z_i, [IVs] that influence treatment status and are otherwise uncorrelated with health status. When this is the case, it is possible to estimate a parameter that represents the average effect of treatment among the subgroup of patients in the sample for whom the IV determines treatment assignment.
Harris and Remler go on to discuss more fully (with diagrams!) the subgroup of patients to which an IV estimate (aka, the local average treatment effect or LATE) applies when treatment effects are heterogeneous. With Monte Carlo simulations, they show that LATE estimates can differ considerably from the average treatment effect (ATE) one would obtain if one could estimate it for the entire population. Their explanation is beautiful and well worth reading, but too long for this post.
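For readers who like to see such things in motion, here is a rough simulation in the spirit of the model above. It is my own sketch, not Harris and Remler's Monte Carlo design: the functional forms for β(h), g(h), and f(h) are invented, and I cap the treatment probability at one, which is what makes the instrument move sicker and healthier patients differently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

h = rng.uniform(size=n)                    # unobserved health status (higher = sicker)
z = rng.integers(0, 2, size=n) * 0.5       # instrument: 0 or 0.5, independent of h
v = rng.uniform(size=n)                    # idiosyncratic draw deciding treatment

# P(d = 1) = f(h) + z with f(h) = h, capped at 1 (an assumption of this sketch)
d = (v < np.minimum(1.0, h + z)).astype(float)

beta = 2.0 * h                             # heterogeneous effect: sicker patients gain more
y = beta * d - 3.0 * h + rng.normal(size=n)  # g(h) = -3h: sicker patients do worse at baseline

# Average treatment effect over the whole population (knowable only in a simulation)
ate = beta.mean()

# IV (Wald) estimate: outcome contrast by instrument over treatment contrast by instrument
high_z = z > 0
wald = (y[high_z].mean() - y[~high_z].mean()) / (d[high_z].mean() - d[~high_z].mean())

# Naive treated-vs-untreated comparison, confounded by h
naive = y[d == 1].mean() - y[d == 0].mean()
```

In this setup the very sickest patients are treated regardless of the instrument, so the patients whose treatment the instrument changes are healthier, and benefit less, than the population as a whole. The IV estimate therefore sits below the ATE while still being a valid causal estimate for that subgroup. The naive comparison is off for a different reason: treated patients are sicker to begin with.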
* “Arguably” because one needs to provide an argument for the validity of an instrumental variable. This is a mix of art and science, well beyond the scope of this post. I will come back to this in the future.
In the outcomes research and comparative effectiveness research literature, there are strong cautionary tales on the use of instrumental variables (IVs) that may influence the newly initiated to shun this premier tool for causal inference without properly weighing their advantages. It has been recommended that IV methods should be avoided if the instrument is not econometrically perfect. The fact that IVs can produce better results than naïve regression, even in nonideal circumstances, remains underappreciated. In this paper, we propose a diagnostic criterion and related software that can be used by an applied researcher to determine the plausible superiority of IV over an ordinary least squares (OLS) estimator, which does not address the endogeneity of a covariate in question. Given a reasonable lower bound for the bias arising out of an OLS estimator, the researcher can use our proposed diagnostic tool to confirm whether the IV at hand can produce a better estimate (i.e., with lower mean square error) of the true effect parameter than the OLS, without knowing the true level of contamination in the IV.
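The general idea that an imperfect instrument can still beat OLS on mean square error can be illustrated with a short Monte Carlo. To be clear, this is not the diagnostic criterion or software the abstract describes. It is a sketch with made-up parameters in which the instrument is slightly contaminated by the same unobserved confounder that biases OLS.

```python
import numpy as np

rng = np.random.default_rng(7)
beta_true, n, reps = 1.0, 500, 500
ols_est = np.empty(reps)
iv_est = np.empty(reps)

for r in range(reps):
    u = rng.normal(size=n)                  # unobserved confounder
    z = 0.05 * u + rng.normal(size=n)       # instrument with mild contamination (hypothetical)
    d = z + u + rng.normal(size=n)          # endogenous treatment
    y = beta_true * d + u + rng.normal(size=n)

    # Center the variables so the slopes below include an implicit intercept
    dc, zc, yc = d - d.mean(), z - z.mean(), y - y.mean()
    ols_est[r] = (dc @ yc) / (dc @ dc)      # OLS slope of y on d
    iv_est[r] = (zc @ yc) / (zc @ dc)       # simple IV slope using z

mse_ols = np.mean((ols_est - beta_true) ** 2)
mse_iv = np.mean((iv_est - beta_true) ** 2)
```

With these particular numbers the IV estimator is noisier than OLS but far less biased, so its mean square error comes out lower. With a weaker or more contaminated instrument the comparison could go the other way, which is presumably why a diagnostic is useful.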
There’s been some chatter about how the Oregon Medicaid study is or might be biased. That’s worth a post!
There’s a precise way in which the study is not biased. By design it estimated the effect of Medicaid on those who won the lottery and enrolled, relative to those who lost the lottery and did not. This estimate is unbiased for the contrast between precisely these two groups, but not necessarily for others. In econometric jargon, this is known as the “local average treatment effect” (LATE). The “treatment effect” part of “LATE” is clear, but what’s this “local average” business?
Sigh. I hate this terminology. It’s supposed to evoke the idea that the instrument (the lottery in this case) doesn’t have a “global” effect on study participants, causing all those randomized to Medicaid (lottery winners) to be on it and all those randomized to control (lottery losers) to not be. It has a more modest, “localized” effect. The other jargon used for this is that the LATE estimate is an estimate of the effect of treatment on “compliers.” That’s a more meaningful term to me. The compliers are those who do what randomization “tells” them to do: they enroll in Medicaid if randomized to do so and they don’t if not.
Of course, you can’t expect full compliance in this study (or many other RCTs) because some lottery winners turned out to be ineligible for Medicaid by the time they were permitted to enroll. Some had incomes that were too high. Some moved out of state. Some may have found other sources of coverage. (You had to have income below 100% FPL, live in state, and be uninsured for 6 months to be permitted to enroll.) Also, enrollment wasn’t mandatory. So, if you just decided it wasn’t worth the trouble or didn’t receive or notice the letter inviting enrollment, you might have missed the window (45 days is all they gave you).
On the flip side, nobody was preventing lottery losers from enrolling in Medicaid if they became eligible in another way. The study pertained only to the expansion of Medicaid beyond the statutory requirements. If people ended up in one of the eligible categories (aged, blind, disabled, pregnant) they could get on Medicaid.
So, there was considerable “crossover” (lottery losers enrolling in Medicaid, lottery winners not) or “contamination” or “noncompliance,” all jargon for the same thing. This was not a perfect RCT. Few are.
What to do? The investigators did two things. First, they considered an “intent-to-treat” (ITT) approach, comparing lottery winners to losers no matter whether they enrolled in Medicaid or not. These results are in their first year paper. I’ve forgotten what they say specifically, though in general they’re much smaller effects than the LATE results. The concern with ITT is that all this crossover biases the results toward zero. There isn’t as much contrast between study arms due to noncompliance.
Next, the investigators provided LATE estimates, about which I wrote above. These are unbiased for the contrast among compliers. In this study, they’re about four times the size of the ITT estimates by virtue of the mathematics (“instrumental variables”) of LATE. But they need not be the same as one would find in the absence of noncompliance. There may be bias in that sense. Why?
Hypothesis 1: Those who took the trouble to enroll in Medicaid were sicker than those who didn’t. After all, why enroll if you don’t need it? Remember, even some lottery losers (18.5% of them) enrolled in Medicaid. The LATE estimate removes the effect of them since they are noncompliers. Also, some lottery winners didn’t enroll (most of them didn’t) and the LATE estimate removes their effect too. What’s left under this hypothesis is a comparison of relatively sicker people who did enroll in Medicaid with relatively healthier people who didn’t. The investigators actually found some evidence to suggest that Medicaid enrollees are sicker. Many other studies find that Medicaid enrollees are sicker to the point that some studies find an association of Medicaid with increased mortality. Under hypothesis 1, results are biased downward relative to what they would be under full compliance. Medicaid looks less effective than it might otherwise be.
Hypothesis 2: Those who are more organized, better planners, with higher cognitive function and literacy (including health) skills enroll. It takes some awareness and planning to enroll, so there is some face validity to this argument. I’m aware of no evidence to support it though. (Got any?) Under this hypothesis Medicaid enrollees would do a better job of getting and staying healthy even apart from whatever Medicaid does for them. This would bias results toward showing a larger Medicaid effect than would be true in general (under full compliance).
There may be other hypothetical sources of bias. The point I’d make about all of them is that we don’t know whether any of these biases actually exist and, if they do, how big an effect they have. It’s all speculation. Still, LATE is an unbiased (and causal) estimate of the effect of Medicaid on compliers. It does filter out some who want to be on Medicaid and can’t enroll (lost lottery, no other route) and filters out some who enroll but weren’t invited (lost lottery but became eligible another way). Some of these noncompliers could be unusually sick. Some noncompliers could be unusually organized and aware. LATE filters some of them out.
Some might wonder about another type of estimate one could do, the effect of “treatment on the treated.” Here one just compares Medicaid enrollees to non-enrollees, ignoring the lottery draw. Unfortunately, this just exacerbates whatever bias might exist. There is no random assignment at play here. There’s no filtering for selection at all. You get an association, not a causal estimate. This is the problem with many studies of Medicaid and insurance. Randomness is key. The lottery should be exploited in some fashion (either ITT or LATE).
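To make the contrast among the three estimates concrete, here is a minimal simulation sketch. Everything in it is made up for illustration (the function names, the share of compliers, the size of the "true" effect, and the assumption that sicker people are more likely to enroll regardless of the lottery, as in Hypothesis 1); none of it comes from the Oregon study's data.

```python
import random


def simulate_lottery(n=200_000, true_effect=1.0, seed=0):
    """Return (won, enrolled, outcome) tuples for a hypothetical lottery with noncompliance."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        won = rng.random() < 0.5           # random lottery draw
        sick = rng.random() < 0.3          # health status, unobserved by the analyst
        p_always = 0.30 if sick else 0.10  # assumption: sicker people more often enroll anyway
        u = rng.random()
        if u < p_always:
            enrolled = True                # "always-taker": enrolls win or lose
        elif u < p_always + 0.25:
            enrolled = won                 # "complier": enrolls only if the lottery is won
        else:
            enrolled = False               # "never-taker": does not enroll either way
        # Outcome is a health score: lower baseline if sick, plus the coverage effect.
        outcome = (-2.0 if sick else 0.0) + true_effect * enrolled + rng.gauss(0, 1)
        rows.append((won, enrolled, outcome))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def estimates(rows):
    # ITT: winners vs. losers, ignoring who actually enrolled.
    itt = mean(y for w, _, y in rows if w) - mean(y for w, _, y in rows if not w)
    # First stage: how much winning the lottery raises enrollment.
    first_stage = mean(e for w, e, _ in rows if w) - mean(e for w, e, _ in rows if not w)
    # Naive comparison: enrollees vs. non-enrollees, ignoring the lottery.
    naive = mean(y for _, e, y in rows if e) - mean(y for _, e, y in rows if not e)
    return {"itt": itt, "first_stage": first_stage, "late": itt / first_stage, "naive": naive}


if __name__ == "__main__":
    for name, value in estimates(simulate_lottery()).items():
        print(f"{name}: {value:.3f}")
```

Because a quarter of this simulated population are compliers by construction, the LATE comes out at roughly four times the ITT. The naive enrollee versus non-enrollee comparison lands below the LATE here only because the simulation assumes enrollees are sicker; flip that assumption and the bias flips too.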
Lastly, notice how complicated RCT interpretation is? Yes, it’s the gold standard, but it still has issues. Using an IV approach for a LATE estimate is, in my view, about the best you can do. But there may be bias when considering generalizing the findings outside the “local” effect of the instrument (lottery or random assignment). These concerns arise with any IV study. In this sense, IV and RCT are much closer cousins than one tends to think. Disparage one and you disparage the other.
Not all that’s gold glitters, but it is still valuable.
The study I wrote about earlier this week by Hadley et al. is just one of many to apply instrumental variables (IV) to analysis of cancer treatment (prostate in that case). Zeliadt and colleagues do so as well (also for prostate cancer) and cite several others. Both the Hadley and Zeliadt studies exploit practice pattern variation, specifically differences in prior year(s) rates of treatment across areas, to define IVs.
For you to buy the results, you have to believe that lagged treatment rates strongly predict actual treatment (this can be shown) and, crucially, are not otherwise correlated with outcomes, controlling for observable factors (this mostly requires faith). I would not believe the IVs valid if there were clear, accepted standards about whether and what treatment is best. If that were so, then treatment rates could be correlated with quality, broadly defined. Higher quality care might be expected in areas that follow the accepted standard more closely. Better outcomes could be due to broadly better care, not just to the particular treatment choice.
However, in prostate cancer, there is no standard about what treatment is best. I accept the IVs as valid in this case.
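The two conditions can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Hadley or Zeliadt method: the data are invented, `simulate_patients`, `iv_slope`, and the other names are hypothetical, and the estimator is the simplest single-instrument ratio of covariances rather than anything those papers report. The simulation builds in the exclusion restriction (the lagged area rate affects outcomes only through treatment), which is exactly the part that, in real data, requires faith.

```python
import random


def simulate_patients(n_areas=300, per_area=200, true_effect=0.5, seed=1):
    """Hypothetical patients nested in areas that differ in prior-year treatment rates."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n_areas):
        lagged_rate = rng.uniform(0.2, 0.8)  # instrument: prior-year share treated in the area
        for _ in range(per_area):
            frailty = rng.gauss(0, 1)        # unobserved; lowers both treatment odds and outcomes
            latent = lagged_rate - 0.15 * frailty + rng.gauss(0, 0.3)
            treated = 1.0 if latent > 0.5 else 0.0
            outcome = true_effect * treated - frailty + rng.gauss(0, 1)
            z.append(lagged_rate)
            x.append(treated)
            y.append(outcome)
    return z, x, y


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def naive_slope(x, y):
    """Ordinary regression slope of outcome on treatment (confounded by frailty)."""
    return cov(x, y) / cov(x, x)


def iv_slope(z, x, y):
    """Single-instrument IV estimate: cov(instrument, outcome) / cov(instrument, treatment)."""
    return cov(z, y) / cov(z, x)


def first_stage_corr(z, x):
    """Correlation between instrument and treatment; the part that 'can be shown'."""
    return cov(z, x) / (cov(z, z) * cov(x, x)) ** 0.5


if __name__ == "__main__":
    z, x, y = simulate_patients()
    print(f"first-stage correlation: {first_stage_corr(z, x):.3f}")
    print(f"naive estimate: {naive_slope(x, y):.3f}")
    print(f"IV estimate: {iv_slope(z, x, y):.3f}")
```

In this setup the naive estimate overstates the treatment effect because healthier patients are more likely to be treated, while the IV estimate recovers something close to the effect that was built in. If area treatment rates were also correlated with quality of care, the IV estimate would be biased as well.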
Among the other cancer treatment IV studies I found, some of which Zeliadt cites, several also exploit practice pattern variations:
Lu-Yao et al.: Again, prostate cancer, and, notably, appearing in JAMA. Yes, JAMA published an IV study based on practice pattern variation. More on why I am excited about that below.
I cannot say whether practice patterns make for valid IVs for breast and lung cancer at the time the Brooks and Earle studies were published. I’d have to think about it, and I have not. I merely note that exploiting practice pattern variation for IV studies is not novel, though it is not widely accepted either, particularly in medical journals. I think it should be, though only for cases for which a good argument about validity can be made, as I believe it can be for prostate cancer and, I am sure, some other conditions.
Of course I would prefer to see more randomized controlled trials (RCTs) on all the areas of medicine in need of additional evidence. But those areas are, collectively, a massive territory. We neither have the time nor have we demonstrated a willingness to spend the money required to conduct RCTs in all areas. We have to prioritize. For cases for which IV studies are likely to be reasonably valid, we ought to apply the technique, not necessarily instead of an RCT — though with resource constraints, such an argument could be made — but certainly in advance of one.
IV studies are cheaper, faster, and offer other advantages. They don’t require enrollment of patients. They can exploit the large, secondary data sets already in existence (Medicare, Medicaid, VA, commercial payers, hospital systems, and the like). As such, they permit stratification by key patient demographics that RCTs are often underpowered to support. Even when an RCT is warranted, a good IV study done in advance can help to refine questions and guide hypotheses.
Given the vast need for evidence that overwhelms our capacity to provide it via RCTs, there isn’t a good argument for not doing IV studies in cases for which they are justifiably valid. However, part of the package of scaling up an IV research agenda is publishing the findings in top journals: not just health economics journals, but also top medical journals like JAMA. This will require more clinical reviewers of manuscripts to gain comfort with the IV approach (start here). It will also require medical journals to solicit reviews by those who can vouch for instruments’ validity or point out when they are unlikely to be so.
It’s hard and expensive to create purposeful randomness, as is required in an RCT. Yet, there is so much natural randomness around. We should be exploiting it. Good quasi-randomness is a terrible thing to waste.
I thought I had blogged on this paper before, but I can’t find a prior post. So, here are some quotes and brief comments on Zeliadt, S. B., Ramsey, S. D., Penson, D. F., Hall, I. J., Ekwueme, D. U., Stroud, L., & Lee, J. W. (2006). Why do men choose one treatment over another? Cancer, 106(9), 1865-1874.
In the largest study we reviewed, which involved 1000 patients, approximately 42% of patients defined an effective treatment as one that extended expected survival or delayed disease progression, whereas 45% indicated that effectiveness meant preservation of quality of life (QOL).5 This is in contrast to physicians, 90% of whom defined effectiveness as extending expected survival. In another study, fewer than 20% of patients ranked either “effect of treatment on length of life” or “chances of dying of cancer” as 1 of the 4 most important factors in making a decision.26 In 1 study of health state preferences, 2 of 5 men were unconditionally willing to risk side effects for any potential gain in life expectancy.64 These studies suggest that there is substantial variation in the significance that patients place on cancer eradication, and that treatment efficacy means more than “control” of the tumor for many patients.
Concerns regarding cancer eradication appear to correlate directly with aggressiveness of therapy, with radical prostatectomy being the choice preferred by the majority of patients who focus on cancer control.
Side effects like incontinence and impotence are frequently cited concerns, as reported in the paper. However,
To our knowledge, there is limited information available regarding how men balance side effects in making their treatment decision. For example, although preservation of sexual function was rated as very important by 90% of men age younger than 60 years, and 79% of men age 75 years and older, in a separate question only 3% of these same men indicated that “having few side effects” was the most important consideration in initiating therapy.5 Fear of side effects was also stated by only 3% of men in a study in North Carolina, in which the majority of patients were black.8 Srirangam et al.45 reported that although 55% of spouses reported that side effects were important, only 6% indicated that side effects were deciding factors. One study comparing surgery and brachytherapy reported that 25% of patients chose between these 2 options based on the side effect profile.9 In addition, although Holmboe and Concato10 found that 49% of patients were concerned with incontinence and 38% were concerned with impotence, only 13% reported weighing the risks and benefits of treatment. These studies demonstrate the apparent disconnect between patients’ stated importance of side effects and the role that they actually play in reaching the final treatment decision.
Ultimately, it’s what patients actually do, not what they say, that matters. Therefore, side effects may be less of a relevant factor in treatment decisions than is commonly believed. Put another way, that something is a concern doesn’t imply that it changes one’s decision. That does not take away from the fact that concerns are psychologically important.
Less frequently discussed are concerns about other potential complications.
Fear of surgical complications was emphasized by some men who selected watchful waiting. 7 A different study found that complications due to surgery were of concern to 12% of patients when considering surgery.3 A belief that radiation is harmful rather than therapeutic was offered by some men who selected surgery.44 When considering radiation therapy, 21% of men indicated concern about skin burns.3 Long recovery times were cited by 17% of patients.10 For a small percentage of men, issues such as fear of surgery or radiation appeared to be the primary factor in their decision regarding treatment.
One reason complications and side effects may play a relatively small role in treatment decisions is that physicians are playing a large role in influencing those decisions.
The role of the physician recommendation has received considerable attention in prostate cancer decision making due to the widely recognized preferences held by each physician specialty. As might be expected, opinions regarding the optimal treatment for localized prostate cancer vary among urologists, radiation oncologists, oncologists, and general practitioners. Urologists nearly universally indicate that surgery is the optimal treatment strategy, and radiation oncologists similarly indicate that radiation therapy is optimal.78
To the extent that treatment choice is driven by physicians and is otherwise unrelated to outcomes, it suggests an opportunity for a valid instrument. This is what Hadley, et al. exploited.
The paper continues with an exploration of the role of family members, race, socioeconomic, and psychological factors in treatment choice. Some of these (e.g., family relationships, socioeconomic factors, psychological factors) are likely to be incompletely observed and are, therefore, additional possible reasons why instrumental variables and RCT results differ from naive, observational studies. The key is that they may also be related to outcomes. It’s not too hard to imagine they could be.
We selected 14,302 early-stage prostate cancer patients who were aged 66–74 years and had been treated with radical prostatectomy or conservative management from linked Surveillance, Epidemiology, and End Results–Medicare data from January 1, 1995, through December 31, 2003. Eligibility criteria were similar to those from a clinical trial used to benchmark our analyses. Survival was measured through December 31, 2007, by use of Cox proportional hazards models. We compared results from the benchmark trial with results from models with observational data by use of traditional multivariable survival analysis, propensity score adjustment, and instrumental variable analysis.
This is an important exercise. Here’s why:
The randomized controlled trial is considered the most valid methodology for assessing treatments’ efficacy. However, randomized controlled trials are costly, time consuming, and frequently not feasible because of ethical constraints. Moreover, some randomized controlled trial results have limited generalizability because of differences between randomized controlled trial study populations, who may be screened for eligibility on the basis of age and comorbidities, and community populations, who are likely to be much more heterogeneous with regard to health conditions and socioeconomic characteristics.
To this, add that RCTs are often under-powered for stratification by key patient characteristics. This is where observational studies shine. Of course, biased selection is the principal concern with observational studies.
Patient selection into specific treatments is an important consideration in all observational studies, but particularly for those in prostate cancer, because incidence is highest in the elderly who are also most likely to have multiple comorbidities.
Observational study techniques are not equivalent in their ability to address the selection problem.
Observational studies (1,11–13) have previously used traditional regression and propensity score methods to evaluate associations between specific prostate cancer treatments with survival. In these studies, the propensity score methods did not completely balance (ie, equalize) important patient characteristics such as tumor grade, size, and comorbidities across treatment groups. Furthermore, patients who received active treatment had better survival for noncancer causes of death than patients who received conservative management, indicating that unobserved differences between groups affected both treatment choice and survival.
Instrumental variable analysis is a statistical technique that uses an exogenous variable (or variables), referred to as an “instrument,” that is hypothesized to affect treatment choice but not to be related to the health outcome (14–17). Variations in treatment that result from variations in the value of the instrument are considered to be analogous to variations that result from randomization and so address both observed and unobserved confounding. Instrumental variable analysis has been used with observational data to investigate clinical treatment effects among patients with breast cancer (18–20), lung cancer (21), or prostate cancer (5,22).
The study findings support the use of instrumental variables.
Propensity score adjustments resulted in similar patient characteristics across treatment groups, and survival was similar to that of traditional multivariable survival analyses. The instrumental variable approach, which theoretically equalizes both observed and unobserved patient characteristics across treatment groups, differed from multivariable and propensity score results but were consistent with findings from a subset of elderly patient with early-stage disease in the randomized trial.
The authors’ preferred instrument captures practice pattern variation.
We constructed the primary instrumental variable for treatment received by use of a two-step process. First, we used the entire dataset (n = 17,815) to estimate the probability of receiving conservative management as a function of patients’ clinical characteristics (tumor stage and grade, NCI comorbidity index, and Medicare reimbursements for medical care in the previous year), demographics (age, race, ethnicity, and marital status), year of diagnosis, and all possible interactions among these variables. Second, we calculated the difference between the actual proportion of patients receiving conservative management and the average predicted probability of receiving conservative management (generated from the logistic regression model) in each hospital referral region by year. Areas with relatively large positive differences between the actual and predicted proportions of patients receiving conservative management favor a conservative management treatment pattern, and areas with large negative differences between the actual and predicted proportions of patients receiving conservative management favor a radical prostatectomy treatment pattern. We then lagged this measure of the local area treatment pattern by 1 year and linked it to each patient in the analysis to enhance the instrument’s independence from patients’ current health and unobserved characteristics.
Treatment propensity (ie, the predicted probability of receiving conservative management) for the propensity score analysis and for constructing the lagged area treatment pattern for the instrumental variable analysis was estimated with logistic regression. The survival models were estimated with Cox proportional hazard models. Visual inspection of the parallelism of the Kaplan–Meier plots of the logarithms of the estimated cumulative survival models by treatment supported the proportionality assumption. The instrumental variable version of the Cox hazard model was estimated with the two-stage residual inclusion method (38), which has been shown to be appropriate for nonlinear outcome models. […]
[The instrument’s] independence of the survival outcomes was confirmed by its lack of statistical significance as an independent variable in an alternative version (data not shown) of the Cox survival models.
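The second step of the instrument construction quoted above, plus the one-year lag, can be sketched in a few lines. This is an illustration only, not the authors' code: the record layout and the names `area_year_pattern`, `attach_lagged_instrument`, `conservative`, and `p_hat` are hypothetical, and `p_hat` is assumed to already hold each patient's predicted probability of conservative management from the first-step logistic regression.

```python
from collections import defaultdict


def area_year_pattern(records):
    """Actual share minus mean predicted probability of conservative management, by (area, year)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # [sum of actual, sum of predicted, count]
    for r in records:
        t = totals[(r["area"], r["year"])]
        t[0] += r["conservative"]  # 1 if the patient received conservative management, else 0
        t[1] += r["p_hat"]         # assumed output of the first-step logistic model
        t[2] += 1
    return {key: (actual - predicted) / n for key, (actual, predicted, n) in totals.items()}


def attach_lagged_instrument(records, lag=1):
    """Link each patient to the area's treatment pattern from `lag` years earlier.

    Patients whose area has no data for the earlier year are dropped in this sketch.
    """
    pattern = area_year_pattern(records)
    linked = []
    for r in records:
        key = (r["area"], r["year"] - lag)
        if key in pattern:
            linked.append({**r, "instrument": pattern[key]})
    return linked


if __name__ == "__main__":
    toy = [
        {"area": "A", "year": 1995, "conservative": 1, "p_hat": 0.5},
        {"area": "A", "year": 1996, "conservative": 0, "p_hat": 0.4},
    ]
    print(attach_lagged_instrument(toy))
```

A positive instrument value marks an area that used conservative management more than its case mix would predict in the prior year; a negative value marks an area leaning toward radical prostatectomy.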
One acknowledged limitation, among many, is that PSA values were not available to the researchers. Another is that
a complete statistical assessment of the Cox hazard model’s proportionality assumption indicated that the effects of some covariates may not be time invariant, especially in the analysis of all-cause mortality. Although a sensitivity analysis of the effects of allowing time-varying covariates did not alter the principal findings with regard to treatment effects, further analysis of time-varying effects may be warranted.
All in all, a very nice paper. It’s worth a full read by observational researchers.