• Methods: Imbens on Campbell and Rubin

    Here’s Guido Imbens’ commentary in that excellent 2010 issue of Psychological Methods I hyped (twice) a couple of weeks ago. The whole commentary is worth reading, and I highlighted two passages:

    As a side note, it is interesting to contrast the profound impact the approaches of Campbell and Rubin have had on empirical work in economics with that of the graphical approach. Although the graphical approach to causality has been around for more than 2 decades (Pearl, 1995, 2009a; Spirtes, Glymour, & Scheines, 2001), it has had virtually no impact on practice in economics. Whereas Pearl (2009b) appears to see this as a lack of open mindedness in economics, the fast and widespread adoption of aspects of Campbell’s work and Rubin’s approach suggests the willingness of economists to adopt new methods, as long as the benefits are transparent. My personal view is that the proponents of the graphical approach, unlike Campbell and Rubin, have not demonstrated convincingly to economists that adopting (part) of their framework offers sufficient benefits relative to the framework currently used by economists.

    It’s been 15 years since I spent any time contemplating “the graphical approach.” At that time, I was far more interested in prediction than causal inference. (These are different in the sense that correlations that have nothing to do with causation can still be helpful for some kinds of prediction problems but, by definition, confound causal inference.) Recently it was suggested to me that I reconsider graphical models. Given the opportunity cost of doing so, however, Imbens’ implicit “don’t bother” is highly influential.


    Unconfoundedness implies that comparisons of outcomes for units that differ in terms of treatment status but are homogeneous in terms of observed covariates have a causal interpretation. In other words, if we find a pair of units with the same covariate values, one treated and one control, then the difference in outcomes is unbiased for the average effect of the treatment for units with those values of the covariates.

    Let me make some comments on this, because the assumption has generated much more controversy in economics than one might expect. This is partly because the assumption that units that look alike in terms of observed characteristics but that are in different treatment regimes are directly comparable is often suspect. If these units look alike in terms of background characteristics, but they made different choices, it must be because they are, in fact, different in terms of unobserved characteristics. In other words, if they were the same in terms of all relevant characteristics, why would they make different choices? The underlying concern among economists is that such an assumption may be difficult to reconcile with optimal behavior by individuals.

    This is precisely my beef with those who find convincing observational studies of insurance status (e.g., Medicaid vs. uninsured) that don’t control for selection on unobservables. Even controlling for a wide variety of observable characteristics, one is left with essentially the conundrum that Imbens raises. Why did two otherwise seemingly identical individuals make such different choices? Is it plausible that whatever caused them to do so (e.g., unobservable disease severity or community support) could also affect the health outcomes that are the focus of such studies? Yes, it is. This is one of the most useful ways in which economists think that many others do not.
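    To make that conundrum concrete, here’s a small simulation. (All variable names, coefficients, and effect sizes below are invented for illustration; they don’t come from Imbens or from any insurance study.) Even after exactly matching on an observed covariate, an unobserved factor that drives both the treatment choice and the outcome biases the comparison; only matching on the unobservable too recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_effect = 1.0

X = rng.integers(0, 2, n)          # observed covariate
U = rng.integers(0, 2, n)          # unobserved confounder (e.g., disease severity)
p_treat = 0.2 + 0.3 * X + 0.4 * U  # "choice" depends on both
T = rng.random(n) < p_treat
Y = true_effect * T + 2.0 * X + 2.0 * U + rng.normal(0, 1, n)

def stratified_diff(Y, T, strata):
    # average treated-minus-control difference within strata,
    # weighted by stratum size
    diffs, weights = [], []
    for s in np.unique(strata):
        m = strata == s
        diffs.append(Y[m & T].mean() - Y[m & ~T].mean())
        weights.append(m.sum())
    return np.average(diffs, weights=weights)

est_obs = stratified_diff(Y, T, X)          # matches on X only: biased up
est_all = stratified_diff(Y, T, 2 * X + U)  # also matches on U: ~1.0

print(f"matched on X only: {est_obs:.2f}")
print(f"matched on X and U: {est_all:.2f}")
```

    In this setup the X-only comparison overstates the effect by nearly a factor of two, which is exactly the worry: units that look alike on observables but chose differently differ on unobservables that also move the outcome.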


    Comments closed
  • I don’t think “slowdown” means what you think it means

    Via Amitabh Chandra:



    Comments closed
  • Vacation

    Starting now, I’ll be off the internet and without screens for a week. I do this annually and love it. (Yeah, I had a bonus week last month, but normally this is a once-per-year event.)

    As I’ve said before, you should try this too. I find that when I come back, I get about 7-10 days during which almost nothing irritates me. I more accurately (in my view) recognize most of the chatter designed to play on my emotions as noise (yes, I’m looking at you, social media, and you, traditional media). It’s a great way to live!

    When I return, I expect you all to have fully sorted out all Halbig-related issues. Talk amongst yourselves.


    Comments closed
  • The regions of Australia

    Via tastefullyoffensive:

    [Image: the regions of Australia, drawn as a dog and a cat]


    Comments closed
  • Methods (kinda): Rubin on Rubin and Campbell

    Yesterday I encouraged you to read at least the paper by Stephen West and Felix Thoemmes if not all the papers on Campbell’s and Rubin’s causal frameworks in this 2010 issue of Psychological Methods. I also encourage you to read the response by Rubin. It’s much shorter and so much fun. Here are my highlights.

    Because my doctoral thesis was on matched sampling in observational studies under Cochran, I thought that I understood the general context fairly well, and so I was asked by the Educational Testing Service to visit Campbell at Northwestern University in Evanston, Illinois, which is, incidentally, where I grew up. I remember sitting in his office with, I believe, one or two current students or perhaps junior faculty. The topic of matching arose, and my memory is that Campbell referred to it as “sin itself” because of “regression to the mean issues” when matching on fallible test scores rather than “true” scores. I was flabbergasted!

    Rubin later showed that he was correct about matching, but also that Campbell was not wrong: Rubin had misunderstood Campbell’s objection.

    Of course, the situation with an unobserved covariate used for treatment assignment is far more complex, and that situation, coupled with the naive view that matching can fix all problems with nonrandomized studies, appears to have been the context for Campbell’s comment on matching.

    (I may put up a methods post on matching at some point, though I haven’t decided.)

    The drive for clarity in what one is trying to do expressed in this passage resonates deeply:

    Perhaps because of my physics background, it seemed to me to make no sense to discuss statistical methods and estimators without first having a clear concept of what one is attempting to estimate, which, I agree with Shadish (2010), was a limitation of Campbell’s framework. Nevertheless, Campbell is not alone when implicitly, rather than explicitly, defining what he was trying to estimate. A nontrivial amount of statistical discussion (confused and confusing to me) eschews the explicit definition of estimands. [...] My attitude is that it is critical to define quantities carefully before trying to estimate them.

    Elsewhere in the paper, Rubin reveals that even Campbell did not think very highly of his own ability to do math. Rubin, by contrast, studied physics with John Wheeler at Princeton, which one can’t do without a lot of math ability and confidence in it.

    Later in the paper he has a very nice discussion of the stable unit treatment value assumption (SUTVA), which I won’t repeat here. Very roughly, the aspect of it that’s relevant below is that there be one treatment (or at least a clearly defined set of them), not a vague, uncountable cloud of them. (See also Wikipedia.) It’s due to this assumption that the problem of, say, the causal effect of gender on wages is “ill defined,” as I raised in my prior post.

    For example, is the statement “She did well on that literature test because she is a girl” causal or merely descriptive? If [being assigned to the "control" group] means that this unit remains a girl and [being assigned to the "treatment" group] means that this unit is “converted” to a boy, the factual [the outcome from assignment to "control"]  is well defined and observed, but the counterfactual [outcome due to "treatment"] appears to be hopelessly ill-defined and therefore unstable. Does the hypothetical “converted to a boy” mean an at-birth sex-change operation, or does it mean massive hormone injections at puberty, or does it mean cross-dressing from 2 years of age, and so forth? Only if all such contemplated hypothetical interventions can be argued to have the same hypothetical [outcome] will the requirement of SUTVA that there be no hidden versions of treatments be appropriate for this unit.

    But this does not mean there can be no well-defined study of the causal effects of gender.

    An example of a legitimate causal statement involving an immutable characteristic, such as gender or race, occurs when the unit is a resume of a job applicant sent to a prospective employer, and the treatments are the names attached to the resume, either an obviously Anglo Saxon name ["control"] or an obviously African American name ["treatment"].

    The key here is that though you can’t, in any well-defined and unique way, imagine changing the gender of a person, you can imagine changing the gender as listed on a person’s resume.

    Later still, Rubin explains how, before his work, the “observed outcome notation” that had been the norm made it impossible to be clear how and why certain designs permit unbiased estimates. You really have to read the paper (at least) to see this. I’m still not sure I get it, but I believe him!

    To repeat, using the observed outcome notation entangles the science [all the potential outcomes and observable factors] and the assignments [the mechanism by which observed outcomes are selected among potential ones]—bad! Yet, the reduction to the observed outcome notation is exactly what regression approaches, path analyses, directed acyclic graphs, and so forth essentially compel one to do. For an example of the confusion that regression approaches create, see Holland and Rubin (1983) on Lord’s paradox or the discussion by Mealli and Rubin (2003) on the effects of wealth on health and vice versa. For an example of the bad practical advice that the directed acyclic graph approaches can stimulate, see the Rubin (2009) response to letters in Statistics in Medicine. [...]

    To borrow Campbell’s expression, I believe that the greatest threat to the validity of causal inference is ignoring the distinction between the science and what one does to learn about the science, the assignment mechanism—a fundamental lesson learned from classical experimental design but often forgotten. My reading of Campbell’s work on causal inference indicates that he was keenly aware of this distinction.

    (I may read and then post on Lord’s paradox. I don’t know what it is yet.)


    Comments closed
  • Methods: Flavors of validity plus a ton of bonus content

    Before I get to the main subject of this post, I want to encourage you to read in full the papers about the frameworks and methods of Campbell and Rubin in this 2010 issue of Psychological Methods. (If you only have time to read one, I recommend that by Stephen West and Felix Thoemmes.) The papers cover a wide range of issues pertaining to causal inference in experimental and observational study designs. To my eye, they do so very well and with almost no math. (I illustrate the style of math used below.)

    Though there are a number of differences and similarities between Campbell’s and Rubin’s frameworks, a few are emphasized:

    • Campbell put greater emphasis on employing study design to mitigate threats to validity (about which more below). Rubin emphasized statistical methods to remedy defects that threaten validity.
    • Campbell’s framework focused more on the direction of causal effects; Rubin was concerned with their magnitude as well.

    Now, about “validity” and its various types, Stephen West and Felix Thoemmes wrote,

    We designate X as an indicator of treatment (e.g., 1 = Treatment [T]; 0 = Control [C]) and Y as the outcome (dependent) variable. The central concern of internal validity is whether the relationship between the treatment and the outcome is causal in the population under study. Does the manipulation of X produce change in Y? Or, does some other influence produce change in Y? Note that internal validity does not address the specific aspect(s) of the treatment that produce the change nor the specific aspect(s) of the outcome in which the change is taking place—nor does it address whether the treatment effect would hold in a different setting, with a different population, or at a different time. These issues are questions of construct validity and external validity, respectively.

    Granted, that was a bit rushed, so here’s William Shadish’s take on flavors of validity:

    1. Statistical conclusion validity: The validity of inferences about the correlation (covariation) between treatment and outcome.

    2. Internal validity: The validity of inferences about whether observed covariation between A (the presumed treatment) and B (the presumed outcome) reflects a causal relationship from A to B, as those variables were manipulated or measured.

    3. Construct validity: The validity with which inferences can be made from the operations in a study to the theoretical constructs those operations are intended to represent.

    4. External validity: The validity of inferences about whether the cause – effect relationship holds over variation in persons, settings, treatment variables, and measurement variables.

    Originally, Campbell (1957) presented 8 threats to internal validity and 4 threats to external validity. The lists proliferated, although they do seem to be reaching an asymptote: Cook and Campbell (1979) had 33 threats, and Shadish et al. (2002) had 37.

    That’s about all I care to post/quote about validity. (As with all my methods posts, you should read the papers or a textbook for details.) Now for some bonus, though related, coverage of some of the contents of two papers in that Psychological Methods issue.

    Stephen West and Felix Thoemmes conveyed the setup of Rubin’s causal model as follows:

    Formally, each participant’s causal effect, the individual treatment effect, is defined as YT(u) – YC(u), where YT(u) represents the response Y of unit u to treatment T, and YC(u) represents the response of unit u to treatment [or control] C. Comparison of these two outcomes provides the ideal design for causal inference. [...] Unfortunately, this design is a Platonic ideal that can never be achieved in practice.

    Why? Because for each individual unit, u, we never know the effects of both treatment arms, T and C, under precisely identical conditions. We observe, at most, one. The other (or some estimate of it) must be inferred by other means. This is the entire problem of causal inference.

    The model makes it clear that we can observe two sets of participants: (a) Group A given T and (b) Group B given C. A and B may be actual pre-existing groups (e.g., two communities) or they may be sets of participants who have selected or have been assigned to receive the T and C conditions, respectively. Of key importance, we also need to conceptualize the potential outcomes in two hypothetical groups: (c) Group A given C and (d) Group B given T. Imagine that we would like to compare the mean outcome of the two treatments. Statistically, in terms of the ideal design what we would ideally like to have is an estimate of either μT(A) – μC(A) or μT(B) – μC(B) [actually, ideally, both] where A and B designate the group[s] to which the treatment [and control, respectively] was given. Both Equations [] represent average causal effects. Of importance, note that [they] may not represent the same average causal effect; Groups A and B may represent different populations. [...]

    [W]hat we would like to estimate is a weighted combination λ[μT(A) – μC(A)] + (1 –λ)[μT(B) – μC(B)], where [...] λ is the proportion of the population that is in the treatment group. [...]

    What we have [from study data] in fact is the estimate of μT(A) – μC(B). [...]

    For observed outcomes, only half of the data we would ideally like to have can be observed; the other half of the data is missing. This insight allows us to conceptualize the potential outcomes as a missing data problem and focuses attention on the process of assignment of participants to groups as a key factor in understanding problems of selection.

    Basically, the entire enterprise of causal inference is to design and employ methods to better estimate (in the sense of minimizing the threats to validity defined above) the unobserved counterfactual means μC(A) and/or μT(B) or, what amounts to the same thing, their difference from those that are observed.
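    Here’s a small simulation of that missing-data view. (The variable names and numbers are my own invention, not from the paper.) We generate the full “science”, both potential outcomes for every unit, let a selection mechanism reveal only half of it, and compare the naive observed contrast μT(A) – μC(B) to the true average effect, which requires the missing half.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# The "science": both potential outcomes for every unit u.
frailty = rng.normal(0, 1, n)   # hypothetical trait driving selection
Y_C = 10.0 + 2.0 * frailty      # outcome under control, Y_C(u)
Y_T = Y_C + 1.0                 # outcome under treatment; true effect = 1

# Assignment mechanism: frailer units tend to opt into treatment (group A).
A = frailty + rng.normal(0, 1, n) > 0

# Observation reveals only half the science: Y_T for A, Y_C for B.
naive = Y_T[A].mean() - Y_C[~A].mean()  # mu_T(A) - mu_C(B)
true_ate = (Y_T - Y_C).mean()           # needs the missing half

print(f"true average effect:     {true_ate:.2f}")
print(f"naive observed contrast: {naive:.2f}")
```

    Because group A is frailer than group B, the observed contrast badly overstates the true effect of 1.0, which is why the assignment mechanism, not just the observed outcomes, has to be part of the analysis.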

    I found the following fascinating:

    The researcher will need to conceptualize carefully the alternative treatment that the individual could potentially receive (i.e., compared with what?). Of importance [] this definition makes it difficult to investigate the causal effects of individual difference variables because we must be able to at least conceptualize the individual difference (e.g., gender) as two alternative treatments. If we cannot do this, Rubin (1986) considers the problem ill defined.

    Wow! So, for example, Rubin (1986) would consider the problem of the causal effect of gender on, say, wages “ill defined” because there is no conceivable possibility of a male being female or vice versa in the same sense in which someone can take a drug or not. I probably don’t need to point out that the causal effect of gender on wages is a significant research question of considerable cultural, policy, and political import at the moment. What exactly it means for it to be “ill defined” I don’t know, though I could speculate. But I’ve downloaded Rubin (1986) and one day I may read it and find out.

    Here are a couple of passages I highlighted in the paper by William Shadish:

    [Campbell] is skeptical of the results of any single study, encouraging programs of research in which individual studies are imbued with different theoretical biases and, more important, inviting criticisms of studies by outside opponents who are often best situated to find the most compelling alternative explanations.

    Endorse! Also,

    The regression discontinuity design [] was invented in the 1950s by Campbell (Thistlethwaite & Campbell, 1960), but a statistical proof of its unbiased estimate was provided by Rubin (1977) in the 1970s (an earlier unpublished proof was provided by Goldberger, 1972).

    This I did not know. I may post more on the content of a few other papers in the collection. I’m still working my way through them.


    Comments closed
  • Medicare Advantage is not efficient, but here’s how it can be

    The following originally appeared on The Upshot (copyright 2014, The New York Times Company).

    Medicare Advantage plans — private plans that serve as alternatives to the traditional, public program — have been growing in popularity. One reason is that they offer additional benefits beyond those available in the traditional program but often at no additional cost to beneficiaries.

    This is a great deal for beneficiaries, but a bad one for taxpayers, who have to cover the extra cost. If the program were reorganized to more closely resemble the Affordable Care Act’s exchanges, it could still provide good value to consumers at a lower cost to taxpayers.

    The standard explanation for how Medicare Advantage plans are able to offer more has two parts, both problematic.

    1) Private plans are more efficient than the public program; they can buy more care for fewer dollars and can manage care so patients use health services with less waste. This leaves more headroom to fund extra benefits.

    2) Per person covered, plans are paid well above the average cost of providing the Medicare benefit; they can turn this payment surplus into additional benefits.

    On the first point, the assertion that Medicare Advantage plans are, on average, more efficient than traditional Medicare has little support. According to the Medicare Payment Advisory Commission, which advises Congress on Medicare payment policy, in 2014 Medicare Advantage plans could provide the same benefits for 2 percent less than the cost of traditional Medicare.

    But this analysis does not account for the fact that Medicare Advantage enrollees are at least a little bit healthier than traditional Medicare enrollees, as many studies have shown. Thus at least some of the difference in cost is due to the type of individuals drawn to Medicare Advantage, not to greater efficiency of private plans. When this is factored in, it’s unlikely that Medicare Advantage has an efficiency advantage over traditional Medicare, on average.

    The second claim is true, up to a point: Medicare Advantage plans do turn some of the higher payments they receive into extra benefits.

    But at least three studies suggest that for each dollar of these higher payments that plans receive, beneficiaries get only a fraction of a dollar of value. A study by Harvard scholars found that a $1 increase in payment translates to at most 50 cents in additional benefits. Another by researchers from the University of Pennsylvania found that only 20 cents of each additional dollar in plan payment is converted into better coverage. Finally, my own work with my colleagues Steven Pizer of Northeastern University and Roger Feldman of the University of Minnesota found that only 14 cents per dollar of additional payment benefits Medicare Advantage enrollees.

    Whether it’s 14, 20 or 50 cents on the dollar, Medicare beneficiaries are not getting the full value of taxpayers’ largess. Sure, something is better than nothing, which is why many beneficiaries passionately defend the Medicare Advantage program. But could that deal be improved?

    One way is to make Medicare Advantage plans (and traditional Medicare as well) compete more vigorously for enrollees. Today, plans receive a government subsidy according to an administratively set formula that does a poor job of matching payments to actual costs.

    An alternative based on the Affordable Care Act’s structure could work something like this: Plans would submit bids for covering the Medicare benefit, as they do today. Then, instead of essentially paying each plan its bid plus some extra (which is more or less what happens today), the government would pick one of the cheaper bids (for example, the second lowest) and pay just that amount to all plans. [See analysis of such a plan by Roger Feldman, Robert Coulam, and Bryan Dowd.]

    This is similar to how subsidies work in the Affordable Care Act’s marketplaces, which are pegged to the cost of the second-cheapest silver plan, and in the Medicare Part D prescription-drug program. Someone who wants a plan that costs more than the government payment must pay the difference out of pocket. Extra subsidies would be provided for low-income consumers, the same as in the Affordable Care Act and Part D.
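    To see the mechanics, here’s a toy calculation with made-up bids (the plan names and dollar figures are illustrative only, not actual Medicare Advantage bids):

```python
# Hypothetical monthly bids for covering the Medicare benefit.
bids = {"Plan A": 800, "Plan B": 850, "Plan C": 900, "Plan D": 1000}

# The government pays the second-lowest bid to every plan.
benchmark = sorted(bids.values())[1]

# Enrollees in pricier plans pay the difference out of pocket.
premiums = {plan: max(0, bid - benchmark) for plan, bid in bids.items()}

for plan, bid in bids.items():
    print(f"{plan}: bid ${bid}, enrollee premium ${premiums[plan]}")
```

    With these numbers the benchmark is $850: Plans A and B cost enrollees nothing extra, while C and D carry $50 and $150 premiums. That premium gap is what would push plans to compete on bids.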

    With such a structure in place, plans would compete more vigorously. Being the second-cheapest or cheapest plan would be a huge advantage. Other plans would have to charge a premium. Plans would not offer extra stuff beneficiaries might not value highly.

    That’s a far cry from today’s Medicare Advantage, in which extra benefits are worth far less to beneficiaries than what taxpayers pay for them. It’s an inefficiency we know how to correct, but, to date, we’ve been unwilling to.

    Comments closed
  • Methods: Propensity scores

    Forthcoming in Health Services Research (and available now via Early View), Melissa Garrido and colleagues explain propensity scores. I’ve added a bit of emphasis on a key point.

    Propensity score analysis is a useful tool to account for imbalance in covariates between treated and comparison groups. A propensity score is a single score that represents the probability of receiving a treatment, conditional on a set of observed covariates. [...]

    Propensity scores are useful when estimating a treatment’s effect on an outcome using observational data and when selection bias due to nonrandom treatment assignment is likely. The classic experimental design for estimating treatment effects is a randomized controlled trial (RCT), where random assignment to treatment balances individuals’ observed and unobserved characteristics across treatment and control groups. Because only one treatment state can be observed at a time for each individual, control individuals that are similar to treated individuals in everything but treatment receipt are used as proxies for the counterfactual. In observational data, however, treatment assignment is not random. This leads to selection bias, where measured and unmeasured characteristics of individuals are associated with likelihood of receiving treatment and with the outcome. Propensity scores provide a way to balance measured covariates across treatment and comparison groups and better approximate the counterfactual for treated individuals.

    Propensity scores can be thought of as an advanced matching technique. For instance, if one were concerned that age might affect both treatment selection and outcome, one strategy would be to compare individuals of similar age in both treatment and comparison groups. As variables are added to the matching process, however, it becomes more and more difficult to find exact matches for individuals (i.e., it is unlikely to find individuals in both the treatment and comparison groups with identical gender, age, race, comorbidity level, and insurance status). Propensity scores solve this dimensionality problem by compressing the relevant factors into a single score. Individuals with similar propensity scores are then compared across treatment and comparison groups.
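    Here’s a small simulation of that dimension reduction. (The covariate names, coefficients, and effect sizes are all invented for illustration.) Because the data are simulated, the true propensity score is known; in practice it would be estimated, for example by logistic regression. Stratifying on the single score stands in for matching on all the covariates at once.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Several covariates that drive both treatment choice and outcome.
age = rng.normal(70, 8, n)
comorb = rng.poisson(2, n)
female = rng.integers(0, 2, n)

# True propensity score: one number summarizing all the covariates.
logit = -8 + 0.09 * age + 0.3 * comorb + 0.2 * female
e = 1 / (1 + np.exp(-logit))
T = rng.random(n) < e

Y = 1.0 * T + 0.05 * age + 0.5 * comorb + 0.3 * female + rng.normal(0, 1, n)

naive = Y[T].mean() - Y[~T].mean()  # confounded comparison

# Stratify on quintiles of the score instead of matching exactly
# on age, comorbidity, and gender simultaneously.
stratum = np.digitize(e, np.quantile(e, [0.2, 0.4, 0.6, 0.8]))
diffs, weights = [], []
for s in range(5):
    m = stratum == s
    diffs.append(Y[m & T].mean() - Y[m & ~T].mean())
    weights.append(m.sum())
ps_est = np.average(diffs, weights=weights)

print(f"naive difference:        {naive:.2f}")
print(f"PS-stratified estimate:  {ps_est:.2f}")  # near the true effect, 1.0
```

    The naive contrast is badly confounded; comparing within score quintiles recovers something close to the true effect of 1.0, even though no exact matches on all three covariates were needed.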

    Propensity scores are a useful and common technique in analysis of observational data. They are, unfortunately, sometimes misunderstood as a way to address more types of confounding than they are capable of addressing. In particular, they can only address confounding from observable factors (“measured” ones, in the above quote). If there’s an unobservable difference between treatment and control groups that affects the outcome (e.g., genetic variation about which researchers have no data), propensity scores cannot help.

    It is important to keep in mind that propensity scores cannot adjust for unobserved differences between groups.

    Only an RCT (or, with additional assumptions, natural experiments and instrumental variables approaches) can address confounding due to unobservable factors. I will return to this issue.

    I’m deliberately not covering implementation issues and approaches in these methods posts, just intuition, appropriate use, and issues of interpretation. If you want more information on propensity scores, read the paper from which I quoted or search the technical literature. Comments open for one week for feedback on propensity scores or pointers to other good methods papers.


    Comments closed
  • Methods: Intention-to-treat

    In JAMA, Michelle Detry and Roger Lewis explain the “intention-to-treat” (ITT) principle:

    [I]n a trial in which patients are randomized to receive either treatment A or treatment B, a patient may be randomized to receive treatment A but erroneously receive treatment B, or never receive any treatment, or not adhere to treatment A. In all of these situations, the patient would be included in group A when comparing treatment outcomes using an ITT analysis. Eliminating study participants who were randomized but not treated or moving participants between treatment groups according to the treatment they received would violate the ITT principle.

    Why do this?

    The effectiveness of a therapy is not simply determined by its pure biological effect but is also influenced by the physician’s ability to administer, or the patient’s ability to adhere to, the intended treatment. The true effect of selecting a treatment is a combination of biological effects, variations in compliance or adherence, and other patient characteristics that influence efficacy. Only by retaining all patients intended to receive a given treatment in their original treatment group can researchers and clinicians obtain an unbiased estimate of the effect of selecting one treatment over another.

    Treatment adherence often depends on many patient and clinician factors that may not be anticipated or are impossible to measure and that influence response to treatment.

    Why not do this?

    [1] Noninferiority trials, which are designed to demonstrate that an experimental treatment is no worse than an established one, require special considerations. [...] The intervention in group A may incorrectly appear noninferior to the intervention in group B, simply as a result of nonadherence rather than because of similar biological efficacy. [...]

    [2] Although the ITT principle is important for estimating the efficacy of treatments, it should not be applied in the same way in assessing the safety (eg, medication adverse effects) of interventions. [...]

    [3] [I]t would be unfortunate to falsely conclude, based on the ITT analysis of a phase 2 clinical trial, that a novel pharmaceutical agent is not effective when, in fact, the lack of efficacy stems from too high a dose and patients’ inability to be adherent because of intolerable adverse effects. In that case, a lower dose may yield clinically important efficacy and a tolerable adverse effect profile.

    In these cases, one may be more interested in an estimate of the effect of treatment-on-the-treated (TOT), or a per-protocol analysis.
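    A small simulation illustrates the contrast. (All numbers are invented; this is not from the JAMA piece.) Sicker patients assigned to treatment are less likely to adhere, so a crude per-protocol-style contrast, adherent treated patients versus all controls, overstates the effect; ITT preserves the balance that randomization created, at the price of estimating the effect of *assignment*, diluted by nonadherence.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

severity = rng.normal(0, 1, n)
assigned = rng.integers(0, 2, n).astype(bool)  # randomized arm

# Sicker patients are less likely to tolerate / adhere to treatment.
adheres = rng.random(n) < 1 / (1 + np.exp(severity))
took = assigned & adheres

# The drug improves the outcome by 1.0 only if actually taken.
Y = -severity + 1.0 * took + rng.normal(0, 1, n)

# ITT: analyze everyone in the arm they were randomized to.
itt = Y[assigned].mean() - Y[~assigned].mean()

# Crude per-protocol-style contrast: adherent treated vs. all controls.
per_protocol = Y[took].mean() - Y[~assigned].mean()

print(f"ITT estimate:          {itt:.2f}")  # ~ effect x adherence rate
print(f"per-protocol estimate: {per_protocol:.2f}")  # biased upward
```

    Here adherence runs about 50 percent, so ITT comes in near 0.5, the unbiased effect of *selecting* the treatment, while the per-protocol contrast exceeds even the true biological effect of 1.0, because the adherent treated patients are healthier than the controls they’re compared with.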

    If you’re aware of good papers that explain the use and interpretation of common research methods, let me know in the comments, which are open for one week after this post’s time stamp, or by email or Twitter.


    Comments closed
  • Population distribution of the US in units of Canadas

    Via Stephen’s Lighthouse:

    [Image: population distribution of the US in units of Canadas]


    Comments closed