• Methods: Propensity scores

    Forthcoming in Health Services Research (and available now via Early View), Melissa Garrido and colleagues explain propensity scores. I’ve added a bit of emphasis on a key point.

    Propensity score analysis is a useful tool to account for imbalance in covariates between treated and comparison groups. A propensity score is a single score that represents the probability of receiving a treatment, conditional on a set of observed covariates. […]

    Propensity scores are useful when estimating a treatment’s effect on an outcome using observational data and when selection bias due to nonrandom treatment assignment is likely. The classic experimental design for estimating treatment effects is a randomized controlled trial (RCT), where random assignment to treatment balances individuals’ observed and unobserved characteristics across treatment and control groups. Because only one treatment state can be observed at a time for each individual, control individuals that are similar to treated individuals in everything but treatment receipt are used as proxies for the counterfactual. In observational data, however, treatment assignment is not random. This leads to selection bias, where measured and unmeasured characteristics of individuals are associated with likelihood of receiving treatment and with the outcome. Propensity scores provide a way to balance measured covariates across treatment and comparison groups and better approximate the counterfactual for treated individuals.

    Propensity scores can be thought of as an advanced matching technique. For instance, if one were concerned that age might affect both treatment selection and outcome, one strategy would be to compare individuals of similar age in both treatment and comparison groups. As variables are added to the matching process, however, it becomes more and more difficult to find exact matches for individuals (i.e., it is unlikely to find individuals in both the treatment and comparison groups with identical gender, age, race, comorbidity level, and insurance status). Propensity scores solve this dimensionality problem by compressing the relevant factors into a single score. Individuals with similar propensity scores are then compared across treatment and comparison groups.
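
    To make the idea of compressing covariates into a single score concrete, here is a minimal Python sketch. It is not from the paper and is for intuition only. The simulated data, the coefficient values, the helper names (`simulate`, `fit_logistic`, `propensity`, `match_nearest`), and the caliper of 0.05 are all assumptions made for illustration. It fits a logistic regression of treatment on two observed covariates, uses the fitted probability as the propensity score, and then pairs each treated individual with the nearest unmatched comparison individual.

```python
import math
import random

def simulate(n=300, seed=0):
    """Hypothetical data: two observed covariates drive both treatment and outcome."""
    rng = random.Random(seed)
    X, t, y = [], [], []
    for _ in range(n):
        age = rng.gauss(0.0, 1.0)                      # standardized age
        comorbid = 1.0 if rng.random() < 0.4 else 0.0  # comorbidity indicator
        p_treat = 1.0 / (1.0 + math.exp(-(-0.3 + 0.8 * age + 0.6 * comorbid)))
        ti = 1 if rng.random() < p_treat else 0
        # The treatment effect is set to 2.0 by construction (an assumption).
        yi = 2.0 * ti + 1.5 * age + 1.0 * comorbid + rng.gauss(0.0, 1.0)
        X.append([age, comorbid])
        t.append(ti)
        y.append(yi)
    return X, t, y

def propensity(w, xi):
    """Fitted probability of treatment given covariates xi."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, t, lr=0.5, epochs=500):
    """Logistic regression by batch gradient ascent; returns [intercept, coefs...]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            err = ti - propensity(w, xi)
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def match_nearest(scores, t, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the score, without replacement."""
    treated = [i for i, ti in enumerate(t) if ti == 1]
    available = {i for i, ti in enumerate(t) if ti == 0}
    pairs = []
    for i in treated:
        if not available:
            break
        j = min(available, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:  # drop treated units with no close match
            pairs.append((i, j))
            available.remove(j)
    return pairs

X, t, y = simulate()
w = fit_logistic(X, t)
scores = [propensity(w, xi) for xi in X]
pairs = match_nearest(scores, t)
att = sum(y[i] - y[j] for i, j in pairs) / len(pairs)
print(f"matched pairs: {len(pairs)}, matched difference in outcomes: {att:.2f}")
```

    Greedy matching with a caliper is only one of several ways to use the score. Note that the sketch balances only the two covariates that enter the model; anything left out of the model is untouched.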

    Propensity scores are a useful and common technique in the analysis of observational data. They are, unfortunately, sometimes misunderstood as a way to address more types of confounding than they are capable of addressing. In particular, they can only address confounding from observable factors (“measured” ones, in the above quote). If there’s an unobservable difference between treatment and control groups that affects the outcome (e.g., genetic variation about which researchers have no data), propensity scores cannot help.

    It is important to keep in mind that propensity scores cannot adjust for unobserved differences between groups.

    Only an RCT or, with assumptions, natural experiments and instrumental variables approaches can address confounding due to unobservable factors. I will return to this issue.

    I’m deliberately not covering implementation issues and approaches in these methods posts, just intuition, appropriate use, and issues of interpretation. If you want more information on propensity scores, read the paper from which I quoted or search the technical literature. Comments open for one week for feedback on propensity scores or pointers to other good methods papers.


    • The alternative methods for investigating causality continue to grow: propensity scores, doubly robust regression, even Heckman regression, instrumental variables, marginal structural models, SEM (sorta). A lot of Pearl’s modern work should be mentioned. There are also methods that are more design-based than statistical, such as natural experiments and regression discontinuity designs.

      Even with all these methods OR using an RCT, true causality (for me) is only established through replication.

      When it comes to causal inference research, I always take the approach that these methods are providing evidence for causality, and never proving or showing causality itself.

      It’s like calling an instrument “valid,” which is not really accurate: there may be some validity evidence for a particular interpretation of a score, but no instrument is ever simply valid or not.

    • For a nice overview of matching methods, I like this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943670/

    • Thank you for mentioning our article! I second Ryan’s mention of Elizabeth Stuart’s clear explanation of matching and propensity score methods.

      Another useful piece that covers the theory of propensity scores and matching is:

      Imbens, G.W. 2004. “Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review.” Review of Economics and Statistics 86 (1): 4–29.

      • Questions for you and Ryan about either of the papers you recommended:

        For a TIE treatment, I’m looking for a method or concept that has an aspect or interpretation that’s hard to understand or is frequently misunderstood. Then, I’m looking for a paper that illuminates just that, in fairly simple terms. Does either of the papers you suggest qualify? What’s challenging about the concept of matching or the interpretation of results based on matching?

        • I think Stuart’s paper in Stat Sci (linked in Ryan’s post) might qualify. She details several misunderstandings surrounding matching, including why t-tests are not an ideal measure for assessing covariate balance and the importance of understanding the region of common support when generalizing treatment effect estimates.
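
          As a rough illustration of the t-test point (a sketch, not code from Stuart’s paper): a t-statistic depends on sample size, so it can shrink simply because observations were dropped, even when the groups are no more similar than before. A standardized mean difference does not have that property. The ages and the helper names `smd` and `welch_t` below are made up for the example.

```python
import math
import statistics

def smd(treated, control):
    """Standardized mean difference: mean gap divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

def welch_t(treated, control):
    """Welch t-statistic for the same mean gap; its size grows with sample size."""
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return (statistics.mean(treated) - statistics.mean(control)) / se

# Hypothetical ages: the treated group is 4 years older on average in both samples.
base_treated = [62, 66, 70, 74, 78]
base_control = [58, 62, 66, 70, 74]

# Same imbalance, two sample sizes (20 vs. 200 per group).
small_t, small_c = base_treated * 4, base_control * 4
large_t, large_c = base_treated * 40, base_control * 40

for label, tr, co in [("n=20 per group", small_t, small_c),
                      ("n=200 per group", large_t, large_c)]:
    print(f"{label}: SMD = {smd(tr, co):.2f}, t = {welch_t(tr, co):.2f}")
```

          The standardized difference is nearly the same in both samples, while the t-statistic is several times larger in the bigger one, which is why a “non-significant” t-test after matching is weak evidence of balance.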

    • It’s been a while, but I believe one of the issues with using propensity scores for matching is that subjects are lost from both the exposed and unexposed groups when no match can be found for the propensity score outliers. This could affect the generalizability of the study. Stratified analyses using propensity scores may result in better generalizability with sufficient control for observed confounders.

      • You’re right that overly restrictive matches can reduce generalizability. That being said, some observations may need to be dropped – if there’s not an appropriate match in the comparison group for a treated individual, there’s not a good counterfactual from which to estimate a treatment effect.

        The difficulty with stratification or subclassification by propensity scores is that the optimal number of strata to reduce selection bias (the goal of propensity score analysis) is not always clear; it can vary with sample size. Lunceford & Davidian (Statistics in Medicine 2004; 23: 2937-2960) do a nice study of this phenomenon.
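
        For readers who want to see what subclassification looks like in practice, here is a minimal Python sketch. It is not taken from Lunceford & Davidian. The simulated data, the use of the true treatment probability as the score, and the function name `stratified_effect` are assumptions for illustration, and the number of strata is left as a parameter because, as noted above, the best choice is not obvious.

```python
import math
import random

def stratified_effect(scores, t, y, n_strata=5):
    """Size-weighted average of within-stratum treated-minus-comparison mean gaps."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    total, used = 0.0, 0
    for s in range(n_strata):
        # Equal-sized strata formed from quantiles of the score.
        idx = order[s * n // n_strata:(s + 1) * n // n_strata]
        y1 = [y[i] for i in idx if t[i] == 1]
        y0 = [y[i] for i in idx if t[i] == 0]
        if not y1 or not y0:
            continue  # no comparison is possible in this stratum
        total += len(idx) * (sum(y1) / len(y1) - sum(y0) / len(y0))
        used += len(idx)
    if used == 0:
        raise ValueError("no stratum contains both treated and comparison units")
    return total / used

# Hypothetical data: one observed confounder x; effect set to 1.0 by construction.
rng = random.Random(1)
scores, t, y = [], [], []
for _ in range(2000):
    x = rng.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-x))  # true propensity, used directly as the score
    ti = 1 if rng.random() < p else 0
    scores.append(p)
    t.append(ti)
    y.append(1.0 * ti + 2.0 * x + rng.gauss(0.0, 1.0))

treated_y = [yi for yi, ti in zip(y, t) if ti == 1]
control_y = [yi for yi, ti in zip(y, t) if ti == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)

print(f"unadjusted difference: {naive:.2f}")
for k in (2, 5, 10, 20):
    print(f"{k:>2} strata: {stratified_effect(scores, t, y, n_strata=k):.2f}")
```

        Running it with different values of `n_strata` shows how the estimate moves as strata are added; with smaller samples, finer strata start to run out of treated or comparison units, which is one reason the choice depends on sample size.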