In a paper forthcoming in Health Services Research (and available now via Early View), Melissa Garrido and colleagues explain propensity scores. I’ve added a bit of emphasis on a key point.
Propensity score analysis is a useful tool to account for imbalance in covariates between treated and comparison groups. A propensity score is a single score that represents the probability of receiving a treatment, conditional on a set of observed covariates. […]
Propensity scores are useful when estimating a treatment’s effect on an outcome using observational data and when selection bias due to nonrandom treatment assignment is likely. The classic experimental design for estimating treatment effects is a randomized controlled trial (RCT), where random assignment to treatment balances individuals’ observed and unobserved characteristics across treatment and control groups. Because only one treatment state can be observed at a time for each individual, control individuals that are similar to treated individuals in everything but treatment receipt are used as proxies for the counterfactual. In observational data, however, treatment assignment is not random. This leads to selection bias, where measured and unmeasured characteristics of individuals are associated with likelihood of receiving treatment and with the outcome. Propensity scores provide a way to balance measured covariates across treatment and comparison groups and better approximate the counterfactual for treated individuals.
Propensity scores can be thought of as an advanced matching technique. For instance, if one were concerned that age might affect both treatment selection and outcome, one strategy would be to compare individuals of similar age in both treatment and comparison groups. As variables are added to the matching process, however, it becomes more and more difficult to find exact matches for individuals (i.e., it is unlikely to find individuals in both the treatment and comparison groups with identical gender, age, race, comorbidity level, and insurance status). Propensity scores solve this dimensionality problem by compressing the relevant factors into a single score. Individuals with similar propensity scores are then compared across treatment and comparison groups.
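To make the "single score" idea concrete, here is a toy sketch of my own, not the paper's method. The data are invented and deterministic: two binary covariates (`age` and `sick`, both hypothetical names), a true treatment effect fixed at 1.0, and treatment more common among older and sicker people. For simplicity the score is estimated as the treated fraction within each covariate cell; that is an assumption made for this sketch, standing in for the logistic regression typically used in practice. Note that the cells (older, not sick) and (younger, sick) end up with the same score and are pooled, which is the compression the quote describes.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def make_rows():
    # Invented data: (age, sick) -> (number of controls, number treated).
    cells = {(0, 0): (8, 2), (1, 0): (5, 5), (0, 1): (5, 5), (1, 1): (2, 8)}
    rows = []
    for (age, sick), (n_control, n_treated) in cells.items():
        for treated, n in ((0, n_control), (1, n_treated)):
            for _ in range(n):
                # True treatment effect is 1.0; age and sickness also raise the outcome.
                outcome = 1.0 * treated + 2.0 * age + 3.0 * sick
                rows.append({"age": age, "sick": sick,
                             "treated": treated, "outcome": outcome})
    return rows

def propensity_scores(rows, covariates):
    # Estimate P(treated | covariates) as the treated fraction in each cell.
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_total, n_treated]
    for r in rows:
        cell = tuple(r[c] for c in covariates)
        counts[cell][0] += 1
        counts[cell][1] += r["treated"]
    scores = []
    for r in rows:
        n_total, n_treated = counts[tuple(r[c] for c in covariates)]
        scores.append(round(n_treated / n_total, 6))
    return scores

def naive_effect(rows):
    # Raw difference in mean outcomes, ignoring covariates entirely.
    treated = [r["outcome"] for r in rows if r["treated"] == 1]
    control = [r["outcome"] for r in rows if r["treated"] == 0]
    return mean(treated) - mean(control)

def stratified_effect(rows, scores):
    # Compare treated and comparison individuals who share a propensity score,
    # then average the differences, weighting by the number treated per stratum.
    strata = defaultdict(lambda: {0: [], 1: []})
    for r, s in zip(rows, scores):
        strata[s][r["treated"]].append(r["outcome"])
    total, n_treated = 0.0, 0
    for groups in strata.values():
        if groups[0] and groups[1]:  # need both groups to compare
            total += (mean(groups[1]) - mean(groups[0])) * len(groups[1])
            n_treated += len(groups[1])
    return total / n_treated

rows = make_rows()
scores = propensity_scores(rows, ["age", "sick"])
naive = naive_effect(rows)
adjusted = stratified_effect(rows, scores)
print(naive)     # 2.5, biased upward by age and sickness
print(adjusted)  # 1.0, the true effect built into the toy data
```

The four covariate cells collapse to three score values (0.2, 0.5, 0.8), and comparing within those three strata recovers the built-in effect. Real data are noisy and rarely this tidy, so treat this as intuition only.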
Propensity scores are a useful and common technique in analysis of observational data. They are, unfortunately, sometimes misunderstood as a way to address more types of confounding than they are capable of addressing. In particular, they can only address confounding from observable factors (“measured” ones, in the above quote). If there’s an unobservable difference between treatment and control groups that affects the outcome (e.g., genetic variation about which researchers have no data), propensity scores cannot help.
It is important to keep in mind that propensity scores cannot adjust for unobserved differences between groups.
Only an RCT or, with assumptions, natural experiments and instrumental variables approaches can address confounding due to unobservable factors. I will return to this issue.
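The limitation is easy to see in a toy example. The sketch below is mine, with invented, deterministic data: `x` is a measured covariate, `u` is a hypothetical unmeasured one (think of the genetic variation mentioned above), and the true treatment effect is fixed at 1.0. As a simplifying assumption, the score is estimated as the treated fraction within each covariate cell rather than by a fitted model. Building the score from `x` alone leaves the estimate badly off; only the analysis that could see `u`, which a real researcher by definition cannot run, recovers the truth.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def make_rows():
    # Invented data: (x, u) -> (number of controls, number treated).
    # Treatment is far more common when the hidden factor u equals 1.
    cells = {(0, 0): (8, 2), (0, 1): (2, 8), (1, 0): (4, 6), (1, 1): (1, 9)}
    rows = []
    for (x, u), (n_control, n_treated) in cells.items():
        for treated, n in ((0, n_control), (1, n_treated)):
            for _ in range(n):
                # True treatment effect is 1.0; u shifts the outcome by 4.0.
                outcome = 1.0 * treated + 2.0 * x + 4.0 * u
                rows.append({"x": x, "u": u,
                             "treated": treated, "outcome": outcome})
    return rows

def stratified_effect(rows, covariates):
    # Score = treated fraction within each cell of the listed covariates.
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_total, n_treated]
    for r in rows:
        cell = tuple(r[c] for c in covariates)
        counts[cell][0] += 1
        counts[cell][1] += r["treated"]
    # Group outcomes by score, then compare treated vs. comparison per stratum.
    strata = defaultdict(lambda: {0: [], 1: []})
    for r in rows:
        n_total, n_treated = counts[tuple(r[c] for c in covariates)]
        score = round(n_treated / n_total, 6)
        strata[score][r["treated"]].append(r["outcome"])
    total, n_treated_all = 0.0, 0
    for groups in strata.values():
        if groups[0] and groups[1]:  # need both groups to compare
            total += (mean(groups[1]) - mean(groups[0])) * len(groups[1])
            n_treated_all += len(groups[1])
    return total / n_treated_all

rows = make_rows()
effect_observed_only = stratified_effect(rows, ["x"])     # what a researcher can do
effect_with_hidden = stratified_effect(rows, ["x", "u"])  # impossible in practice
print(round(effect_observed_only, 2))  # 2.92, still confounded by u
print(round(effect_with_hidden, 2))    # 1.0, the true effect built into the toy data
```

Nothing in the score built from `x` signals that something is missing: the measured covariate is balanced within each stratum, yet the answer is still wrong.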
I’m deliberately not covering implementation issues and approaches in these methods posts, just intuition, appropriate use, and issues of interpretation. If you want more information on propensity scores, read the paper from which I quoted or search the technical literature. Comments open for one week for feedback on propensity scores or pointers to other good methods papers.