Below are my highlights from Donald Rubin’s “For Objective Causal Inference, Design Trumps Analysis.” All of the following are quotes from the paper, with my emphasis added.
- All statistical studies for causal effects are seeking the same type of answer, and real world randomized experiments and comparative observational studies do not form a dichotomy, but rather are on a continuum, from well-suited for drawing causal inferences to poorly suited. For example, a randomized experiment with medical patients in which 90% of them do not comply with their assignments and there are many unintended missing values due to patient dropout is quite possibly less likely to lead to correct inferences for causal inferences than a carefully conducted observational study with similar patients, with many covariates recorded that are relevant to well-understood reasons for the assignment of treatment versus control conditions, and with no unintended missing values.
- The first part of the RCM [Rubin Causal Model] is conceptual, and it defines causal effects as comparisons of “potential outcomes” under different treatment conditions on a common set of units. […] The second part concerns the explicit consideration of an “assignment mechanism.” The assignment mechanism describes the process that led to some units being exposed to the treatment condition and other units being exposed to the control condition. The careful description and implementation of these two “design” steps is absolutely essential for drawing objective inferences for causal effects in practice, whether in randomized experiments or observational studies, yet the steps are often effectively ignored in observational studies relative to details of the methods of analysis for causal effects. One of the reasons for this misplaced emphasis may be that the importance of design in practice is often difficult to convey in the context of technical statistical articles, and, as is common in many academic fields, technical dexterity can be more valued than practical wisdom.
- A crucial idea when trying to estimate causal effects from an observational dataset is to conceptualize the observational dataset as having arisen from a complex randomized experiment, where the rules used to assign the treatment conditions have been lost and must be reconstructed.
- Running regression programs is no substitute for careful thinking, and providing tables summarizing computer output is no substitute for precise writing and careful interpretation.
- The next step is to think very carefully about why some units (e.g., medical patients) received the active treatment condition (e.g., surgery) versus the control treatment condition (e.g., no surgery): Who were the decision makers and what rules did they use? […] In common practice with observational data, however, this step is ignored, and replaced by descriptions of the regression programs used, which is entirely inadequate. What is needed is a description of critical information in the hypothetical randomized experiment and how it corresponds to the observed data.
- It is remarkable to me that so many published observational studies are totally silent on how the authors think that treatment conditions were assigned, yet this is the single most crucial feature that makes their observational studies inferior to randomized experiments.
- No amount of fancy analysis can salvage an inadequate data base unless there is substantial scientific knowledge to support heroic assumptions. This is a lesson that many researchers seem to have difficulty learning. Often the dataset being used is so obviously deficient with respect to key covariates that it seems as if the researcher was committed to using that dataset no matter how deficient.
There’s a lot of wisdom in this paper.
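To make the potential-outcomes and assignment-mechanism ideas in the quotes above concrete, here is a minimal simulation sketch of my own. It is not from the paper, and every name and number in it is an assumption chosen for illustration: a binary `severe` covariate, a constant treatment effect of 2.0, and a hypothetical rule under which severe patients are treated 80% of the time and others 20% of the time. Each simulated unit carries both potential outcomes, so the same units can be compared under a randomized assignment and under the covariate-driven one.

```python
import random
from statistics import mean


def make_units(n, true_effect, rng):
    """Create units as (severe, y0, y1): a covariate plus both potential outcomes."""
    units = []
    for _ in range(n):
        severe = rng.random() < 0.5
        # Severe patients have worse outcomes regardless of treatment (assumed).
        y0 = (5.0 if severe else 10.0) + rng.gauss(0.0, 1.0)
        y1 = y0 + true_effect  # constant effect, an illustrative simplification
        units.append((severe, y0, y1))
    return units


def assign_randomized(units, rng):
    """Assignment mechanism 1: a fair coin flip, independent of everything."""
    return [rng.random() < 0.5 for _ in units]


def assign_by_severity(units, rng):
    """Assignment mechanism 2 (hypothetical rule): severe patients are treated more often."""
    return [rng.random() < (0.8 if severe else 0.2) for severe, _, _ in units]


def diff_in_means(units, treated):
    """Naive estimate: mean observed outcome of treated minus mean of controls."""
    treated_y = [y1 for (_, _, y1), w in zip(units, treated) if w]
    control_y = [y0 for (_, y0, _), w in zip(units, treated) if not w]
    return mean(treated_y) - mean(control_y)


def stratified_estimate(units, treated):
    """Compare within each level of the covariate, then weight by stratum size."""
    total = 0.0
    for level in (False, True):
        pairs = [(u, w) for u, w in zip(units, treated) if u[0] == level]
        stratum_units = [u for u, _ in pairs]
        stratum_treated = [w for _, w in pairs]
        weight = len(pairs) / len(units)
        total += weight * diff_in_means(stratum_units, stratum_treated)
    return total


rng = random.Random(0)
TRUE_EFFECT = 2.0
units = make_units(20000, TRUE_EFFECT, rng)

randomized_est = diff_in_means(units, assign_randomized(units, rng))
observational = assign_by_severity(units, rng)
naive_est = diff_in_means(units, observational)
adjusted_est = stratified_estimate(units, observational)

print(f"true effect:                {TRUE_EFFECT:.2f}")
print(f"randomized, naive:          {randomized_est:.2f}")
print(f"by severity, naive:         {naive_est:.2f}")
print(f"by severity, stratified:    {adjusted_est:.2f}")
```

Under these assumptions, the naive comparison on the severity-driven assignment should land far from the true effect, because treated patients were sicker to begin with. Stratifying on `severe` only recovers the effect because the sketch knows the assignment rule and has the covariate that drove it recorded, which is exactly the design information the quotes say observational studies tend to leave out.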