The paper by Thomas Cook, William Shadish, and Vivian Wong, “Three Conditions under Which Experiments and Observational Studies Produce Comparable Causal Estimates: New Findings from Within-Study Comparisons,” makes some good points. Below I quote from their paper, referencing some of my prior posts that express similar sentiments.
At least in some disciplines, randomized designs have a “privileged role,” supported by education and the research establishment.
The randomized experiment reigns supreme, institutionally supported through its privileged role in graduate training, research funding, and academic publishing. However, the debate is not closed in all areas of economics, sociology, and political science or in interdisciplinary fields that look to them for methodological advice, such as public policy. […] Alternatives to the experiment will always be needed, and a key issue is to identify which kinds of observational studies are most likely to generate unbiased results. We use the within-study comparison literature for that purpose.
We should not expect results from observational studies with strong designs for causal inference to match those from experimental approaches in all cases.
But the procedure used in these early studies contrasts the causal estimate from a locally conducted experiment with the causal estimate from an observational study whose comparison data come from national datasets. Thus, the two counterfactual groups differ in more than whether they were formed at random or not; they also differ in where respondents lived, when and how they were tested, and even in the actual outcome measures. […] The aspiration is to create an experiment and an observational study that are identical in everything except for how the control and comparison groups were formed. […] We should not confound how comparison groups are formed with differences in estimators.
Past within-study comparisons from job training have been widely interpreted as indicating that observational studies fail to reproduce the results of experiments. Of the 12 recent within-study comparisons reviewed here from 10 different research projects, only two dealt with job training. Yet eight of the comparisons produced observational study results that are reasonably close to those of their yoked experiment, and two obtained a close correspondence in some analyses but not others. Only two studies claimed different findings in the experiment and observational study, each involving a particularly weak observational study. Taken as a whole, then, the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within-study comparison literature.
RCTs are simple to explain, but that’s just one criterion and not the most important one.
[Observational methods] do not undermine the superiority of random assignment studies where they are feasible. Th[ose] are better than any alternative considered here if the only criterion for judging studies is the clarity of causal inference. But if other criteria are invoked, the situation becomes murkier. The current paper reduces the extent to which random assignment experiments are superior to certain classes of quasi-experiments, though not necessarily to all types of quasi-experiments or nonexperiments. Thus, if a feasible quasi-experiment were superior in, say, the persons, settings, or times targeted, then this might argue for conducting a quasi-experiment over an experiment, deliberately trading off a small degree of freedom from bias against some estimated improvement in generalization.
But we should be concerned about accepting bad designs because they either (1) are simple or (2) have shown themselves to match RCTs in a different setting. We need to evaluate each design in the context of the particular questions being asked in each study.
For policymakers in research-sponsoring institutions that currently prefer random assignment, this is a concession that might open up the floodgates to low-quality causal research if the carefully circumscribed types of quasi-experiments investigated here were overgeneralized to include all quasi-experiments or nonexperiments. Researchers might then believe that “quasi-experiments are as good as experiments” and propose causal studies that are unnecessarily weak. But that is not what the current paper has demonstrated. Such a consequence is neither theoretically nor empirically true but could be a consequence of overgeneralizing this paper.
Even those of us who argue over these points can probably agree on this:
We suspect that few methodologically sophisticated scholars will quibble with […] the notion that understanding, validating, and measuring the selection process will substantially reduce the bias associated with populations that are demonstrably nonequivalent at pretest.
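To make that last point concrete, here is a minimal simulation sketch (my own illustration, not from the paper, with all parameter values assumed): a confounder, "motivation," drives both self-selection into a program and the outcome. The naive observational contrast is biased upward, while randomization, or stratifying on the measured selection variable, recovers the true effect.

```python
import random

random.seed(1)

def simulate(n=100_000):
    """Toy job-training-style example (all numbers assumed for illustration).

    A binary confounder 'motivated' raises the outcome by 3 and makes
    program take-up far more likely; the true treatment effect is 2.
    """
    truth = 2.0
    rct_treat, rct_ctrl = [], []
    obs_treat, obs_ctrl = [], []  # naive observational groups
    # Observational data stratified by the (here, measured) confounder
    strata = {0: {"t": [], "c": []}, 1: {"t": [], "c": []}}

    for _ in range(n):
        motivated = random.random() < 0.5
        baseline = 5.0 + 3.0 * motivated + random.gauss(0, 1)

        # Experiment: treatment assigned by coin flip, independent of motivation
        if random.random() < 0.5:
            rct_treat.append(baseline + truth)
        else:
            rct_ctrl.append(baseline)

        # Observational study: motivated people self-select into treatment
        p_take = 0.8 if motivated else 0.2
        treated = random.random() < p_take
        y = baseline + truth * treated
        (obs_treat if treated else obs_ctrl).append(y)
        strata[motivated]["t" if treated else "c"].append(y)

    mean = lambda xs: sum(xs) / len(xs)
    rct_est = mean(rct_treat) - mean(rct_ctrl)
    naive_est = mean(obs_treat) - mean(obs_ctrl)
    # Adjusted estimate: within-stratum contrasts, averaged over strata
    adj_est = 0.5 * sum(
        mean(strata[m]["t"]) - mean(strata[m]["c"]) for m in (0, 1)
    )
    return rct_est, naive_est, adj_est
```

With these numbers the RCT estimate sits near the true effect of 2, the naive observational contrast lands near 3.8 (the true effect plus 3 × the 0.6 gap in the share of motivated people between groups), and the stratified estimate comes back to roughly 2 — but only because the selection variable was measured. If "motivation" were unobserved, no amount of adjustment on the remaining data would remove the bias, which is the paper's point about needing to understand and measure the selection process.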
Clearly I have not told you much about their study or findings. You’ll have to read the paper for that.