# Methods: Flavors of validity plus a ton of bonus content

Before I get to the main subject of this post, I want to encourage you to read in full the papers about the frameworks and methods of Campbell and Rubin in this 2010 issue of Psychological Methods. (If you only have time to read one, I recommend the one by Stephen West and Felix Thoemmes.) The papers cover a wide range of issues pertaining to causal inference in experimental and observational study designs. To my eye, they do so very well and with almost no math. (I illustrate the style of math they use below.)

Campbell’s and Rubin’s frameworks have a number of similarities and differences; the papers emphasize a few of the differences:

• Campbell put greater emphasis on employing study design to mitigate threats to validity (about which more below). Rubin emphasized statistical methods to remedy defects that threaten validity.
• Campbell’s framework focused more on the direction of causal effects. Rubin was concerned with their magnitude as well.

Now, about “validity” and its various types, Stephen West and Felix Thoemmes wrote,

We designate X as an indicator of treatment (e.g., 1 = Treatment [T]; 0 = Control [C]) and Y as the outcome (dependent) variable. The central concern of internal validity is whether the relationship between the treatment and the outcome is causal in the population under study. Does the manipulation of X produce change in Y? Or, does some other influence produce change in Y? Note that internal validity does not address the specific aspect(s) of the treatment that produce the change nor the specific aspect(s) of the outcome in which the change is taking place—nor does it address whether the treatment effect would hold in a different setting, with a different population, or at a different time. These issues are questions of construct validity and external validity, respectively.

Granted, that was a bit rushed, so here’s William Shadish’s take on flavors of validity:

1. Statistical conclusion validity: The validity of inferences about the correlation (covariation) between treatment and outcome.

2. Internal validity: The validity of inferences about whether observed covariation between A (the presumed treatment) and B (the presumed outcome) reflects a causal relationship from A to B, as those variables were manipulated or measured.

3. Construct validity: The validity with which inferences can be made from the operations in a study to the theoretical constructs those operations are intended to represent.

4. External validity: The validity of inferences about whether the cause–effect relationship holds over variation in persons, settings, treatment variables, and measurement variables.

Originally, Campbell (1957) presented 8 threats to internal validity and 4 threats to external validity. The lists proliferated, although they do seem to be reaching an asymptote: Cook and Campbell (1979) had 33 threats, and Shadish et al. (2002) had 37.

That’s about all I care to post/quote about validity. (As with all my methods posts, you should read the papers or a textbook for details.) Now for some bonus, though related, material from two papers in that Psychological Methods issue.

Stephen West and Felix Thoemmes conveyed the setup of Rubin’s causal model as follows:

Formally, each participant’s causal effect, the individual treatment effect, is defined as $Y_T(u) - Y_C(u)$, where $Y_T(u)$ represents the response Y of unit u to treatment T, and $Y_C(u)$ represents the response of unit u to treatment [or control] C. Comparison of these two outcomes provides the ideal design for causal inference. […] Unfortunately, this design is a Platonic ideal that can never be achieved in practice.

Why? Because for any individual unit, u, we never observe the outcomes of both treatment arms, T and C, under precisely identical conditions. We observe, at most, one. The other (or some estimate of it) must be inferred by other means. This is the entire problem of causal inference.
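A toy table makes the point concrete. The four units and their outcomes below are invented purely for illustration, and the `Unit` class and its fields are hypothetical names, not anything from the papers. In a simulation we can write down both potential outcomes, so the individual effect $Y_T(u) - Y_C(u)$ is computable; what a real study would record is only the outcome matching each unit's actual arm.

```python
# Toy potential-outcomes table (hypothetical numbers, for illustration only).
# Each unit u has two potential outcomes, y_t (under T) and y_c (under C),
# but a study only records the one matching the arm u actually received.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    name: str
    y_t: float     # Y_T(u): outcome if u receives treatment T
    y_c: float     # Y_C(u): outcome if u receives control C
    treated: bool  # which arm u actually received

    def individual_effect(self) -> float:
        """Y_T(u) - Y_C(u): knowable only in the 'Platonic ideal'."""
        return self.y_t - self.y_c

    def observed(self) -> tuple[Optional[float], Optional[float]]:
        """What the analyst sees as (Y_T, Y_C): one outcome, the other missing (None)."""
        return (self.y_t, None) if self.treated else (None, self.y_c)


units = [
    Unit("u1", y_t=7.0, y_c=5.0, treated=True),
    Unit("u2", y_t=6.0, y_c=6.0, treated=False),
    Unit("u3", y_t=9.0, y_c=4.0, treated=True),
    Unit("u4", y_t=3.0, y_c=4.0, treated=False),
]

for u in units:
    print(u.name, "true effect:", u.individual_effect(), "observed (Y_T, Y_C):", u.observed())
```

Every row of the observed column has exactly one hole, which is the same missing-data framing the quoted passages arrive at.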

The model makes it clear that we can observe two sets of participants: (a) Group A given T and (b) Group B given C. A and B may be actual pre-existing groups (e.g., two communities) or they may be sets of participants who have selected or have been assigned to receive the T and C conditions, respectively. Of key importance, we also need to conceptualize the potential outcomes in two hypothetical groups: (c) Group A given C and (d) Group B given T. Imagine that we would like to compare the mean outcome of the two treatments. Statistically, in terms of the ideal design what we would ideally like to have is an estimate of either $\mu_T(A) - \mu_C(A)$ or $\mu_T(B) - \mu_C(B)$ [actually, ideally, both] where A and B designate the group[s] to which the treatment [and control, respectively] was given. Both Equations [] represent average causal effects. Of importance, note that [they] may not represent the same average causal effect; Groups A and B may represent different populations. […]

[W]hat we would like to estimate is a weighted combination $\lambda[\mu_T(A) - \mu_C(A)] + (1 - \lambda)[\mu_T(B) - \mu_C(B)]$, where […] $\lambda$ is the proportion of the population that is in the treatment group. […]

What we have [from study data] in fact is the estimate of $\mu_T(A) - \mu_C(B)$. […]

For observed outcomes, only half of the data we would ideally like to have can be observed; the other half of the data is missing. This insight allows us to conceptualize the potential outcomes as a missing data problem and focuses attention on the process of assignment of participants to groups as a key factor in understanding problems of selection.
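To see why the assignment process matters so much, here is a small simulation sketch of my own (not from the papers). All numbers are made up: a hypothetical latent trait `h` raises both the untreated outcome and the benefit of treatment. Because it is a simulation, both potential outcomes are known for every unit, so we can compute what a real study cannot: $\mu_T(A) - \mu_C(A)$ (labeled ATT, effect on the treated), $\mu_T(B) - \mu_C(B)$ (labeled ATC, effect on the controls), and their $\lambda$-weighted combination (ATE). These labels are conventional shorthand, not the authors'. We then compare them with the contrast a study actually delivers, $\mu_T(A) - \mu_C(B)$, under two assignment rules.

```python
import math
import random

N = 100_000  # units per simulated population (arbitrary)


def avg(xs) -> float:
    """Mean computed with math.fsum for numerical accuracy."""
    xs = list(xs)
    return math.fsum(xs) / len(xs)


def simulate(self_select: bool, seed: int) -> dict[str, float]:
    """Simulate both potential outcomes for N units, then 'observe' only one per unit."""
    rng = random.Random(seed)
    units = []
    for _ in range(N):
        h = rng.gauss(0.0, 1.0)                     # hypothetical latent trait, e.g. baseline health
        y_c = 10.0 + 3.0 * h + rng.gauss(0.0, 1.0)  # Y_C(u)
        y_t = y_c + 2.0 + h                         # Y_T(u); individual effects vary with h
        if self_select:
            p = 1.0 / (1.0 + math.exp(-2.0 * h))    # high-h units opt into T more often
        else:
            p = 0.5                                 # coin flip, unrelated to h, y_t, or y_c
        units.append((y_t, y_c, rng.random() < p))

    A = [(yt, yc) for yt, yc, t in units if t]      # group A: received T
    B = [(yt, yc) for yt, yc, t in units if not t]  # group B: received C
    lam = len(A) / N                                # lambda: share of the population treated
    att = avg(yt - yc for yt, yc in A)              # mu_T(A) - mu_C(A); uses unobservable mu_C(A)
    atc = avg(yt - yc for yt, yc in B)              # mu_T(B) - mu_C(B); uses unobservable mu_T(B)
    return {
        "lambda": lam,
        "ATT": att,
        "ATC": atc,
        "ATE": lam * att + (1 - lam) * atc,         # the weighted combination from the quote
        "naive": avg(yt for yt, _ in A) - avg(yc for _, yc in B),  # mu_T(A) - mu_C(B): what a study sees
    }


for label, flag in [("self-selected", True), ("randomized", False)]:
    r = simulate(flag, seed=1)
    print(label, {k: round(v, 2) for k, v in r.items()})
```

Under self-selection, the observed contrast overshoots every one of the causal quantities. Under coin-flip assignment it lands close to the $\lambda$-weighted average effect, because which outcome goes missing no longer depends on the outcomes themselves.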

Basically, the entire enterprise of causal inference is to design and employ methods that better estimate (in the sense of minimizing the threats to validity defined above) the unobserved counterfactual means $\mu_C(A)$ and/or $\mu_T(B)$ or, what amounts to the same thing, how far they differ from the observed means $\mu_C(B)$ and $\mu_T(A)$.
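One standard way to see what is at stake (my addition, not a quotation from the papers) is to add and subtract the unobserved mean $\mu_C(A)$ in the contrast a study actually delivers:

```latex
\mu_T(A) - \mu_C(B)
  = \underbrace{[\mu_T(A) - \mu_C(A)]}_{\text{causal effect in group } A}
  + \underbrace{[\mu_C(A) - \mu_C(B)]}_{\text{selection bias}}
```

The second bracket vanishes only if group A, left untreated, would have fared on average like group B. Random assignment makes that true in expectation; otherwise, study design or statistical adjustment has to make the case for it.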

I found the following fascinating:

The researcher will need to conceptualize carefully the alternative treatment that the individual could potentially receive (i.e., compared with what?). Of importance [] this definition makes it difficult to investigate the causal effects of individual difference variables because we must be able to at least conceptualize the individual difference (e.g., gender) as two alternative treatments. If we cannot do this, Rubin (1986) considers the problem ill defined.

Wow! So, for example, Rubin (1986) would consider the problem of the causal effect of gender on, say, wages “ill defined” because there is no conceivable possibility of a male being female or vice versa in the same sense in which someone can take a drug or not. I probably don’t need to point out that the causal effect of gender on wages is a significant research question of considerable cultural, policy, and political import at the moment. What exactly it means for it to be “ill defined” I don’t know, though I could speculate. But I’ve downloaded Rubin (1986) and one day I may read it and find out.

Here are a couple of passages I highlighted in the paper by William Shadish:

[Campbell] is skeptical of the results of any single study, encouraging programs of research in which individual studies are imbued with different theoretical biases and, more important, inviting criticisms of studies by outside opponents who are often best situated to find the most compelling alternative explanations.

Endorse! Also,

The regression discontinuity design [] was invented in the 1950s by Campbell (Thistlethwaite & Campbell, 1960), but a statistical proof of its unbiased estimate was provided by Rubin (1977) in the 1970s (an earlier unpublished proof was provided by Goldberger, 1972).

This I did not know. I may post more on the content of a few other papers in the collection. I’m still working my way through them.

@afrakt