It is well known that correlation does not prove causation. What is less well known is that causation can exist when correlation is zero. The upshot of these two facts is that, in general and without additional information, correlation reveals literally nothing about causation. It is neither necessary nor sufficient for it.
Correlation without causation. My favorite hypothetical example of this is a study of thousands of middle and high school kids. The poorly informed investigators measure shoe size and reading comprehension scores. They find that the two are positively correlated. Their manuscript claiming that larger feet cause better reading skills is rejected, of course. Foot size does not cause better reading skills despite the correlation of the two.
Two elements are missing from this study. One is the measurement of age, which is related to both foot size and reading comprehension. The other missing element is a conceptual or theoretical model that provides a basis for causal interpretations of the relationships between age and foot size and between age and reading comprehension. Getting older is correlated with both and we say it is the cause of both because we have a plausible conceptual model of human development that is consistent with such an interpretation.
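To make the role of the missing age variable concrete, here is a minimal simulation sketch. All of the numbers (the age range, the slopes, the noise levels) are invented for illustration and are not from any real study. Age drives both shoe size and reading score, and the two outcomes have no direct link to each other. The helper `residualize` is a hypothetical name for removing the linear effect of age before correlating what is left.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed, made-up data-generating process: age causes both outcomes.
age = rng.uniform(11, 18, size=n)                   # middle and high school ages
shoe = 0.5 * age + rng.normal(0, 1.0, size=n)       # shoe size depends on age only
reading = 10.0 * age + rng.normal(0, 15.0, size=n)  # reading score depends on age only


def residualize(v, z):
    """Return what is left of v after removing a straight-line fit on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)


# Raw correlation between shoe size and reading score.
raw_corr = np.corrcoef(shoe, reading)[0, 1]

# Correlation after the linear effect of age is removed from both.
partial_corr = np.corrcoef(residualize(shoe, age), residualize(reading, age))[0, 1]

print(f"raw correlation:         {raw_corr:.3f}")
print(f"correlation given age:   {partial_corr:.3f}")
```

Under these assumptions the raw correlation is clearly positive, while the correlation that remains after accounting for age is close to zero. The code does not establish that age is the cause; that interpretation still rests on the conceptual model of human development.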
Causation without correlation. It is a common misconception that correlation is required for causation. Let’s start with a simple example that reveals this to be a fallacy. Suppose the value of y is known to be caused by x. The true relationship between x and y is modulated by another factor, call it A, that takes values of +1 or -1 with equal probability, independently of x. The true process relating x to y is y = Ax.
It is a simple matter to show that the correlation between x and y is zero. Perhaps the most intuitive way is to imagine many samples (observations) of x, y pairs. Over the sub-sample for which the pairs have the same sign (i.e. for which A happened to be +1) y=x and the correlation is 1. Over the sub-sample for which the pairs have the opposite signs (i.e. for which A happened to be -1) y=-x and the correlation is -1. Since A is +1 and -1 with equal probability, the contributions to the total correlation from the two sub-samples cancel, giving a total correlation of zero.
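The sub-sample argument can be checked numerically. The sketch below assumes x is standard normal and that A is drawn independently of x; both are my choices for illustration, since the argument itself does not depend on the particular distribution of x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)                 # assumed distribution for x
a = rng.choice([-1.0, 1.0], size=n)    # A is +1 or -1 with equal probability
y = a * x                              # the true process y = Ax

# Correlation over the whole sample and over each sub-sample.
corr_all = np.corrcoef(x, y)[0, 1]
corr_pos = np.corrcoef(x[a > 0], y[a > 0])[0, 1]   # sub-sample where A = +1
corr_neg = np.corrcoef(x[a < 0], y[a < 0])[0, 1]   # sub-sample where A = -1

print(f"all samples: {corr_all:.3f}")
print(f"A = +1 only: {corr_pos:.3f}")
print(f"A = -1 only: {corr_neg:.3f}")
```

The two sub-sample correlations come out at essentially +1 and -1, and the correlation over the full sample is close to zero, differing from it only by sampling noise.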
Since x really does have a causal role in determining the value of y, we see that causation can exist without correlation. This result hinges on the precise definition of correlation. It is a specific statistic and reveals only a little bit about how x and y relate. Specifically, if x and y are zero mean and unit variance (which we can assume without loss of generality), correlation is the expected value of their product. That single number can’t possibly tell us everything about how x might relate to y. If we didn’t know the true process y=Ax and the statistics of A in advance, we might be tempted to say that x cannot cause y due to a lack of correlation. That would be an incorrect conclusion. Correlation, and our lack of understanding of it, would be misleading us.
But there are other statistics to consider. In the example above x and y are uncorrelated but their magnitudes are not. That is, there are functions of x and functions of y that are correlated. This must be so because the two relate to each other (causally) somehow. In general, evidence consistent with the causal relationship is found in the probability density of y conditioned on x. If x causes y then that conditional probability, p(y|x), must be a function of (vary with) x. It is possible for p(y|x) to depend on x yet for the correlation of x and y to be zero. But causation cannot exist if p(y|x) is independent of x. Or, put even more simply, though x and y can be both uncorrelated and causally related, they cannot be statistically independent and causally related.
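Here is a small sketch of the same y = Ax example that looks beyond the plain correlation. It again assumes a standard normal x and an independent A, and the thresholds 0.5 and 1.5 used to split the sample are arbitrary. It compares the correlation of the magnitudes and uses the spread of y within a slice of x as a rough stand-in for how p(y|x) changes with x.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)                 # assumed distribution for x
a = rng.choice([-1.0, 1.0], size=n)    # independent sign flip
y = a * x

# x and y themselves are uncorrelated, but their magnitudes are not.
corr_xy = np.corrcoef(x, y)[0, 1]
corr_abs = np.corrcoef(np.abs(x), np.abs(y))[0, 1]

# Crude look at p(y|x): the spread of y where |x| is small vs. where it is large.
small = np.abs(x) < 0.5
large = np.abs(x) > 1.5
spread_small = y[small].std()
spread_large = y[large].std()

print(f"corr(x, y):       {corr_xy:.3f}")
print(f"corr(|x|, |y|):   {corr_abs:.3f}")
print(f"std of y, |x| < 0.5: {spread_small:.3f}")
print(f"std of y, |x| > 1.5: {spread_large:.3f}")
```

The spread of y is much larger where |x| is large, so the distribution of y clearly depends on x even though the correlation of x and y is near zero. In other words, x and y are uncorrelated here but not independent.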
Advanced example. (This is a bit more advanced so some readers may wish to skip it.) I’ll close with a nice, realistic example offered by my colleague Steve Pizer. Suppose we have good theoretical reasons to believe that illness causes death. Let
y = death (1 if dead, 0 if alive),
x = illness (1 if sick, 0 if not),
t = administration of treatment (1 if treated, 0 if not),
e = other unobservable factors (could be anything).
The true (hypothetical!) model of death is y = (1-t)x + e. That is, if an individual is ill (x=1) and doesn’t get treatment (t=0), they would surely die, apart from the effects of other factors denoted by e. On the other hand, sick individuals who do get treated live, again ignoring e. Assume the correlation of t and x is very high (like 0.99). That is, nearly everyone who is ill gets treatment and almost nobody who is not ill does. Therefore, hardly anyone who contracts the illness actually dies from it.
If we estimate this model without observing t, we would find that illness and death are very nearly uncorrelated, because almost every case of illness is offset by treatment. Such a finding might tempt us to question our theory that illness causes death. This would be a mistake because we’ve omitted an important factor, treatment t, in the analysis. However, if we can observe t, then the high but imperfect correlation between t and x might make it possible to estimate the true effect of illness on death, using appropriate econometric techniques. We might therefore learn the degree to which illness (untreated) causes death, consistent with our theory.
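A rough simulation sketch of this example follows. Several things in it are my assumptions rather than part of Steve's example: half the population is ill, treatment disagrees with illness status 0.5% of the time (which gives a correlation of t and x of about 0.99), e is normal noise independent of x and t, and y is treated as a continuous score rather than a strict 0/1 outcome. As a stand-in for "appropriate econometric techniques" it uses plain least squares with an x·t interaction term, which is only adequate because the simulated e is independent of everything else.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Assumed population: half are ill; treatment almost always tracks illness.
x = rng.binomial(1, 0.5, size=n).astype(float)   # illness indicator
mismatch = rng.random(n) < 0.005                 # rare cases where t differs from x
t = np.where(mismatch, 1.0 - x, x)               # treatment indicator
e = rng.normal(0.0, 0.5, size=n)                 # other factors (assumed independent)
y = (1.0 - t) * x + e                            # hypothetical model from the text

corr_tx = np.corrcoef(t, x)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

# Without t: regress y on x alone (intercept, x).
X_naive = np.column_stack([np.ones(n), x])
naive_coef, *_ = np.linalg.lstsq(X_naive, y, rcond=None)

# With t observed: regress y on x, t, and their interaction.
# The true model y = x - x*t + e has coefficients (0, 1, 0, -1).
X_full = np.column_stack([np.ones(n), x, t, x * t])
full_coef, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print(f"corr(t, x): {corr_tx:.3f}")
print(f"corr(x, y): {corr_xy:.3f}")
print(f"effect of x, t omitted:  {naive_coef[1]:.3f}")
print(f"effect of x, t included: {full_coef[1]:.3f}")
print(f"x*t interaction:         {full_coef[3]:.3f}")
```

With t omitted, the estimated effect of illness is close to zero. With t included, the coefficient on x lands near 1, the effect of untreated illness in the hypothetical model. The estimate relies entirely on the few untreated ill and treated healthy individuals, which is why the correlation between t and x has to be imperfect for this to work at all.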
The foregoing is an illustration of the type of incorrect conclusions that can result from improper analysis of observational study data (as opposed to a randomized trial). Steve has written a very handy tutorial paper [pdf] on this topic, which I recommend highly to anyone working on observational studies or wishing to better understand them. Additional exploration of the econometric issues is provided in the Background (Section 2) and Set-up (Section 3.1) of a recent NBER paper by Millimet and Tchernis.
Later: For more on this topic, see my follow-up posts.