• It is well known that correlation does not prove causation. What is less well known is that causation can exist when correlation is zero. The upshot of these two facts is that, in general and without additional information, correlation reveals literally nothing about causation. It is neither necessary nor sufficient for it.

Correlation without causation. My favorite hypothetical example of this is a study of thousands of middle and high school kids. The poorly informed investigators measure shoe size and reading comprehension scores. They find that the two are positively correlated. Their manuscript claiming that larger feet cause better reading skills is rejected, of course. Foot size does not cause better reading skills despite the correlation of the two.

Two elements are missing from this study. One is the measurement of age, which is related to both foot size and reading comprehension. The other missing element is a conceptual or theoretical model that provides a basis for causal interpretations of the relationships between age and foot size and between age and reading comprehension. Getting older is correlated with both and we say it is the cause of both because we have a plausible conceptual model of human development that is consistent with such an interpretation.
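The shoe-size study can be sketched as a small simulation. This is only an illustration under invented assumptions: the age range, the coefficients, and the noise levels are made up, and the names (`age`, `shoe`, `reading`, `pearson`) are hypothetical. By construction, shoe size and reading score each depend on age alone and never on each other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Assumed (invented) model: age drives both measurements independently.
age = [rng.uniform(11, 18) for _ in range(n)]
shoe = [0.5 * a + rng.gauss(0, 1) for a in age]
reading = [5.0 * a + rng.gauss(0, 10) for a in age]

# Pooled over all ages, the two measurements are clearly correlated.
r_all = pearson(shoe, reading)

# Within a narrow age band (age held roughly fixed), the correlation fades.
band = [(s, r) for a, s, r in zip(age, shoe, reading) if 14.0 <= a < 14.25]
r_band = pearson([s for s, _ in band], [r for _, r in band])

print(f"all ages: r = {r_all:.3f}")
print(f"ages 14.00 to 14.25: r = {r_band:.3f}")
```

Holding age roughly fixed is a crude stand-in for measuring and controlling for it. The simulation can only show the pattern; the conceptual model of development is still what justifies calling age the cause.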

Causation without correlation. It is a common misconception that correlation is required for causation. Let’s start with a simple example that reveals this to be a fallacy. Suppose the value of y is known to be caused by x. The true relationship between x and y is mediated by another factor, call it A, that takes values of +1 or -1 with equal probability, independently of x. The true process relating x to y is y = Ax.

It is a simple matter to show that the correlation between x and y is zero. Perhaps the most intuitive way is to imagine many samples (observations) of x, y pairs. Over the sub-sample for which the pairs have the same sign (i.e. for which A happened to be +1) y=x and the correlation is 1. Over the sub-sample for which the pairs have the opposite signs (i.e. for which A happened to be -1) y=-x and the correlation is -1. Since A is +1 and -1 with equal probability, the contributions to the total correlation from the two sub-samples cancel, giving a total correlation of zero.
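The sub-sample argument can be checked numerically. The following is a minimal sketch, assuming x is standard normal and A is drawn independently of x (the distribution of x is my assumption; the post does not specify one). The helper `pearson` and the variable names are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 100_000

x = [rng.gauss(0, 1) for _ in range(n)]      # assumed: x is standard normal
a = [rng.choice((-1, 1)) for _ in range(n)]  # A = +1 or -1 with equal probability
y = [ai * xi for ai, xi in zip(a, x)]        # the true process y = Ax

# Correlation within each sub-sample, split by the value A happened to take.
plus = [(xi, yi) for ai, xi, yi in zip(a, x, y) if ai == 1]
minus = [(xi, yi) for ai, xi, yi in zip(a, x, y) if ai == -1]
r_plus = pearson([p[0] for p in plus], [p[1] for p in plus])
r_minus = pearson([m[0] for m in minus], [m[1] for m in minus])

# Correlation over the full sample.
r_total = pearson(x, y)

print(f"A = +1 sub-sample: r = {r_plus:.3f}")
print(f"A = -1 sub-sample: r = {r_minus:.3f}")
print(f"full sample:       r = {r_total:.3f}")
```

The two sub-samples show correlations of essentially +1 and -1, while the pooled correlation is close to zero, up to sampling noise.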

Since x really does have a causal role in determining the value of y we see that causation can exist without correlation. This result hinges on the precise definition of correlation. It is a specific statistic and reveals only a little bit about how x and y relate. Specifically, if x and y are zero mean and unit variance (which we can assume without loss of generality), correlation is the expected value of their product. That single number can’t possibly tell us everything about how x might relate to y. If we didn’t know the true process y=Ax and the statistics of A in advance we might be tempted to say that x cannot cause y due to a lack of correlation. That would be an incorrect conclusion. Correlation and our lack of understanding of it would be misleading us.

But there are other statistics to consider. In the example above x and y are uncorrelated but their magnitudes are not. That is, there are functions of x and functions of y that are correlated. This must be so because the two relate to each other (causally) somehow. In general, evidence consistent with the causal relationship is found in the probability density of y conditioned on x. If x causes y then that conditional probability, p(y|x), must be a function of (vary with) x. It is possible for p(y|x) to depend on x yet for the correlation of x and y to be zero. But causation cannot exist if p(y|x) is independent of x. Or, put even more simply, though x and y can be both uncorrelated and causally related, they cannot be statistically independent and causally related.
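To make this concrete for y = Ax, here is a small sketch, again assuming a standard normal x and an independent A (my assumptions, as are all the names in the code). It compares the correlation of x and y with the correlation of their magnitudes, and uses the spread of y for small versus large |x| as a rough proxy for how p(y|x) varies with x.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def spread(values):
    """Population standard deviation of a sequence."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 100_000

x = [rng.gauss(0, 1) for _ in range(n)]      # assumed: x is standard normal
a = [rng.choice((-1, 1)) for _ in range(n)]  # A = +1 or -1, independent of x
y = [ai * xi for ai, xi in zip(a, x)]

r_xy = pearson(x, y)                                       # close to zero
r_abs = pearson([abs(v) for v in x], [abs(v) for v in y])  # magnitudes move together

# Rough look at p(y|x): the spread of y when |x| is small versus large.
sd_near = spread([yi for xi, yi in zip(x, y) if abs(xi) < 0.5])
sd_far = spread([yi for xi, yi in zip(x, y) if abs(xi) > 1.5])

print(f"corr(x, y)     = {r_xy:.3f}")
print(f"corr(|x|, |y|) = {r_abs:.3f}")
print(f"sd of y given |x| < 0.5: {sd_near:.3f}")
print(f"sd of y given |x| > 1.5: {sd_far:.3f}")
```

The spread of y clearly changes with x even though the correlation of x and y is near zero, so x and y are uncorrelated but not independent.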

Advanced example. (This is a bit more advanced, so some readers may wish to skip it.) I’ll close with a nice, real-world-like example offered by my colleague Steve Pizer. Suppose we have good theoretical reasons to believe that illness causes death. Let

y = death (1 if dead, 0 if alive),
x = illness (1 if sick, 0 if not),
t = administration of treatment (1 if treated, 0 if not),
e = other unobservable factors (could be anything).

The true (hypothetical!) model of death is y = (1-t)x + e. That is, if an individual is ill (x=1) and doesn’t get treatment (t=0), they would surely die, apart from the effects of other factors denoted by e. On the other hand, sick individuals who do get treated live, again ignoring e. Assume the correlation of t and x is very high (like 0.99). That is, nearly everyone who is ill gets treatment and almost nobody who is not ill does. Therefore, hardly anyone who contracts the illness actually dies from it.

If we estimate this model without observing t, we would find that illness and death are nearly uncorrelated. Such a finding might tempt us to question our theory that illness causes death. This would be a mistake because we’ve omitted an important factor, treatment t, from the analysis. However, if we can observe t, then the high but imperfect correlation between t and x might make it possible to estimate the true effect of illness on death, using appropriate econometric techniques. We might therefore learn the degree to which illness (untreated) causes death, consistent with our theory.
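Here is a rough simulation of this hypothetical model. Several things are my own assumptions rather than part of the example: half the population is ill, treatment matches illness except for a 0.5% chance of a mismatch (which gives a correlation of about 0.99), and e is small zero-mean Gaussian noise, so y behaves like a continuous risk score rather than a strict 0/1 outcome. Comparing ill and well people separately within the treated and untreated groups is a simple stand-in for the econometric techniques mentioned above, not a substitute for them. All names are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def mean(values):
    return sum(values) / len(values)

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 200_000

# Assumed: half the population is ill.
x = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
# Assumed: treatment matches illness except for rare (0.5%) mismatches.
t = [xi if rng.random() >= 0.005 else 1 - xi for xi in x]
# Assumed: e is small zero-mean noise standing in for unobserved factors.
e = [rng.gauss(0, 0.3) for _ in range(n)]
# The hypothetical model from the text: y = (1 - t) x + e.
y = [(1 - ti) * xi + ei for xi, ti, ei in zip(x, t, e)]

r_tx = pearson(t, x)  # treatment tracks illness very closely
r_xy = pearson(x, y)  # illness vs. death with t ignored

def effect_of_illness(treated):
    """Mean y of the ill minus mean y of the well, within one treatment group."""
    ill = [yi for xi, ti, yi in zip(x, t, y) if ti == treated and xi == 1]
    well = [yi for xi, ti, yi in zip(x, t, y) if ti == treated and xi == 0]
    return mean(ill) - mean(well)

effect_untreated = effect_of_illness(0)
effect_treated = effect_of_illness(1)

print(f"corr(t, x) = {r_tx:.3f}")
print(f"corr(x, y) ignoring t = {r_xy:.3f}")
print(f"effect of illness, untreated: {effect_untreated:.3f}")
print(f"effect of illness, treated:   {effect_treated:.3f}")
```

With t ignored, illness and death look almost unrelated. Once t is observed, the few untreated ill people reveal an effect of illness close to 1, which is what the model says it should be.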

The foregoing is an illustration of the type of incorrect conclusions that can result from improper analysis of observational study data (as opposed to a randomized trial). Steve has written a very handy tutorial paper [pdf] on this topic, which I recommend highly to anyone working on observational studies or wishing to better understand them. Additional exploration of the econometric issues is provided in the Background (Section 2) and Set-up (Section 3.1) of a recent NBER paper by Millimet and Tchernis.

Later: For more on this topic, see my follow-up posts.

• Point taken, but I think you’d do better with a more accessible (read: less mathy) example as the meat of your argument.

• @Pete Michaud – I wish it were possible. Without the math it just comes down to “correlation doesn’t tell you much, and nothing on its own about causality.” The math just proves it.

A subsequent post will go deeper into the source of our notions of causation.

The interesting non-mathy aspect of this is that few people can retain these facts in an intuitive way. Even those who get it frequently make causal inferences where none are warranted. It is hard to live a day without doing so. It seems wired into our brains. But it does “cause” us to make mistakes. 🙂

• Humans will always dive headlong into this without a second thought; you only have to hear a squeaky wheel before you’ll start making presumptions and predictions about how it should squeak from there on in.

We seem to be very ego-based and unable to step outside of our own frame of reference, which just compounds the problem.

I like the math, I found it exciting, I nearly wee’d a little.

• Great post on a great topic, about which you’re clearly much better educated than I.

But it seems to me that “The upshot of these two facts is that, in general and without additional information, correlation reveals literally nothing about causation” is itself the sort of statement that leaves out additional information. Though it IS a very strong statement, and thus drives traffic to the blog!

The missing information is whether causation does, on average, result in correlation or not. (Is it a safe “folk physics” assumption?) I think it does (is). Conditions that are neither necessary nor sufficient may, nonetheless, be very closely (ahem) correlated with their “outcomes” (forgive me for not knowing the right terms … and in getting all “meta” since in this case the “condition” I’m suggesting might be correlated is itself “correlation”).

To my uneducated eye, I’d say that even though correlation does not mean causation and causation does not mean correlation, they nonetheless travel very closely together. In other words: the instances in which causation results in correlation are far more frequent in the world than those in which it does not. (Maybe that’s why the examples had so much mathyness?)

Surely if X causes Y instances of X are more likely to hang around with instances of Y than not.

Though I also have a bad feeling I’m missing the depth of what you’re saying, and you may need to dumb it down for me.

• @James Bronzan – Your thought-provoking comment inspired a new post on this topic. See http://theincidentaleconomist.com/reader-response-causation-bias/

• Very interesting post, Austin. I’m wondering how you think this relationship applies to the current debate on climate change – it seems extremely pertinent. While the prevailing wisdom is that CO2 and other greenhouse gases cause global warming, there are dissenting views that can’t seem to get much airtime. One of the so-called “deniers” whose contrarian view I find quite interesting is Peter Taylor, a UK anthropologist. He argues that although there is certainly warming going on in some parts of the planet, and there may be a correlation between CO2 and temperature, the notion that climate change is caused by CO2 is false. It has more to do with cloud cover, solar activity, etc.

I’m not sure where I come out in the debate, but it’s certainly interesting to examine other sides in the debate, rather than just going along with the prevailing wisdom that the matter is settled. His talk on Google video is long, but worthwhile, and I think delves into the issue you’ve raised here.

• @Sterling Zumbrunn – Thanks, and thanks for the link.

I’d be extrapolating far out of my comfort zone to suggest I have anything to add to the climate change debate (if we can call it that). In many things I’m comfortable accepting what seems to be the scientific consensus, at least as the prevailing working hypothesis, knowing all the while that no such thing can ever be proven. That is to say, I trust science and the scientific community, even with its imperfections and bias. It’s the best we have.

And by “accepting” the prevailing view I mostly mean taking that view as the most likely true one that I carry about in a reflexive way. I’m always willing to entertain challenges to anything in principle, but with finite time and resources, I don’t enter a debate on just anything at any time anyone else opens the door to one. In many things there is no point in my doing so other than entertainment value. So for now I’ll stay focused on health policy and various aspects of economics and its application with which I’m familiar.

• Correlation is required for causality.

In your hypothetical example, A would be a hidden variable and there would be both a causal and correlation relationship between X and A and Y and A. No correlation would exist between X and Y, because they are not causative factors.

Also in your example, you are using negative and positive values to give a correlation of 0. Correlation statistics use R^2 values to prevent this from happening. The example doesn’t work.

If you can think of a single real world example where this works, I am all ears. 😉

• How are x and A correlated?

x and y are uncorrelated by the definition of correlation. There may be other metrics that don’t yield zero, but that’s not what I am talking about.

The example at the end is closer to “real world.” But it stems from a mistake (though of a type commonly made). However, I do think it is next to impossible for causation to exist in the real world without correlation.

• I find it interesting that your denial of ‘causation requires correlation’ is justified by a causal claim based on correlation!

I’m pretty sure you’d agree that (outside of mathematical relationships and axiomatic definitions) knowledge can only be derived empirically.

So when you use the example of y = Ax, you claim that we know beforehand that x causes y, as in the absence of A, y = x. But how do you know that y = x for any natural world phenomenon if not by observation and correlation? You can’t (perhaps some God whispers the truth in your ear, but then how do you know he’s telling the truth?).

The only way we can ‘know’ things about the natural world is from correlation. Causal claims require knowledge to be justified, and therefore require correlation. Your example may be hidden causation, but you can’t make that claim without knowledge that y and x correlate in the absence of A!

PS, in your life equation y = (1-t)x + e, you say that e is all unobservable factors. Illness must be observable (because you’ve included it in your equation), therefore e excludes illness and treatment. How many people die without either illness or treatment? Very few, I’m guessing.

• I found the advanced example easier to understand since you used an actual example

Thanks

• Here is an example that I think works. Take, for example, a study that looks at the factors which make people vote for the Republican Party (y). One of our key variables may be ‘Reading News Online’ (x), which has a positive effect on your likelihood of voting Republican. However, we also know that older people access the internet less than younger people. That is, age (which we will call z) has a negative correlation with x. We also know that older people are more likely to vote Republican (i.e., z has a positive correlation with y). If the correlations are of similar magnitude, it is possible that a raw correlation coefficient between x and y would be 0 even if we know x causes y.

The example isn’t perfect, but it basically illustrates his point that correlations are tricky. The suggestion that “it is near impossible for causation to exist without correlation” also seems a little hasty. Maybe that’s because you have been looking for causation using correlations?

• Wouldn’t the y = Ax produce a bimodal distribution in y?

Every probabilistic problem requires a histogram and corresponding probability distribution function (PDF). So, there cannot be causation without correlation, where correlation is defined typically by a PDF.

• To me it seems we are mincing words. What your examples appear to prove is not that causation does not require correlation, but rather that the lack of correlation in one set of observations does not necessarily negate causation.

It seems that what we are really questioning is whether the appearance of correlation is necessary for causation, which, of course, it is not. The appearance of correlation may not exist at all given poorly constructed studies that ignore important factors.

However, in actuality, with all other things appropriately controlled, illness causes death and causation requires correlation.