Can Big Data Address Causal Inference?

Rocío Titiunik thinks not, in general.

The relevant question is whether big data has the potential to uncover causal relationships that could not be discovered with “small” data. […] [T]he bottleneck often is the lack of a solid research design and a credible theory, both of which are essential to develop, test, and accumulate causal explanations. […]

The fundamental problem of causal inference is that for every unit, we fail to observe the value that the outcome would have taken if the chosen level of the treatment had been different (Holland 1986). Therefore, the search for causal inferences is a search for assumptions under which we can infer the values of these unobserved counterfactual outcomes from observed data. The question at the center of my argument is whether access to big data fundamentally increases the likelihood that those assumptions will hold.

“Big data” can mean many things. In particular, it could mean a large number of observations and/or a large number of variables. We can dispense with the first one relatively easily:

[N]o increase in the number of observations, no matter how large, will cause the omitted variable bias in a mis-specified [] model to disappear.
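A small simulation can make this concrete. The sketch below is my own illustration, not from Titiunik's article: it assumes a hypothetical linear model in which the outcome y depends on a treatment x (true coefficient 1) and on a confounder z (coefficient 2) that is correlated with x, and then regresses y on x alone. Under these assumed values the slope converges to 1 + 2 · cov(x, z) / var(x) = 2, so a larger sample only makes the estimate more precisely wrong.

```python
import random


def short_regression_slope(n, beta=1.0, gamma=2.0, seed=0):
    """OLS slope of y on x alone, when the true model also includes z.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                         # omitted confounder
        x = z + rng.gauss(0, 1)                     # treatment, correlated with z
        y = beta * x + gamma * z + rng.gauss(0, 1)  # true effect of x is beta
        xs.append(x)
        ys.append(y)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


# The estimate stabilises as n grows, but around 2 rather than the true 1.
for n in (100, 10_000, 200_000):
    print(n, round(short_regression_slope(n), 3))
```

Adding z as a regressor would remove the bias in this toy setup, but only because the simulation tells us that z is the one thing missing.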

As for a large number of variables, one argument that they may not help is that something important could still be missing.

Without a theory and a research design, it is not possible to know when to stop adding to the list.

For all that, big data are still useful, as Titiunik goes on to discuss. They just aren't a solution to causal inference by themselves. I agree.

