In his paper on big data, Hal Varian distinguishes prediction from causal inference. This is welcome. First, on prediction:

Machine learning is concerned primarily with prediction; the closely related field of data mining is also concerned with summarization, and particularly with finding interesting patterns in the data. […]

Much more detail about these methods can be found in machine learning texts; an excellent treatment is available in Hastie, Tibshirani, and Friedman (2009), which can be freely downloaded.

Next, on marrying machine learning prediction with what economists do (causal inference):

There are a number of areas where there would be opportunities for fruitful collaboration between econometrics and machine learning. […] [T]he most important area for collaboration involves causal inference. Econometricians have developed several tools for causal inference such as instrumental variables, regression discontinuity, difference-in-differences, and various forms of natural and designed experiments (Angrist and Krueger 2001). Machine learning work has, for the most part, dealt with pure prediction. In a way, this is ironic, since theoretical computer scientists, such as Pearl (2009a, b) have made significant contributions to causal modeling. However, it appears that these theoretical advances have not as yet been incorporated into machine learning practice to a significant degree.

How might this work?

Suppose a given company wants to determine the impact of an advertising campaign on visits to its website. It first uses [some prediction technique] to build a model predicting the time series of visits as a function of its past history, seasonal effects, and other possible predictors such as Google queries on its company name, its competitors’ names, or products that it produces. […]

It next runs an ad campaign for a few weeks and records visits during this period. Finally, it makes a forecast of what visits would have been in the absence of the ad campaign using the model developed in the first stage. Comparing the actual visits to the counterfactual visits gives us an estimate of the causal effect of advertising.

Well, under some assumptions, naturally. This presumes that the counterfactual is exactly what the predictive model estimates for the period in which the ad campaign runs; whether the campaign made a difference can only be answered under that assumption. This is, or closely resembles, an interrupted time series design. As a falsification test, I'd look for other time series that the ad campaign should not have affected but that the same unobserved confounders could plausibly influence. These concerns go unmentioned.
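The two-stage procedure Varian describes can be sketched in a few lines. This is a toy illustration, not his implementation: ordinary least squares stands in for "[some prediction technique]", the data are simulated, and all variable names are invented.

```python
# A minimal sketch of the two-stage counterfactual approach, with
# entirely made-up daily visit data and OLS as the predictive model.
import numpy as np

rng = np.random.default_rng(0)

# Simulate daily site visits: trend + weekly seasonality + noise,
# then pretend an ad campaign lifts visits in the last 30 days.
days = np.arange(120)
visits = (500 + 0.5 * days + 10 * np.sin(2 * np.pi * days / 7)
          + rng.normal(0, 5, size=days.size))
visits[90:] += 40  # the "true" campaign effect we hope to recover

# Stage 1: fit the predictive model on the pre-campaign period only.
pre = days < 90
X = np.column_stack([np.ones_like(days), days,
                     np.sin(2 * np.pi * days / 7),
                     np.cos(2 * np.pi * days / 7)])
beta, *_ = np.linalg.lstsq(X[pre], visits[pre], rcond=None)

# Stage 2: forecast the counterfactual (no-campaign) visits for the
# campaign period and compare against what was actually observed.
counterfactual = X[~pre] @ beta
effect = (visits[~pre] - counterfactual).mean()
print(f"estimated average lift: {effect:.1f} visits/day")
```

The entire causal claim rests on Stage 2: the forecast is only a valid counterfactual if nothing else shifted the series when the campaign started, which is exactly the assumption the falsification test above is meant to probe.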

Next, Varian suggests where to place our concern:

In this period of “big data,” it seems strange to focus on sampling uncertainty, which tends to be small with large datasets, while completely ignoring model uncertainty, which may be quite large. One way to address this is to be explicit about examining how parameter estimates vary with respect to choices of control variables and instruments.

Granted, concern about sampling uncertainty (sampling error) should diminish as the sample increases in size. But we must still worry about sampling bias, as big data samples are often samples of convenience. Are they representative of the population we imagine them to be? That should not be assumed simply because we have a lot of data.
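Varian's suggestion about model uncertainty is easy to make concrete: refit the same treatment-effect regression under every choice of control variables and watch how much the estimate of interest moves. The sketch below uses simulated data with an unobserved confounder; all names and numbers are invented for illustration.

```python
# A hedged sketch of model uncertainty: with n = 5000, sampling error is
# tiny, yet the treatment estimate swings widely across control-variable
# choices because a confounder is only partially observable via a proxy.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 5000

confounder = rng.normal(size=n)                 # unobserved
treated = (confounder + rng.normal(size=n)) > 0  # selection into treatment
x1 = confounder + rng.normal(size=n)             # noisy proxy for confounder
x2 = rng.normal(size=n)                          # irrelevant control
y = 2.0 * treated + 3.0 * confounder + rng.normal(size=n)  # true effect = 2

controls = {"x1": x1, "x2": x2}
estimates = {}
for subset in itertools.chain.from_iterable(
        itertools.combinations(controls, k) for k in range(len(controls) + 1)):
    X = np.column_stack([np.ones(n), treated.astype(float)] +
                        [controls[name] for name in subset])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates[subset or ("none",)] = beta[1]

for subset, b in estimates.items():
    print(",".join(subset), f"-> treatment estimate {b:.2f}")
```

The confidence interval around any single specification here would be narrow, but the spread across specifications is large: precisely the contrast between sampling uncertainty and model uncertainty that Varian is pointing at.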