Frontiers in Massive Data Analysis, a report from the National Research Council, nails some of the challenges of big data:
But the challenges for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems (and classical search engines) and, instead, hinge on the ambitious goal of inference. Inference is the problem of turning data into knowledge, where knowledge often is expressed in terms of entities that are not present in the data per se but are present in models that one uses to interpret the data. Statistical rigor is necessary to justify the inferential leap from data to knowledge, and many difficulties arise in attempting to bring statistical principles to bear on massive data. Overlooking this foundation may yield results that are not useful at best, or harmful at worst. In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something resembling knowledge when actually it is not. Moreover, it can be quite difficult to know that this has happened. […]
Another way of characterizing the major problems of massive data analysis is to look at the major inferential challenges that must be addressed. […]
- Assessment of sampling biases,
- Inference about tails,
- Resampling inference,
- Change point detection,
- Reproducibility of analyses,
- Causal inference for observational data, and
- Efficient inference for temporal streams. [Emphasis added.]
I worry a lot about making mistakes in inference as we leverage big data in health care. The questions I’ve been pondering are these: How can we draw incorrect causal inferences, and how can we reasonably and reliably protect ourselves from doing so? I think falsification tests can help. I was curious to find out what the authors of this report thought about this matter.
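To make the idea of a falsification test concrete, here is a minimal sketch in Python. It is my own illustration, not anything from the report. It assumes simulated data in which an unobserved confounder (called `frailty` here) drives both who gets treated and a "placebo" outcome that the treatment cannot possibly affect. The function names, the simulated numbers, and the `tolerance` threshold are all hypothetical. If a naive comparison finds an "effect" on the placebo outcome, that is a warning sign that the same comparison may be confounded for the outcome we actually care about.

```python
import random
import statistics


def diff_in_means(treated, outcome):
    """Naive treated-minus-untreated difference in mean outcome."""
    t = [y for d, y in zip(treated, outcome) if d]
    c = [y for d, y in zip(treated, outcome) if not d]
    return statistics.mean(t) - statistics.mean(c)


def simulate(n, confounded, seed=0):
    """Simulate treatment flags and a placebo outcome (illustrative only).

    The placebo outcome depends on an unobserved confounder, never on
    treatment. When `confounded` is True, the confounder also drives
    who gets treated; otherwise treatment is assigned by a coin flip.
    """
    rng = random.Random(seed)
    treated, placebo_outcome = [], []
    for _ in range(n):
        frailty = rng.gauss(0, 1)  # hypothetical unobserved confounder
        if confounded:
            p_treat = 0.8 if frailty > 0 else 0.2
        else:
            p_treat = 0.5
        treated.append(rng.random() < p_treat)
        placebo_outcome.append(frailty + rng.gauss(0, 1))
    return treated, placebo_outcome


def fails_falsification(treated, placebo_outcome, tolerance=0.1):
    """Flag a suspicious 'effect' on an outcome treatment cannot affect.

    `tolerance` is an arbitrary cutoff chosen for this sketch; a real
    analysis would more likely use a confidence interval or a test.
    """
    return abs(diff_in_means(treated, placebo_outcome)) > tolerance


if __name__ == "__main__":
    for label, confounded in [("confounded", True), ("randomized", False)]:
        d, y = simulate(20000, confounded=confounded, seed=1)
        print(label, round(diff_in_means(d, y), 3), fails_falsification(d, y))
```

Passing such a test does not prove a causal estimate is right, since the placebo outcome may simply not share the confounders that matter. Failing it, though, is a cheap and fairly clear signal that something is off.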
The report is 176 pages long. Here is the entire section devoted to causal modeling:
Harnessing massive data to support causal inference represents a central scientific challenge. Key application areas include climate change, health-care comparative effectiveness and safety, education, and behavioral economics. Massive data open up exciting new possibilities but present daunting challenges. For example, given electronic health-care records for 100 million people, can we ascertain which drugs cause which side effects? The literature on causal modeling has expanded greatly in recent years, but causal modeling in massive data has attracted little attention.