For this roundup of quotes, I received input from Darius Tahir and David Shaywitz. Prior TIE posts on big data are here. As always, the quotes below reflect not the views of the authors quoted, but those of the people or communities they’re covering. Click through for details.
1. David Shaywitz, in “Why Causation Is (Often) Not Causation – The Retro Humility Of Empiricism,” articulates the “strong version” of the “big data thesis”:
A strong version of the canonical big data thesis is that when you have enough information, you can make unbiased predictions that don’t require an underlying understanding of the process or context – the data are sufficient to speak for themselves. This is the so-called “end of theory.”
2. Darius Tahir reports on the content of a Rock Health slide deck:
Healthcare accelerator Rock Health is predicting big advances for startups and healthcare providers using personalized, predictive analytic tools. The firm has observed $1.9 billion in venture dollars pouring into the subsector since 2011, with major venture capital firms keeping active.
The use of predictive analytics, essentially looking at historic data to predict future developments to directly intervene in patient care, will only increase as data multiplies, the report argues.
In 2012, the healthcare system had stored roughly 500 petabytes of patient data, the equivalent of 10 billion four-drawer file cabinets full of information.
By 2020, the healthcare system is projected to store 50 times as much information, 25,000 petabytes, meaning machine intelligence will be essential to complement human intelligence to make sense of it all.
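As a quick sanity check on those storage figures, here is a minimal arithmetic sketch: the file-cabinet equivalence is the report’s, and I’m only computing the ratios it implies.

```python
# Quick arithmetic on the Rock Health figures quoted above.
PB = 10**15  # bytes per petabyte (decimal convention)

stored_2012 = 500 * PB               # ~500 petabytes of patient data in 2012
projected_2020 = 50 * stored_2012    # "50 times as much" by 2020
print(projected_2020 / PB)           # 25000.0 petabytes, matching the report

cabinets = 10_000_000_000            # "10 billion four-drawer file cabinets"
print(stored_2012 / cabinets / 1e6)  # ~50 MB of data per cabinet, the implied ratio
```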
See, in particular, pages 18 and 19 of the Rock Health slide deck. I found it interesting that “prediction” and “predictive” appear throughout the deck, while there is no direct language of causality. This is appropriate. I also suspect that an organization that didn’t understand this limitation would slip into causal language now and then. In other words, Rock Health, and likely others, know exactly what they’re selling. (I am not disparaging prediction here. It is useful. I am merely distinguishing it from causal inference.)
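To make that distinction concrete, here is a minimal simulated example with entirely made-up numbers: a biomarker that predicts spending well because both are driven by a common cause, yet intervening on the biomarker changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: unobserved illness severity drives both a biomarker
# and next-year spending; the biomarker has no causal effect on spending.
severity = rng.normal(size=n)
biomarker = severity + rng.normal(scale=0.5, size=n)
spending = 10_000 + 5_000 * severity + rng.normal(scale=2_000, size=n)

# Prediction works: the biomarker correlates strongly with spending (~0.83).
print(np.corrcoef(biomarker, spending)[0, 1])

# But intervene on the biomarker (lower it by one unit) without touching
# severity: spending is unchanged, because the biomarker isn't causal.
biomarker_after = biomarker - 1.0
spending_after = 10_000 + 5_000 * severity + rng.normal(scale=2_000, size=n)
print(spending.mean(), spending_after.mean())  # essentially identical
```

A model trained on historical data would happily “find” the biomarker and predict well; only a causal analysis (or an experiment) tells you whether acting on it helps anyone.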
3. Tim Harford has written one of the best pieces on the limitations of big data I’ve read to date. Big data is often also “found data,” and hence typically suffers from selection bias. It also invites multiple hypothesis testing: query the data enough and something (meaningless) will eventually appear statistically significant. (More by David Shaywitz on this point here.) I recommend reading his piece in full; it includes many examples from Google, Twitter, Target, the city of Boston, and the history of polling. Here’s an excerpt, cobbled from snippets throughout (I illustrate both problems with small simulations after the excerpt):
Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends [which Harford summarizes, as well as its later comeuppance]: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”. [I quoted from and linked to that Wired article here.]
Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks. Absolute nonsense.” […]
A recent report from the McKinsey Global Institute reckoned that the US healthcare system could save $300bn a year – $1,000 per American – through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes. […]
“There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.” […]
The Literary Digest, in its quest for a bigger data set, fumbled the question of a biased sample. It mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous. To compound the problem, Landon supporters turned out to be more likely to mail back their answers. The combination of those two biases was enough to doom The Literary Digest’s poll. For each person George Gallup’s pollsters interviewed, The Literary Digest received 800 responses. All that gave them for their pains was a very precise estimate of the wrong answer.
The big data craze threatens to be The Literary Digest all over again. Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is. […]
[B]ig data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better. […]
Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.
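Both of the failure modes Harford describes are easy to reproduce in a few lines of simulation. First, the multiple-testing problem: in this sketch every “feature” is pure noise, yet several clear the conventional p < 0.05 bar anyway.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_features = 500, 200

# Pure noise: no feature has any real relationship to the outcome.
X = rng.normal(size=(n_patients, n_features))
y = rng.normal(size=n_patients)

# Correlation of each feature with the outcome, and the |r| cutoff that
# corresponds (approximately) to p < 0.05 for n = 500.
r = np.corrcoef(X.T, y)[-1, :-1]
cutoff = 1.96 / np.sqrt(n_patients)
print((np.abs(r) > cutoff).sum())  # roughly 10 "significant" correlations, all meaningless
```

And the Literary Digest problem: a huge, cheap, biased sample delivers a very precise estimate of the wrong answer, while a small random sample does fine. The population and response rates below are invented; only the logic follows Harford’s account.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical electorate: 55% actually support candidate A.
population = rng.random(1_000_000) < 0.55

# Bias 1: the mailing list under-represents A's supporters.
on_list = rng.random(population.size) < np.where(population, 0.20, 0.40)
# Bias 2: among those mailed, A's supporters are also less likely to reply.
replied = rng.random(population.size) < np.where(population, 0.15, 0.30)
mail_poll = population[on_list & replied]

# A small simple random sample, Gallup-style.
srs = rng.choice(population, size=3_000, replace=False)

print("truth: 55.0%")
print(f"mail-in poll (n={mail_poll.size:,}): {100 * mail_poll.mean():.1f}%")  # ~23%, precisely wrong
print(f"random sample (n=3,000): {100 * srs.mean():.1f}%")                    # ~55%, within a point or two
```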
4. David Shaywitz in “Turning Information Into Impact: Digital Health’s Long Road Ahead”:
A leading scientist once claimed that, with the relevant data and a large enough computer, he could “compute the organism” – meaning completely describe its anatomy, physiology, and behavior. Another legendary researcher asserted that, following capture of the relevant data, “we will know what it is to be human.” The breathless excitement of Sydney Brenner and Walter Gilbert – voiced more than a decade ago and captured by the skeptical Harvard geneticist Richard Lewontin – was sparked by the sequencing of the human genome. Its echoes can be heard in the bold promises made for digital health today. […]
[T]echnologists, investors, providers, and policy makers all exalt the potential of digital health. Like genomics, the big idea – or leap of faith – is that through the more complete collection and analysis of data, we’ll be able to essentially “compute” healthcare – to the point, some envision, where computers will become the care providers, and doctors will at best be customer service personnel, like the attendants at Pep Boys, interfacing with libraries of software-driven algorithms.
5. David Shaywitz in “A Database of All Medical Knowledge: Why Not?” writes about the challenges of finding and assembling big data. Here’s the set-up:
For scientists and engineers today, perhaps the greatest challenge is the structure and assembly of a unified health database, a “big data” project that would collect in one searchable repository all of the parameters that measure or could conceivably reflect human well-being. This database would be “coherent,” meaning that the association between individuals and their data is preserved and maintained. A recent Institute of Medicine (IOM) report described the goal as a “Knowledge Network of Disease,” a “unifying framework within which basic biology, clinical research, and patient care could co-evolve.”
The information contained in this database — expected to get denser and richer over time — would encompass every conceivable domain, covering patients (DNA, microbiome, demographics, clinical history, treatments including therapies prescribed and estimated adherence, lab tests including molecular pathology and biomarkers, info from mobile devices, even app use), providers (prescribing patterns, treatment recommendations, referral patterns, influence maps, resource utilization), medical product companies (clinical trial data), payors (claims data), diagnostics companies, electronic medical record companies, academic researchers, citizen scientists, quantified selfers, patient communities – and this just starts to scratch the surface.
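To make “coherent” a bit more concrete, here is a toy sketch of what a patient-keyed record spanning a few of those domains might look like. The field names and structure are my own invention for illustration, not anything from the IOM report or Shaywitz’s piece.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PatientRecord:
    """Toy illustration: every observation, from any source, stays linked
    to the same individual - that link is what makes the database 'coherent'."""
    patient_id: str
    demographics: dict = field(default_factory=dict)    # age, sex, location, ...
    genome: Optional[str] = None                         # pointer to sequence data
    diagnoses: list = field(default_factory=list)        # clinical history
    prescriptions: list = field(default_factory=list)    # therapies + estimated adherence
    labs: list = field(default_factory=list)             # incl. molecular pathology, biomarkers
    claims: list = field(default_factory=list)           # payor data
    device_streams: list = field(default_factory=list)   # mobile devices, app use

record = PatientRecord(patient_id="P-000123")
record.diagnoses.append({"code": "E11.9", "source": "EHR", "date": "2014-01-15"})
record.device_streams.append({"source": "wearable", "metric": "steps", "value": 8_200})
```

The hard part, of course, is not sketching the schema but getting all of those parties to contribute data keyed to the same person in the first place, which is exactly the challenge Shaywitz describes.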