Ten impressions of big data: Claims, aspirations, hardly any causal inference

“Big data” is all the rage. I am curious what people think big data can do, and what some claim it will do, for health and health care. I’m curious how people think causal connections will arise from (or using) big data. In large part this consideration seems overlooked, as if credible causal inferences will just emerge from the data, announcing themselves, dripping wet with self-evident validity. I am concerned.

I’ve been collecting excerpts of articles on big data, many sent to me by Darius Tahir, whom I thank. What I’ve compiled to date is below and in no particular order. For each piece, the author (with link to original) is indicated, followed by a quote. In many cases, what’s quoted is not an expression of the author’s views, but a characterization of the views of individuals about whom the author is reporting. I encourage you to click through for details before jumping to conclusions about who holds what view.

Also, do not interpret these as suggesting I do not see promise in big data. I do! I just think how we use data matters as much as, if not more than, how much data we have. We should marry “big data” with “smart analysis,” not just “big claims.”

1. Bill Gardner has not overlooked causal inference:

Here’s where the ‘big data’ movement comes in. We can assemble data sets with large numbers of patients from electronic health records (EHRs). Moreover, EHRs contain myriad demographic and clinical facts about these patients. It is proposed that with these large and rich data sets, we can match drug X and drug Y patients on clinically relevant variables sufficiently closely that the causal estimate of the difference between the effects of drug X and drug Y in the matched observational cohort would be similar to the estimate we would get if we had run an RCT.
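
To make that concrete, here is a minimal sketch of the matched-cohort logic Bill describes, on simulated data. The variable names (age, a comorbidity score, a continuous outcome) are hypothetical stand-ins for the “myriad demographic and clinical facts” in an EHR, and the method shown (nearest-neighbor matching on an estimated propensity score) is one common way to do the matching, not necessarily anyone’s actual pipeline.

```python
# A sketch of the matched-cohort logic: estimate each patient's
# probability of receiving drug X (a propensity score), pair each
# drug X patient with the most similar drug Y patient, and compare
# outcomes. All data below are simulated; names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "comorbidity": rng.normal(2, 1, n),
})
# Treatment assignment depends on the covariates (confounding by design).
logit = 0.05 * (df["age"] - 60) + 0.5 * df["comorbidity"] - 1
df["drug_x"] = rng.random(n) < 1 / (1 + np.exp(-logit))
# Outcome: drug X truly improves it by 1 unit.
df["outcome"] = (1.0 * df["drug_x"] - 0.1 * df["age"]
                 - 0.5 * df["comorbidity"] + rng.normal(0, 1, n))

covs = df[["age", "comorbidity"]]
df["pscore"] = LogisticRegression().fit(covs, df["drug_x"]).predict_proba(covs)[:, 1]

x_pat, y_pat = df[df["drug_x"]], df[~df["drug_x"]]
# Match each drug X patient to the drug Y patient with the closest
# propensity score (1:1, with replacement, for simplicity).
nn = NearestNeighbors(n_neighbors=1).fit(y_pat[["pscore"]])
_, idx = nn.kneighbors(x_pat[["pscore"]])
matched_y = y_pat.iloc[idx.ravel()]

naive = x_pat["outcome"].mean() - y_pat["outcome"].mean()
matched = x_pat["outcome"].mean() - matched_y["outcome"].mean()
print(f"naive difference:   {naive:+.2f}")   # biased by confounding
print(f"matched difference: {matched:+.2f}") # closer to the true effect of +1
```

Note what makes the matched estimate land near the truth here: the simulation records every confounder, and the analyst adjusts for all of them. Matching can only balance what’s in the data; an unmeasured confounder passes straight through. That assumption, not the arithmetic, is where the “similar to an RCT” claim lives or dies.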

2. David Shaywitz echoes Bill and also notes the views of others that begin to shade toward the magical or mystical (“something will emerge”):*

Clinical utility, as Haddow and Palomaki write, “defines the risks and benefits associated with a test’s introduction into practice.” In other words, what’s the impact of using a particular assessment – how does it benefit patients, how might it adversely impact them? This may be easiest to think about in the context of consumer genetic tests suggesting you may be at slightly elevated risk for condition A, or slightly reduced risk for condition B: is this information (even if accurate) of any real value? […]

The other extreme, which Stanford geneticist Atul Butte is perhaps best known for advocating, is what might be called the data volume perspective; collect as much data as you possibly can, the reasoning goes, and even if any individual aspect of it is sketchy or unreliable, these issues can be overcome with volume. If you examine enough parameters, interesting relationships are likely to emerge, and the goal is to not let the perfect be the enemy of the good enough. Create a database with all the information you can find, the logic goes, and something will emerge.

3. Darius Tahir reminds us that we’re most readily going to find correlations (implication: not causation) in a hypothesis-free space:

Supplementing medical data with consumer data might lead to better predictions, he, and the alliance, reasoned.

In the pilot program, the network will send its health data to a modeler, which will pair that information with consumer data, such as credit card and Google usage. The modeler doesn’t necessarily have a hypothesis going in, Cantor said.

“They’re identifying correlations between the consumer data and healthcare outcomes,” he said.

4. Amy Standen really frightens me with the scientific-method-is-dead idea:

“The idea here is, the scientific method itself is growing obsolete,” […]

[S]o much information will be available at our fingertips in the future that there will be almost no need for experiments. The answers are already out there. […]

Now, Butte says, “you can connect pre-term births from the medical records and birth census data to weather patterns, pollution monitors and EPA data to see is there a correlation there or not.” […]

Analyzing data is complicated and requires specific expertise. What if the search engine has bugs, or the records are transcribed incorrectly? There’s just too much room for error, she says.

“It’s going to take a system to interpret the data,” she says. “And that’s what we don’t have yet. We don’t have that system. We will, I mean for sure, the data is there, right? Now we have to develop the system to use it in a thoughtful, safe way.”

5. Chris Anderson says that numbers can speak for themselves:

Today companies like Google, which have grown up in an era of massively abundant data, don’t have to settle for wrong models. Indeed, they don’t have to settle for models at all. […]

With enough data, the numbers speak for themselves. […]

“Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. […]

Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
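
Here is a small illustration, on simulated data, of why that worries me. If you screen enough variables against an outcome with no hypothesis at all, “significant” correlations emerge from pure noise:

```python
# Hypothesis-free correlation mining on pure noise: with enough
# variables, "significant" correlations appear by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_patients, n_features = 500, 200
features = rng.normal(size=(n_patients, n_features))  # no structure at all
outcome = rng.normal(size=n_patients)                 # independent of everything

hits = [j for j in range(n_features)
        if stats.pearsonr(features[:, j], outcome)[1] < 0.05]

# At a 5% threshold, expect ~10 false "discoveries" out of 200 tests.
print(f"{len(hits)} of {n_features} features correlate with the outcome at p < 0.05")
```

By construction, none of these “discoveries” is real and none would replicate. The numbers are speaking for themselves; they just aren’t saying anything true.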

6. Bernie Monegain writes about Partners HealthCare chief information officer James Noga’s dream of moving beyond prediction (for which correlations that aren’t causation can be useful) to designing interventions (for which causality is crucial):

He likes to employ a travel analogy. Drivers once got maps to travel from one point to another — they basically figured it out themselves — then they went to predictive analytics to find the best route to get from point A to point B.

“Then as you get into prescriptive analytics, it actually tells you on the way real time, an accident has happened and reroutes you,” said Noga.

“With big data you’re really talking about data that’s fast moving and perpetually occurring, actually able to intercede rather than merely advise in terms of the care of patients,” he said. “On the discovery side with genetics and genomics using external data sources, I think the possibilities of what I would call evidence-based medicine, and being able to drive that to drive better protocols on the clinical side is endless in terms of the possibilities.”

7. Veronique Greenwood offers concrete examples and a warning:

Back in her office, [Jennifer Frankovich] found that the scientific literature had no studies on patients like this to guide her. So she did something unusual: She searched a database of all the lupus patients the hospital had seen over the previous five years, singling out those whose symptoms matched her patient’s, and ran an analysis to see whether they had developed blood clots. “I did some very simple statistics and brought the data to everybody that I had met with that morning,” she says. The change in attitude was striking. “It was very clear, based on the database, that she could be at an increased risk for a clot.” […]

For his doctoral thesis, [Nicholas Tatonetti] mined the F.D.A.’s records of adverse drug reactions to identify pairs of medications that seemed to cause problems when taken together. He found an interaction between two very commonly prescribed drugs: The antidepressant paroxetine (marketed as Paxil) and the cholesterol-lowering medication pravastatin were connected to higher blood-sugar levels. Taken individually, the drugs didn’t affect glucose levels. But taken together, the side-effect was impossible to ignore. “Nobody had ever thought to look for it,” Tatonetti says, “and so nobody had ever found it.” […]

There are numerous correlations like this, and the reasons for them are still foggy — a problem Tatonetti and a graduate assistant, Mary Boland, hope to solve by parsing the data on a vast array of outside factors. Tatonetti describes it as a quest to figure out “how these diseases could be dependent on birth month in a way that’s not just astrology.” Other researchers think data-mining might also be particularly beneficial for cancer patients, because so few types of cancer are represented in clinical trials. […]

In the lab, ensuring that the data-mining conclusions hold water can also be tricky. By definition, a medical-records database contains information only on sick people who sought help, so it is inherently incomplete. Also, [such databases] lack the controls of a clinical study and are full of other confounding factors that might trip up unwary researchers. Daniel Rubin, a professor of bioinformatics at Stanford, also warns that there have been no studies of data-driven medicine to determine whether it leads to positive outcomes more often than not. Because historical evidence is of “inferior quality,” he says, it has the potential to lead care astray.

Yet despite the pitfalls, developing a “learning health system” — one that can incorporate lessons from its own activities in real time — remains tantalizing to researchers.
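
Rubin’s warning about confounding is easy to make concrete. In the simulation below (all variables hypothetical), a treatment that genuinely helps looks harmful in a naive records comparison because sicker patients are more likely to receive it; stratifying on severity reverses the verdict:

```python
# Confounding by indication in simulated "records": a helpful treatment
# looks harmful because sicker patients are more likely to get it.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
severity = rng.random(n)                        # the confounder
treated = rng.random(n) < 0.1 + 0.8 * severity  # sicker -> more often treated
recovered = rng.random(n) < 0.9 - 0.6 * severity + 0.1 * treated  # true benefit: +0.1

df = pd.DataFrame({"severity": severity, "treated": treated, "recovered": recovered})

# Naive comparison: the treated group recovers less often overall.
print(df.groupby("treated")["recovered"].mean())

# Stratified on the confounder: the treated fare better within each band.
df["band"] = pd.cut(df["severity"], 4)
print(df.groupby(["band", "treated"], observed=True)["recovered"].mean())
```

The stratified analysis rescues this toy example only because I generated the confounder and then handed it to the analyst. In a real medical-records database, the relevant “severity” may be unrecorded, mismeasured, or unimagined, and that gap is exactly the distance between prediction and causal inference.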

8. Vinod Khosla expresses some ambitions:

Technology will reinvent healthcare. Healthcare will become more scientific, holistic and consistent; delivering better-quality care with inexpensive data-gathering techniques and devices; continual monitoring and ubiquitous information leading to personalized, precise and consistent insights. New medical discoveries will be commonplace, and the practices we follow will be validated by more rigorous scientific methods. Although medical textbooks won’t be “wrong,” the current knowledge in them will be replaced by more precise and advanced methods, techniques and understandings.

Hundreds of thousands or even millions of data points will go into diagnosing a condition and, equally important, the continual monitoring of a therapy or prescription. […]

Over time, we will see a 5×5 improvement across healthcare: 5x reduction in doctors’ work (shifted to data-driven systems), 5x increase in research (due to the transformation to the “science of medicine”), 5x lower error rate (particularly in diagnostics), 5x faster diagnosis (through software apps) and 5x cost reduction.

9. Larry Page thinks government regulation is slowing the promise of big data:

I am really excited about the possibility of data also, to improve health. But that’s– I think what Sergey’s saying, it’s so heavily regulated. It’s a difficult area. I can give you an example. Imagine you had the ability to search people’s medical records in the U.S. Any medical researcher can do it. Maybe they have the names removed. Maybe when the medical researcher searches your data, you get to see which researcher searched it and why. I imagine that would save 10,000 lives in the first year. Just that. That’s almost impossible to do because of HIPAA. I do worry that we regulate ourselves out of some really great possibilities that are certainly on the data-mining end.

10. Lindsey Cook writes about some of the barriers to big data (legal issues, physicians’ concerns, patients’ misunderstandings, technological barriers, misplaced research funding), though not about causal inference. Her piece includes a primer on what “big data” means (“an incredibly large amount of information”).

Big data is already producing research that has helped patients. For example, a data network for children with Crohn’s disease and ulcerative colitis called ImproveCareNow helped increase remission rates for sick children, according to Dr. Christopher Forrest and his colleagues, who are creating a national network of big data for children in the U.S.

* Via Twitter, David points to his other work in this area, which I have not read at the time of this writing: here, here, and here.

@afrakt
