There was an interesting article a couple of weeks ago in the New York Times magazine section on the role that Big Data can play in treating patients — discovering things that clinical trials are too slow, too expensive, and too blunt to find. The story was about a very particular set of lupus symptoms, and how a doctor, on a hunch, searched a large database and found that those symptoms were associated with an increased propensity for blood clots.
However, a search of the medical literature turned up nothing on the subject. What to do? The patient was treated with anticoagulant medication, and did not develop a blood clot. Of course, this does nothing to prove that the association was there in the first place. And on the flip side of the coin lies recent research about the non-replicability of scientific research.
A recent study looked at more than four dozen health claims that researchers arrived at by examining existing data for possible associations, not by conducting controlled experiments. These claims all had one thing in common: they were later tested in controlled experiments. Astonishingly, not one of the claims held up in the controlled experiment.
Various reasons have been posited for the parlous state of scientific and medical research, including fraud and outright error, but a key issue is what statisticians call the “multiple comparisons problem.” Even in completely randomly generated data, interesting patterns appear. If the data are big enough and the search exhaustive enough, the patterns can be very compelling.
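To see how easily this happens, here is a minimal sketch in Python using nothing but random numbers. Everything in it is an assumption made for illustration: the 100 "patients," the 1,000 candidate variables, the function names, and the cutoff of 1.96 divided by the square root of the sample size, which is only a rough normal approximation to the usual 5% significance level for a correlation.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_hits(n_patients=100, n_variables=1000, seed=0):
    """Correlate many pure-noise variables with a pure-noise outcome.

    Returns how many pass the approximate 5% cutoff, plus the
    strongest absolute correlation seen along the way.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
    # Rough two-sided 5% cutoff for r when there is no real association.
    cutoff = 1.96 / math.sqrt(n_patients)
    hits = 0
    strongest = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_patients)]
        r = pearson_r(candidate, outcome)
        if abs(r) > cutoff:
            hits += 1
        strongest = max(strongest, abs(r))
    return hits, strongest


hits, strongest = count_spurious_hits()
print(f"'Significant' associations found in pure noise: {hits} of 1000")
print(f"Strongest absolute correlation: {strongest:.2f}")
```

Roughly one variable in twenty should clear the cutoff even though none of them has anything to do with the outcome, and the strongest of them can look quite persuasive if it is reported on its own.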
So was the lupus association for real, or a fluke of Big Data? There’s no way to know after the fact. The best we can do is to conduct what Lopiano, Obenchain and Young call “fair comparisons.” One principle is that the researcher should begin with a hypothesis to be tested, then proceed to test it on the available data, without letting any knowledge of the outcomes guide the analysis. This avoids the erroneous results that happen when you simply “look for something interesting until you find it.”
There remains the problem of hidden differences among patients, a problem that, in large controlled experiments, is effectively “washed out” by the random assignment process. Random assignment of treatment is not possible in observational data, so Lopiano et al. propose clustering the patients into relatively homogeneous, and possibly quite small, groups, within which the effects of treatments can be compared among similar patients. In this way, different treatment effects for different sorts of patients can be identified.
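The idea of comparing like with like can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the method Lopiano et al. actually use: the data are synthetic, the single "severity" score and all function names are hypothetical, and simple binning on severity stands in for a real clustering step. In the simulated data, sicker patients are more likely to be treated, lower outcome scores are better, and the treatment helps severe patients more than mild ones.

```python
import random
from statistics import mean


def simulate_patients(n=20000, seed=1):
    """Synthetic patients as (severity, treated, outcome) tuples."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        severity = rng.random()
        # Sicker patients are more likely to receive the treatment.
        treated = rng.random() < 0.1 + 0.8 * severity
        # Assumed true benefit: 1 point if mild, 3 points if severe.
        true_effect = 1.0 if severity < 0.5 else 3.0
        outcome = 10 * severity + rng.gauss(0, 1)
        if treated:
            outcome -= true_effect
        patients.append((severity, treated, outcome))
    return patients


def mean_difference(group):
    """Mean outcome of treated minus mean outcome of untreated."""
    treated = [y for _, t, y in group if t]
    untreated = [y for _, t, y in group if not t]
    return mean(treated) - mean(untreated)


def effects_by_stratum(patients, n_strata=10):
    """Bin patients by severity and compare treatments within each bin."""
    strata = [[] for _ in range(n_strata)]
    for patient in patients:
        index = min(int(patient[0] * n_strata), n_strata - 1)
        strata[index].append(patient)
    return [mean_difference(stratum) for stratum in strata]


patients = simulate_patients()
naive = mean_difference(patients)
per_stratum = effects_by_stratum(patients)

print(f"Naive treated-minus-untreated difference: {naive:+.2f}")
for i, effect in enumerate(per_stratum):
    print(f"Severity bin {i}: {effect:+.2f}")
```

Pooling everyone together makes the treatment look useless or even harmful, because the treated patients were sicker to begin with. Within each group of similar patients the benefit reappears, and it differs between the mild and severe groups. In real data the relevant differences are rarely captured by one measured variable, which is why the grouping step matters so much.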
The moral? Rapid growth in the digitization and availability of patient data and health data in general holds great potential for medical research and personalized medicine. However, appropriate statistical methodology and sound study design are needed to unlock this potential, and guard against error.