Healthcare Analytics: Exploration versus Confirmation

Perhaps the most active application of analytics and data mining is healthcare. This week we look at one success story (the use of machine learning to predict diabetic retinopathy), one story of disappointment (the use of genetic testing in a puzzling disease), and a basic dichotomy in statistical analysis.

In his famous 1977 book that introduced the idea of exploratory data analysis, John Tukey described two different strands of statistical analysis:

  • Exploration

  • Confirmation

Tukey’s book, Exploratory Data Analysis, elevated the role of exploration, and he established the role of “data analyst” as opposed to statistician. Tukey was concerned with numerical summaries and plotting techniques that both simplify the story behind the data and dig deeper to add understanding. Those techniques took on a vibrant life in statistics, particularly the plotting techniques, which laid the foundation for the rich toolkit of data visualization techniques now available. He applied the term “confirmatory analysis” to the whole arena of statistical inference, with its complex set of formulas for hypothesis testing and confidence intervals.

Exploration is the process of looking at data in lots of different ways to see if there’s anything interesting going on. Confirmation is the process of validating that you’ve found something real, and not just random behavior. The best way to do this is to look at new data and see if the phenomenon holds up. We’ll keep this distinction in mind as we look at two cases in healthcare.

Diabetic Retinopathy and Deep Learning

Diabetes is the fastest-growing cause of blindness. Over 400 million people worldwide have diabetes and are at risk for diabetic retinopathy and possible blindness. Diabetics are likely to be on a regimen of regular blood sugar monitoring and frequent eye exams. Retinopathy, however, cannot be diagnosed with a quick exam of the eye; images must be taken and examined by a specialist, and in many parts of the world these specialists are few and far between. By the time an image has been reviewed and diagnosed, the patient will have left the clinic, and the odds of getting them on an appropriate therapy regimen have plummeted.

In 2016, a team of researchers from Google and several universities published the results of a study in which deep learning was used to classify eye images and assign a probability of retinopathy, which was converted to a diagnosis by setting a cutoff point. This challenge had earlier been the subject of a Kaggle competition; the Google team, using those results as a point of departure, brought in more data and achieved results equivalent to those of trained specialists. Considering that a consensus of specialist evaluations was the basis for “ground truth” in the study, these are good results indeed.

This study was not an exploratory one; the goal was not to locate factors that might be associated with retinopathy. The a priori purpose was simply to identify retinopathy. The images were all labeled as to whether disease was present, and a holdout set was used to evaluate the algorithm, to be sure it was not finding chance artifacts.
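The two mechanics described above, converting a predicted probability into a diagnosis with a cutoff and judging the result on held-out labeled cases, can be sketched in a few lines of Python. This is only an illustration: the scores and labels below are invented, the `evaluate` helper is hypothetical, and nothing here reproduces the study's model or data.

```python
# Hypothetical holdout set: (model probability of retinopathy, specialist label).
# These values are invented for illustration; they are not from the study.
holdout = [
    (0.92, 1), (0.81, 1), (0.67, 1), (0.35, 1),
    (0.58, 0), (0.22, 0), (0.10, 0), (0.05, 0),
]

def evaluate(scored, cutoff):
    """Return (sensitivity, specificity) when scores >= cutoff are called positive."""
    tp = sum(1 for p, y in scored if p >= cutoff and y == 1)  # diseased, flagged
    fn = sum(1 for p, y in scored if p < cutoff and y == 1)   # diseased, missed
    tn = sum(1 for p, y in scored if p < cutoff and y == 0)   # healthy, cleared
    fp = sum(1 for p, y in scored if p >= cutoff and y == 0)  # healthy, flagged
    return tp / (tp + fn), tn / (tn + fp)

# Moving the cutoff trades missed cases against false alarms.
for cutoff in (0.3, 0.5, 0.7):
    sens, spec = evaluate(holdout, cutoff)
    print(f"cutoff={cutoff:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Because the holdout cases play no part in fitting the model, performance measured on them is a fairer guide to how the classifier would behave on new patients than performance on the training images.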

The medical implications of the study are important – when the system is implemented, images can be evaluated immediately while the patient is in the clinic, and an appropriate therapy regimen started before the patient leaves.

Genetic Testing

The human genome was mapped in 2003, and the last 5 years have seen explosive growth in a completely new business: genetic tests. There are now over 75,000 such tests relating to different genes, and the race is on to find out which genes are associated with which disorders. There are close to 20,000 genes, and the tests typically focus on specific sets of genes in connection with particular disorders. This broad-scale undertaking is not a focused confirmatory study; it is exploration on a massive scale to find interesting correlations between genes, particularly genetic mutations, and diseases. There is little hope that targeted confirmatory studies (which can be expensive) will catch up with all the suggestions unearthed by the widespread genetic testing. In short, it is a recipe for lots of false positives.

This effect is illustrated in a Wall Street Journal story about a 4-year-old girl, Esme, afflicted with an unknown but debilitating circulatory and respiratory ailment. A genetic test in 2013 revealed a defect in the PCDH19 gene. The family dove deeply into research and engaged with a small community of those suffering from a similar defect. They established a foundation to fund research into PCDH19 defects. But in 2015, another genetic test suggested that PCDH19 was not at fault; rather, SCN8A was the culprit. The family shifted their foundation’s research over to SCN8A. In 2016, the lab that did the 2015 testing issued a reinterpretation of the prior results. SCN8A’s significance was now considered uncertain, and two new gene variants were implicated. A few months ago, the lab again contacted the parents with word that a new test was available, incorporating the latest information. The repeating cycle of hopes raised and then dashed, pathways opened then closed, has been discouraging and draining for the parents.

The ability to process huge data sets and conduct exploratory statistical analysis “at volume” leads to a proliferation of “findings” that are tantalizing but ephemeral. The significance of a “finding” is inversely related to the amount of searching that had to take place to produce it. John Elder, the founder of the highly regarded specialty data mining firm Elder Research, terms this the “vast search effect.”
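A small simulation can make the vast search effect concrete. The sketch below is a toy setup, not Elder's own formulation: every predictor is pure random noise, and the function names (`pearson`, `vast_search`) and the sample size of 50 cases are assumptions chosen for illustration. It searches a growing number of noise predictors for the strongest correlation with a noise outcome, then rechecks the "winner" on fresh data, which is the confirmation step.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vast_search(n_predictors, n_cases=50, seed=0):
    """Exploration: keep the strongest correlation among pure-noise predictors.
    Confirmation: measure the correlation again on freshly drawn data."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
    best_r = 0.0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_cases)]
        r = pearson(predictor, outcome)
        if abs(r) > abs(best_r):
            best_r = r
    # No predictor has any real link to the outcome, so new data for the
    # "winning" predictor is just another pair of independent noise draws.
    new_outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
    new_predictor = [rng.gauss(0, 1) for _ in range(n_cases)]
    return best_r, pearson(new_predictor, new_outcome)

for k in (1, 10, 100, 1000):
    best, confirm = vast_search(k)
    print(f"searched {k:>4} predictors: best r = {best:+.2f}, on new data r = {confirm:+.2f}")
```

The more predictors are searched, the more impressive the best correlation looks, even though nothing real is there; only the check on new data reveals whether a finding is likely to hold up.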

About Statistics.com

Statistics.com offers academic and professional education in statistics, analytics, and data science at beginner, intermediate, and advanced levels of instruction. Statistics.com is a part of Elder Research, a data science consultancy with 25 years of experience in data analytics.

 The Institute for Statistics Education is certified to operate by the State Council of Higher Education for Virginia (SCHEV)


© Copyright 2023 - Statistics.com, LLC | All Rights Reserved
