Problem of the Week: Missing Data

Question: You have a supervised learning task with 30 predictors, in which 5% of the values are missing. The missing values are distributed randomly and independently across variables and records. If your strategy for coping with missing data is to drop every record that has a missing value, what proportion of the records will be dropped? Is the assumption of random distribution reasonable?

Answer: The probability that the first variable in a record will be missing is 0.05, so the probability that it will be present is 0.95. The probability that the second variable will be present is, likewise, 0.95. Because missingness is assumed to be independent across variables, the probability that the first and second variables will both be present is 0.95 * 0.95 = 0.9025. The probability that the first, second, and third variables will all be present is 0.95 * 0.95 * 0.95 = 0.8574. And so on. The probability that all 30 variables will be present is 0.95^30 = 0.215 or 21.5%, meaning that there is a 78.5% probability that at least one variable will be missing, in which case the record must be omitted. If each record has a 78.5% chance of being omitted, then, on average, 78.5% of the records will be dropped from the analysis.
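The arithmetic above can be checked with a short Python sketch. It assumes, as the question does, that each value is missing independently with probability 0.05; the function names are hypothetical and chosen only for illustration. The first function applies the formula 1 - 0.95^30 directly, and the second estimates the same quantity by simulation.

```python
import random


def prob_record_dropped(n_predictors: int, p_missing: float) -> float:
    """Probability that a record has at least one missing value,
    assuming each value is missing independently with probability p_missing."""
    return 1.0 - (1.0 - p_missing) ** n_predictors


def simulate_drop_rate(n_records: int, n_predictors: int,
                       p_missing: float, seed: int = 0) -> float:
    """Estimate the share of dropped records by simulating missingness."""
    rng = random.Random(seed)
    dropped = 0
    for _ in range(n_records):
        # A record is dropped if any one of its values is missing.
        if any(rng.random() < p_missing for _ in range(n_predictors)):
            dropped += 1
    return dropped / n_records


analytic = prob_record_dropped(30, 0.05)
simulated = simulate_drop_rate(20_000, 30, 0.05)
print(f"analytic:  {analytic:.3f}")   # prints "analytic:  0.785"
print(f"simulated: {simulated:.3f}")  # should land close to the analytic value
```

Calling prob_record_dropped with other values shows how quickly the loss grows with the number of predictors: even a small per-value missingness rate removes most records once there are dozens of variables.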

The assumption of random distribution is not very reasonable. Typically, missingness is concentrated in a limited number of variables and records. If just one or two variables have a lot of missing values, they can be omitted from the analysis. If a subset of records is missing a lot of values, this is often an indicator that there is something different about those records. In either case, a derived indicator variable that flags whether a record has a value for a given variable can have predictive power in a modeling task.
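As a rough illustration of the last two points, the sketch below measures how missingness is spread across variables and adds indicator columns. The records, variable names, and helper functions are all made up for this example, and None stands in for a missing value. The flag is coded 1 when the value is missing, which is simply the mirror image of flagging whether the value is present.

```python
def missing_share_by_variable(records, variables):
    """Share of records in which each variable is missing (None)."""
    n = len(records)
    return {var: sum(r.get(var) is None for r in records) / n
            for var in variables}


def add_missing_flags(records, variables):
    """Return copies of the records with a 0/1 '<var>_missing' flag per variable."""
    flagged = []
    for rec in records:
        new_rec = dict(rec)
        for var in variables:
            new_rec[f"{var}_missing"] = 1 if rec.get(var) is None else 0
        flagged.append(new_rec)
    return flagged


# Hypothetical toy data: missingness is concentrated in "income".
records = [
    {"age": 34, "income": 52000, "tenure": 3},
    {"age": None, "income": 61000, "tenure": 5},
    {"age": 45, "income": None, "tenure": None},
    {"age": 29, "income": None, "tenure": 1},
]
variables = ["age", "income", "tenure"]

shares = missing_share_by_variable(records, variables)
flagged = add_missing_flags(records, variables)
print(shares)
print(flagged[2])
```

The per-variable shares show which variables might be candidates for omission, and the flag columns can be passed to a model alongside the original predictors.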


About Statistics.com

Statistics.com offers academic and professional education in statistics, analytics, and data science at beginner, intermediate, and advanced levels of instruction. Statistics.com is a part of Elder Research, a data science consultancy with 25 years of experience in data analytics.


Contact

The Institute for Statistics Education
2107 Wilson Blvd
Suite 850 
Arlington, VA 22201
(571) 281-8817

ourcourses@statistics.com


© Copyright 2022 - Statistics.com, LLC | All Rights Reserved | Privacy Policy | Terms of Use
