
Problem of the Week: Missing Data

Question: You have a supervised learning task with 30 predictors, in which 5% of the values are missing. The missing values are randomly distributed across variables and records. If your strategy for coping with missing data is to drop every record that has a missing value, what proportion of the records will be dropped? Is the assumption of random distribution reasonable?

Answer: The probability that the first variable in a record will be missing is 0.05, so the probability that it will be present is 0.95. The probability that the second variable will be present is, likewise, 0.95. Because missingness is assumed to be independent across variables, the probability that the first and second variables will both be present is 0.95 * 0.95 or 0.9025. The probability that the first, second, and third variables will all be present is 0.95 * 0.95 * 0.95 = 0.8574. And so on. The probability that all 30 variables will be present is 0.95^30 = 0.215 or 21.5%, meaning that there is a 78.5% probability that at least one variable will be missing and the record must be omitted. If each record has a 78.5% chance of being omitted, then, on average, 78.5% of the records will be dropped from the analysis.
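
The arithmetic above can be checked with a short sketch. It computes the closed-form probability and, as a sanity check, simulates records in which each value is missing independently with probability 0.05. The function names, the record count, and the random seed are illustrative assumptions, not part of the original problem.

```python
import random


def prob_complete(p_missing: float, n_predictors: int) -> float:
    """Probability that a record has no missing values,
    assuming each value is missing independently."""
    return (1.0 - p_missing) ** n_predictors


def simulate_drop_rate(p_missing: float, n_predictors: int,
                       n_records: int, seed: int = 0) -> float:
    """Fraction of simulated records with at least one missing value."""
    rng = random.Random(seed)
    dropped = 0
    for _ in range(n_records):
        # A record is dropped if any one of its predictors is missing.
        if any(rng.random() < p_missing for _ in range(n_predictors)):
            dropped += 1
    return dropped / n_records


p_keep = prob_complete(0.05, 30)
print(f"kept: {p_keep:.3f}, dropped: {1.0 - p_keep:.3f}")
print(f"simulated drop rate: {simulate_drop_rate(0.05, 30, 100_000):.3f}")
```

The simulated rate should land close to the closed-form 78.5%, with small differences due to sampling noise.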

The assumption of random distribution is not very reasonable. Typically, missingness is concentrated in a limited number of variables and records. If just one or two variables have a lot of missing values, they can be omitted from the analysis. If a subset of records is missing a lot of values, this is often an indicator that there is something different about those records. In either case, a derived variable that flags whether a record has a value for a given variable can have predictive power in a modeling task.
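
As a rough illustration of the two ideas above, the sketch below measures how missingness is spread across variables and adds a 0/1 flag for each variable. The records, variable names, and the `_missing` suffix are made up for the example, and missing values are assumed to be represented as `None`.

```python
def missing_rate_by_variable(records, variables):
    """Share of records in which each variable is missing (None)."""
    n = len(records)
    return {var: sum(rec.get(var) is None for rec in records) / n
            for var in variables}


def add_missing_flags(records, variables):
    """Return copies of the records with a 0/1 '<var>_missing' flag
    for each variable; the originals are left unchanged."""
    flagged = []
    for rec in records:
        out = dict(rec)
        for var in variables:
            out[f"{var}_missing"] = 1 if rec.get(var) is None else 0
        flagged.append(out)
    return flagged


# Hypothetical toy data: missingness is concentrated in "income".
records = [
    {"age": 34, "income": 52000, "tenure": 3},
    {"age": None, "income": 61000, "tenure": 5},
    {"age": 45, "income": None, "tenure": None},
    {"age": 29, "income": None, "tenure": 1},
]
variables = ["age", "income", "tenure"]

print(missing_rate_by_variable(records, variables))
print(add_missing_flags(records, variables)[2])
```

A per-variable missing rate like this is one simple way to spot variables that might be dropped, and the flag columns can be passed to a model alongside the original predictors.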