Predictor P-Values in Predictive Modeling

Not So Useful

Predictor p-values in linear models are a guide to the statistical significance of a predictor's coefficient: they estimate the probability that, if the predictor had no real relationship to the outcome (as when the outcome values are randomly shuffled), the model would produce a coefficient at least as large in magnitude as the fitted value. They are of limited utility in predictive modeling applications, for these reasons:

  • Software typically reports the p-value for in-sample (training) data, while in most predictive modeling applications you want to assess model performance on holdout data
  • They are often misinterpreted as measuring the importance of the predictor, or the probability that the model fits the data (neither is the case)
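
To make the shuffling interpretation concrete, here is a minimal sketch of a permutation p-value for a single-predictor linear model. The function names (`slope`, `permutation_p_value`), the made-up data, and the choice of 2,000 shuffles are all illustrative assumptions; statistical software normally reports a t-test p-value instead, which answers the same question analytically.

```python
import random

def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffles whose |slope| is at least the fitted |slope|."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(slope(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)

# Made-up data (an assumption for illustration only).
data_rng = random.Random(1)
x = [float(i) for i in range(30)]
y_related = [2.0 * xi + data_rng.gauss(0, 5) for xi in x]  # depends on x
y_noise = [data_rng.gauss(0, 5) for _ in x]                # unrelated to x

p_related = permutation_p_value(x, y_related)
p_noise = permutation_p_value(x, y_noise)
print(f"related predictor: p = {p_related:.4f}")
print(f"unrelated outcome: p = {p_noise:.4f}")
```

Note that a small p-value here says only that the fitted slope is unlikely under shuffling. It says nothing about how well the model predicts holdout data, which is the point of the first bullet above.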

There is one predictive modeling context in which they can be useful: eliminating variables to reduce the dimensionality of the data. A high p-value (say, above 0.20) is a good sign that the predictor's contribution to the model is not much greater than what random chance would produce, which makes that predictor a candidate for removal.
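
The screening idea can be sketched as follows. This is a simplified illustration, not a standard recipe: it tests each predictor on its own in a one-variable regression and uses a normal approximation to the t distribution (reasonable for large samples), whereas statistical software reports p-values from the full multi-predictor model. The helper names `slope_p_value` and `screen_predictors`, the synthetic data, and the predictor names are assumptions made for the example.

```python
import math
import random

def slope_p_value(x, y):
    """Two-sided p-value for the slope of y on x (normal approximation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # fitted slope
    a = mean_y - b * mean_x            # fitted intercept
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    z = abs(b) / se
    return math.erfc(z / math.sqrt(2))   # equals 2 * (1 - Phi(z))

def screen_predictors(predictors, y, threshold=0.20):
    """Split predictor names into (kept, dropped) using the p-value threshold."""
    kept, dropped = [], []
    for name, values in predictors.items():
        if slope_p_value(values, y) <= threshold:
            kept.append(name)
        else:
            dropped.append(name)
    return kept, dropped

# Synthetic data: one real predictor and two pure-noise predictors.
rng = random.Random(42)
n = 200
predictors = {
    "signal": [rng.gauss(0, 1) for _ in range(n)],
    "noise_a": [rng.gauss(0, 1) for _ in range(n)],
    "noise_b": [rng.gauss(0, 1) for _ in range(n)],
}
y = [3.0 * s + rng.gauss(0, 1) for s in predictors["signal"]]

kept, dropped = screen_predictors(predictors, y)
print("kept:", kept)
print("dropped:", dropped)
```

With a 0.20 threshold, a pure-noise predictor still passes the screen about one time in five, so the filter thins out weak candidates rather than guaranteeing that every retained variable is useful.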