OVERFIT

February 14, 2018 | Word of the Week

Standard linear regression is less prone to overfitting; its rigid linear structure does not allow the model to “bend” to accommodate noise. Even so, the fitted line (or, with several predictors, the fitted multi-dimensional relationship) is chosen to give the best fit to the particular data used to estimate it, so it will probably not do quite as well with another set of data drawn from the same population.
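A minimal sketch of this effect, using only the Python standard library. The population (y = 2 + 3x plus Gaussian noise), the sample size, and the function names are all assumptions made for illustration. Because any single comparison is noisy, the sketch averages the errors over many repeated samples.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(xs, ys, intercept, slope):
    """Mean squared error of the line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def draw_sample(rng, n):
    """Hypothetical population: y = 2 + 3x + Gaussian noise (sd 1)."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def average_errors(trials=500, n=20, seed=0):
    """Average the error on the fitting data and on a fresh sample."""
    rng = random.Random(seed)
    fit_total = fresh_total = 0.0
    for _ in range(trials):
        xs, ys = draw_sample(rng, n)
        intercept, slope = fit_line(xs, ys)
        fit_total += mse(xs, ys, intercept, slope)
        # A second sample from the same population, not used in fitting.
        new_xs, new_ys = draw_sample(rng, n)
        fresh_total += mse(new_xs, new_ys, intercept, slope)
    return fit_total / trials, fresh_total / trials

fit_error, fresh_error = average_errors()
print(f"average MSE on the fitting data: {fit_error:.3f}")
print(f"average MSE on fresh data:       {fresh_error:.3f}")
```

On average the line does somewhat better on the data it was fit to than on fresh data, though the gap is modest compared with what a more flexible model can show.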

Machine learning algorithms like neural nets and decision trees are the most vulnerable to overfitting. Left to their own devices, they can fit every point in the data just like the plot above, i.e., they model all the noise in the data along with the signal. So the question of how to stop algorithms like this at an appropriate stage looms large in their implementation.
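The following sketch illustrates the point with a small one-predictor regression tree, written from scratch. It is a simplified stand-in for illustration, not any particular library's algorithm. The data-generating process (y = 4x plus Gaussian noise), the depth limit of 2, and all function names are assumptions.

```python
import random

def sse(points):
    """Sum of squared deviations of y from its mean."""
    mean = sum(y for _, y in points) / len(points)
    return sum((y - mean) ** 2 for _, y in points)

def fit_tree(points, max_depth=None):
    """Grow a regression tree on (x, y) pairs.

    A leaf is a float (the mean y); a split is (threshold, left, right).
    max_depth=None applies no stopping rule, so the tree keeps splitting
    until it cannot split any further.
    """
    points = sorted(points)
    mean = sum(y for _, y in points) / len(points)
    if max_depth == 0 or len(points) == 1:
        return mean
    best_cost, best_i = None, None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        cost = sse(points[:i]) + sse(points[i:])
        if best_cost is None or cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return mean
    deeper = None if max_depth is None else max_depth - 1
    return (points[best_i][0],
            fit_tree(points[:best_i], deeper),
            fit_tree(points[best_i:], deeper))

def predict(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree

def mse(tree, points):
    return sum((y - predict(tree, x)) ** 2 for x, y in points) / len(points)

def draw_sample(rng, n):
    """Hypothetical population: y = 4x + Gaussian noise (sd 1)."""
    sample = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        sample.append((x, 4 * x + rng.gauss(0, 1)))
    return sample

rng = random.Random(1)
training = draw_sample(rng, 40)
fresh = draw_sample(rng, 40)

full_tree = fit_tree(training)                # no stopping rule
small_tree = fit_tree(training, max_depth=2)  # stopped early (assumed limit)

for name, tree in [("full", full_tree), ("depth 2", small_tree)]:
    print(f"{name:8s} training MSE {mse(tree, training):.3f}   "
          f"fresh-data MSE {mse(tree, fresh):.3f}")
```

The unrestricted tree ends with one training point per leaf, so its training error is zero: it has reproduced the noise exactly. Its error on fresh data is what reveals the problem. A depth limit is only one of several possible stopping rules.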

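A key tool in keeping overfitting under control is a holdout sample: data drawn from the same population as the data used to fit the model, but set aside and not used in the fitting. The model's performance on the holdout sample gives a more honest estimate of how it will do with new data. Holding out data is a common procedure, particularly with problems involving plentiful (“big”) data. (Learn more in our online course Predictive Analytics 1.)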
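As a rough sketch of how a holdout sample might be used, the code below sets aside 30% of a dataset, fits models of increasing flexibility to the rest, and picks the one with the lowest holdout error. The model (mean of y within equal-width bins of x) is a deliberately simple stand-in chosen for illustration. The data, the 30% fraction, the candidate bin counts, and all names are assumptions.

```python
import random

def holdout_split(rows, holdout_fraction, rng):
    """Shuffle a copy of rows and set aside a fraction as the holdout sample."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def bin_index(x, n_bins):
    """Index of the equal-width bin on [0, 1) that contains x."""
    return min(int(x * n_bins), n_bins - 1)

def fit_binned_means(rows, n_bins):
    """Stand-in model: predict the mean y within each bin of x."""
    overall = sum(y for _, y in rows) / len(rows)
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in rows:
        b = bin_index(x, n_bins)
        totals[b] += y
        counts[b] += 1
    # Bins with no training data fall back to the overall mean.
    return [totals[b] / counts[b] if counts[b] else overall
            for b in range(n_bins)]

def mse(bin_means, rows):
    n_bins = len(bin_means)
    return sum((y - bin_means[bin_index(x, n_bins)]) ** 2
               for x, y in rows) / len(rows)

rng = random.Random(2)
# Hypothetical data: y = 4x + Gaussian noise (sd 1)
data = []
for _ in range(200):
    x = rng.uniform(0, 1)
    data.append((x, 4 * x + rng.gauss(0, 1)))

training, holdout = holdout_split(data, 0.3, rng)

candidates = [1, 2, 4, 8, 16, 32, 64, 128]  # more bins = more flexible model
results = {}
for n_bins in candidates:
    model = fit_binned_means(training, n_bins)  # fit on training data only
    results[n_bins] = (mse(model, training), mse(model, holdout))
    print(f"{n_bins:4d} bins   training MSE {results[n_bins][0]:.3f}   "
          f"holdout MSE {results[n_bins][1]:.3f}")

best_bins = min(candidates, key=lambda b: results[b][1])
print("lowest holdout error with", best_bins, "bins")
```

Training error can only improve as the bins get finer, so it cannot indicate when to stop; the holdout error can. One caveat: once the holdout sample has been used to choose a model, its error for that model is itself somewhat optimistic, which is why a third, untouched sample is often kept for a final check.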