
Week #49 – Data partitioning

In data partitioning, the available data are split into several parts: typically a training partition used to fit the model, plus validation and test partitions. The latter two partitions are also called hold-out samples or holdout partitions. If the data set is very large, often only a portion of it is selected for the partitions. Partitioning is normally used when the model for the data at hand is being chosen from a broad set of candidate models. The basic idea of data partitioning is to keep a subset of the available data out of the analysis, and to use it later for verification of the model.
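A minimal sketch of such a split, using NumPy; the `partition` function, the 60/20/20 fractions, and the fixed seed are illustrative choices, not part of any standard API:

```python
import numpy as np

def partition(data, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle and split data into training, validation, and test partitions.

    The validation and test partitions are the hold-out samples: they are
    kept out of model fitting and used later to verify the model.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))          # random order, so splits are unbiased
    n_train = int(train_frac * len(data))
    n_valid = int(valid_frac * len(data))
    train = data[idx[:n_train]]
    valid = data[idx[n_train:n_train + n_valid]]
    test = data[idx[n_train + n_valid:]]      # remainder becomes the test partition
    return train, valid, test

data = np.arange(100)
train, valid, test = partition(data)
print(len(train), len(valid), len(test))  # 60 20 20
```

Shuffling before splitting matters: if the data are ordered (e.g. by time or by class), a naive contiguous split would give partitions with different distributions.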

For example, suppose a researcher has developed a method for predicting a time series of stock prices. The parameters of the model have been fitted to the available data, and the model demonstrates high prediction accuracy on those data. But this does not necessarily mean that the model will predict new data that well: the model has been tuned to the characteristics, including purely random aspects, of the data used to fit it. Data partitioning is used to avoid such overly optimistic estimates of model accuracy.
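This effect can be demonstrated with a toy experiment (a sketch, not the researcher's actual method): an overly flexible model fitted to noisy data shows a low error on the data it was fitted to, while its error on fresh data from the same process is typically noticeably larger.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = x + rng.normal(scale=0.2, size=x.size)   # noisy linear trend

# Deliberately flexible model: a degree-9 polynomial fitted to all 30 points,
# so it partly memorizes the noise in the fitting data.
coeffs = np.polyfit(x, y, 9)
fit_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)

# Fresh data from the same process: the "new data" the model will face.
x_new = rng.uniform(0.0, 1.0, 30)
y_new = x_new + rng.normal(scale=0.2, size=x_new.size)
new_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)

print(fit_mse, new_mse)
```

Judging the model by `fit_mse` alone is exactly the overly optimistic estimate the paragraph warns about; a hold-out partition plays the role of `x_new`, `y_new` here.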

Data partitioning is normally used in supervised learning techniques in data mining, where a predictive model is chosen from a set of candidate models using their performance on the validation set as the criterion of choice.
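A compact illustration of this selection procedure, with polynomial degree standing in for the set of candidate models (the data, the degree range, and the split sizes are all assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Hold out 20 of the 60 points as the validation partition.
idx = rng.permutation(x.size)
train_idx, valid_idx = idx[:40], idx[40:]

# Fit each candidate model on the training partition only, then score it
# on the validation partition; keep the candidate with the lowest error.
best = None
for degree in range(1, 10):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    pred = np.polyval(coeffs, x[valid_idx])
    mse = np.mean((pred - y[valid_idx]) ** 2)
    if best is None or mse < best[1]:
        best = (degree, mse)

print("chosen degree:", best[0])
```

Because the validation points took no part in fitting, their error is an honest basis for comparing the candidates; a separate test partition would then estimate the accuracy of the single chosen model.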