Glossary of statistical terms
Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set , the validation set , and the test set . If the data set is very large, often only a portion of it is selected for the partitions. Partitioning is normally used when the model for the data at hand is being chosen from a broad set of models. The basic idea of data partitioning is to keep a subset of available data out of analysis, and to use it later for verification of the model.
For example, a researcher developed a method for prediction of time series of stock prices data. The parameters of the model have been fitted to the available data, and the model demonstrates high prediction accuracy on these data. But this does not necessarily mean that the model will predict new data that well -- the model has been especially tuned to the characteristics (including random chance aspects) of the data used to fit it. Data partitioning is used to avoid such overly optimistic estimates of the model precision.
Data partitioning is normally used in supervised learning techniques in data mining where a predictive model is chosen from a set of models, using their performance on the training set as the validation of choice. Some examples of such techniques are classification trees , regression trees , neural networks , nonlinear variants of the discriminant analysis .
Want to learn more about this topic?
Statistics.com offers over 100 courses in statistics from introductory to advanced level. Most are 4 weeks long and take place online in series of weekly lessons and assignments, requiring about 15 hours/week. Participate at your convenience; there are no set times when you must to be online. Ask questions and exchange comments with the instructor and other students on a private discussion board throughout the course.
This course covers the two core paradigms that account for most business applications of predictive modeling: classification and prediction. The course includes hands-on work with XLMiner, a data-mining add-in for Excel.
This course covers key unsupervised learning techniques - association rules, principal components analysis, and clustering. The course will include an integration of supervised and unsupervised learning techniques.
This course will teach you how to choose an appropriate time series model, fit the model, to conduct diagnostics, and use the model for forecasting.