Glossary of statistical terms

Data Partition:

Data partitioning in data mining is the division of the available data into two or three non-overlapping sets: the training set, the validation set, and the test set. If the data set is very large, often only a portion of it is selected for the partitions. Partitioning is normally used when the model for the data at hand is being chosen from a broad set of models. The basic idea of data partitioning is to hold a subset of the available data out of the analysis and to use it later to verify the model.
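The division described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `partition`, the 60/20/20 proportions, and the fixed seed are all assumptions made for the example.

```python
import random

def partition(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle records and split them into non-overlapping
    training, validation, and test lists.

    The default 60/20/20 proportions are an assumption for this sketch.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_valid = round(fractions[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains goes to the test set
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Random shuffling suits records that can be treated as exchangeable. For time-ordered data, a chronological split (earlier observations for training, later ones held out) is usually preferred.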

For example, suppose a researcher has developed a method for predicting a time series of stock prices. The parameters of the model have been fitted to the available data, and the model demonstrates high prediction accuracy on these data. But this does not necessarily mean that the model will predict new data equally well: the model has been tuned to the characteristics (including random chance aspects) of the data used to fit it. Data partitioning is used to avoid such overly optimistic estimates of the model's accuracy.
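The optimism described above can be made concrete with a small simulation. The sketch below does not use stock prices. It substitutes hypothetical synthetic data (y = 2x plus noise) and a one-nearest-neighbor predictor, chosen because it memorizes the data it was fitted to. All names here are illustrative assumptions.

```python
import random

rng = random.Random(42)

def make_data(n):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

def predict_1nn(fit_data, x):
    """Predict y by copying the response of the closest fitted point."""
    return min(fit_data, key=lambda p: abs(p[0] - x))[1]

def mse(fit_data, eval_data):
    """Mean squared prediction error on eval_data."""
    return sum((predict_1nn(fit_data, x) - y) ** 2
               for x, y in eval_data) / len(eval_data)

fit_data = make_data(50)   # data used to fit the model
holdout = make_data(50)    # data kept out of the fitting

train_error = mse(fit_data, fit_data)    # each point is its own nearest neighbor
holdout_error = mse(fit_data, holdout)
print(f"error on fitting data:  {train_error:.3f}")
print(f"error on held-out data: {holdout_error:.3f}")
```

The error measured on the fitting data is zero by construction, because the predictor reproduces the noise it has memorized. The held-out error is the more realistic estimate of how the model would perform on new data.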

Data partitioning is normally used in supervised learning techniques in data mining, where candidate models are fitted to the training set and the final model is chosen according to its performance on the validation set. Some examples of such techniques are classification trees, regression trees, neural networks, and nonlinear variants of discriminant analysis.
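As a rough illustration of choosing a model from a set of candidates, the sketch below fits k-nearest-neighbor regressions for several values of k, picks the value with the lowest validation error, and only then measures error on the test set. The synthetic data, the candidate list, and the 180/60/60 split are assumptions for the example. k-nearest neighbors stands in for the techniques named above only because it needs no external library.

```python
import random

rng = random.Random(7)

def make_data(n):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

def predict_knn(train, x, k):
    """Average the responses of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model on data."""
    return sum((predict_knn(train, x, k) - y) ** 2
               for x, y in data) / len(data)

# Records are generated independently, so slicing gives a random partition.
data = make_data(300)
train, valid, test = data[:180], data[180:240], data[240:]

candidates = [1, 3, 5, 9, 15]                        # the "set of models"
valid_errors = {k: mse(train, valid, k) for k in candidates}
best_k = min(valid_errors, key=valid_errors.get)     # chosen on validation data
test_error = mse(train, test, best_k)                # test set used once, at the end

print("validation errors:", valid_errors)
print("chosen k:", best_k)
print(f"test error of chosen model: {test_error:.3f}")
```

Because the validation set took part in selecting k, its error for the chosen model tends to be somewhat optimistic. The test set, untouched until the end, gives the cleaner estimate.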






