Skip to content

Same thing, different terms..

The most striking one is “sample” – to statisticians it means a dataset selected from a larger dataset.

To computer scientists and machine learners it often means a single observation or record in a dataset. Other synonyms for the single record include exampleinstancepattern (from machine learning), case (statistics) or row (database technology).

Outliers are also anomalies.

An outcome variable is also called a dependent variableresponse variable, or a target variable.

In predictive modeling, subsets of the data are typically withheld from the model fitting process, then the fitted model is tested on the withheld data. Those withheld data are called the holdout data, the validation data, or the test data. In some applications two extra datasets are withheld, one of them to be used only at the very end, to assess bias in the ultimate chosen model (not to do further tuning of models). This third set is, in the original SAS terminology, the test data.

In predictive modeling a classification model predicts what category a record belongs to; in biostatistics a diagnostic test performs a similar role. In biostatistics, sensitivity is the proportion of 1’s (cases of interest) correctly identified by the test. The same metric is called recall in predictive modeling.

The term decision trees, by contrast, is used in two very different ways. In predictive modeling, decision trees learned from data establish rules for predictor variables that can be used to predict unknown outcome variables. In decision analysis, a decision-maker would construct a branching decision tree to account for different possible outcomes to events, and their probabilities and costs or benefits, to identify the maximum expected value of a particular decision path.