Confusing Terms in Data Science – A Look at Synonyms

Synonyms (different words for the same thing)

Record: The prevalent non-time-series data format is the spreadsheet model, where each column is a variable and each row is a record. A row might represent a patient, for example, with the cell values being that patient's measurements on each variable. Statisticians will also call the record a case or an observation. In computer science, the terms instance, sample, or example might be used.
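As a minimal sketch of the spreadsheet model, the snippet below stores a few records as rows. The patient data and the variable names (patient_id, age, systolic_bp) are invented for illustration only.

```python
# Hypothetical patient data: each dict is one record (row),
# and each key is a variable (column).
records = [
    {"patient_id": 1, "age": 54, "systolic_bp": 130},
    {"patient_id": 2, "age": 61, "systolic_bp": 142},
    {"patient_id": 3, "age": 47, "systolic_bp": 118},
]

# The variables are the columns shared by every record.
variables = list(records[0].keys())

# One column: the values of a single variable across all records.
ages = [r["age"] for r in records]

# The number of records (cases, observations, instances, ...).
n_records = len(records)
```

Whether a row is called a record, case, observation, or instance, it is the same object here: one element of the list.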

Prediction: In statistical and machine learning, prediction is the use of a model to predict individual outcomes on the basis of known predictor variables. The term “estimation” is also used, though generally only for numeric outcomes (as opposed to categorical or binary ones). In statistics, estimation more often refers to using a sample statistic (say, the mean) to measure some quantity, with the measurement interpreted as representing a larger population.
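The two senses can be contrasted in a short sketch. The data are made up, and the straight-line model fit by least squares is just one simple choice of model, assumed here for illustration.

```python
from statistics import mean

# Hypothetical sample: age and systolic blood pressure for five patients.
ages = [40, 50, 60, 70, 80]
bps = [120, 125, 130, 135, 140]

# Estimation in the statistical sense: a sample statistic (the mean)
# stands in for the unknown value in the larger population.
estimated_population_mean_bp = mean(bps)

# Prediction: fit a simple least-squares line to the sample, then use it
# to predict the outcome for one individual from a known predictor value.
x_bar, y_bar = mean(ages), mean(bps)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, bps))
    / sum((x - x_bar) ** 2 for x in ages)
)
intercept = y_bar - slope * x_bar


def predict_bp(age):
    """Predict blood pressure for a single individual of the given age."""
    return intercept + slope * age
```

The first result describes the population as a whole; the second is a statement about one individual, which is what machine learning usually means by prediction.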

Predictor variable: In computer science and machine learning, this can be called an attribute, input variable, or feature. In classical statistics, the term “independent variable” is used, and in database management the term “field” is applied. In artificial intelligence applications, models must typically start with very low-level predictor information, such as pixel values or sound wavelengths. Here the term “feature” means more than a given predictor variable: it also covers the aggregations of low-level predictors into more informative “features” (also called “higher-level features”), and the process of developing them.
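To make the idea of aggregating low-level predictors concrete, here is a small sketch. The tiny image, the function name extract_features, and the two hand-made features are assumptions for illustration; real systems typically learn such aggregations rather than hand-code them.

```python
from statistics import mean

# Hypothetical 4x4 grayscale image: each pixel value is a low-level predictor.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]


def extract_features(img):
    """Aggregate raw pixel values into a few more informative features."""
    pixels = [p for row in img for p in row]
    half = len(img[0]) // 2
    left = [p for row in img for p in row[:half]]
    right = [p for row in img for p in row[half:]]
    return {
        "mean_brightness": mean(pixels),
        # A large contrast between the two halves hints at a vertical edge.
        "left_right_contrast": abs(mean(right) - mean(left)),
    }


features = extract_features(image)
```

Sixteen raw pixel values are reduced to two numbers that say something about the image as a whole, which is the sense of “feature” described above.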

Data partitions: In predictive modeling, models are trained on data where the outcome is known. To assess the performance of those models, a portion of the data is set aside, and the model is used to predict values that can be compared to the known values in this set-aside data. Sometimes, particularly where there is a lot of iteration between the set-aside and the training data to “tune” model parameters and select the best model, a third set-aside is used solely to estimate how well the model will do with new data. These set-asides go by different names that do not necessarily denote which function they serve: holdout data, test data, validation data.
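A three-way partition might be sketched as follows. The function name partition, the 60/20/20 proportions, and the labels “validation” (for tuning) and “test” (for the final check) are assumptions; as noted above, other sources assign these names differently.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                    # used to fit the model
    valid = shuffled[n_train:n_train + n_valid]   # used to tune and select models
    test = shuffled[n_train + n_valid:]           # used once, to estimate performance on new data
    return train, valid, test


# Stand-in for 100 records; any sequence of rows would work.
train, valid, test = partition(range(100))
```

Every record lands in exactly one partition, so the final set-aside has played no part in training or tuning.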