
Confusing Terms in Data Science – A Look at Homonyms and more Synonyms

Homonyms (words with multiple meanings):

Bias:  To a lay person, bias refers to an opinion about something that is formed in advance of the specific facts.  As consideration of ethical issues in data science grows, this meaning has crept into discussion of the fairness or social worth of machine learning algorithms.  But the term has a narrower definition in statistics – it refers to the tendency of an estimation procedure, or a model, to arrive at estimates or predictions that are, on balance, off target.
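
A quick simulation makes the statistical sense concrete: the “divide by n” sample variance is a biased estimator, because its estimates are, on average, too low (the data and the choice of estimator below are purely illustrative).

```python
import numpy as np

# Minimal sketch of statistical bias: the "divide by n" sample variance
# systematically underestimates the true variance, while the "divide by n-1"
# version does not. The true variance of the population below is 1.
rng = np.random.default_rng(0)
n, reps = 10, 100_000

biased, unbiased = [], []
for _ in range(reps):
    x = rng.normal(loc=0, scale=1, size=n)    # true variance = 1
    biased.append(x.var(ddof=0))              # divide by n
    unbiased.append(x.var(ddof=1))            # divide by n-1

print("mean of biased estimates:  ", round(np.mean(biased), 3))    # ~0.9, off target
print("mean of unbiased estimates:", round(np.mean(unbiased), 3))  # ~1.0, on target
```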

Confidence:  To a statistician, confidence measures the reliability of an estimate drawn from a sample (we are 95% confident that the average blood sugar in the group lies between X and Y, based on a sample of N patients).  To a machine learner, confidence can refer to a metric used in association rules (“what goes with what in market basket transactions”) – one of several measures of the strength of a rule.
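
The two senses can be sketched side by side; the blood sugar readings and market baskets below are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Statistician's confidence: a 95% confidence interval for mean blood sugar,
# based on a sample of N patients (simulated values).
blood_sugar = rng.normal(loc=100, scale=15, size=50)
mean = blood_sugar.mean()
se = stats.sem(blood_sugar)
lo, hi = stats.t.interval(0.95, len(blood_sugar) - 1, loc=mean, scale=se)
print(f"95% CI for mean blood sugar: ({lo:.1f}, {hi:.1f})")

# Machine learner's confidence: strength of the rule {bread} -> {butter},
# i.e. the share of bread-containing baskets that also contain butter.
transactions = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "milk"}, {"milk"}]
has_bread = [t for t in transactions if "bread" in t]
confidence = sum("butter" in t for t in has_bread) / len(has_bread)
print(f"confidence(bread -> butter) = {confidence:.2f}")   # 2 of 3 bread baskets
```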

Decision Trees:   To statisticians and machine learners, “decision trees,” also called “classification and regression trees” (CART), is a term for a class of algorithms that progressively partition data into chunks that are more and more homogeneous with respect to the outcome variable.  The result is a branching set of rules applied to predictor variables to predict the outcome. To an operations research specialist, “decision trees” are a representation of progressive decisions and possible outcomes, with probabilities, plus costs/benefits, attached to the outcomes.  The path ending in the highest expected value then guides decisions.
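
Both senses can be illustrated briefly; the scikit-learn tree and the expected-value arithmetic below use made-up numbers purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Statistics / machine learning sense: CART repeatedly splits the rows into
# groups that are more and more homogeneous in the outcome.
X = [[35, 0], [52, 1], [23, 0], [61, 1], [44, 0], [58, 1]]    # [age, smoker]
y = [0, 1, 0, 1, 0, 1]                                        # outcome: disease yes/no
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "smoker"]))     # the branching rules

# Operations research sense: a decision with uncertain outcomes, each carrying a
# probability and a payoff; the branch with the highest expected value guides the decision.
p_success, payoff_success, payoff_failure = 0.6, 100_000, -20_000
expected_value = p_success * payoff_success + (1 - p_success) * payoff_failure
print(expected_value)   # 52000.0
```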

Graph:  To a lay person, a graph usually means a visual representation of data, which statisticians more often refer to as plots and charts.  To a computer scientist, a graph refers to a data structure made up of entities (nodes) and the links (edges) between them.  Speaking of graphs, Wikipedia has an interesting Venn-style diagram of homonyms, synonyms, homographs and their cousins.
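
In the computer science sense, a tiny graph might be represented as an adjacency list; the names below are placeholders.

```python
# A graph as a data structure: nodes and the edges that link them,
# rather than a chart or plot.
graph = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice"},
    "Carol": {"Alice", "Dave"},
    "Dave": {"Carol"},
}
print(graph["Alice"])   # the nodes directly linked to Alice
```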

Normalize:  In statistics and machine learning, to normalize a variable is to rescale it so that it is on the same scale as other variables to be used in a model.  For example, subtracting the mean so that the variable is centered around 0, and dividing by the standard deviation so that its scale is consistent with the other normalized variables.  In database management, normalization refers to the process of organizing relational databases and their tables so that the data are not redundant and relations among tables are consistent.
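
A short sketch of the statistics/machine-learning sense (the income figures are invented):

```python
import numpy as np

# Normalizing (standardizing) a variable: subtract its mean, divide by its
# standard deviation, so the result is centered at 0 on a common scale.
income = np.array([32_000, 45_000, 51_000, 120_000], dtype=float)
normalized = (income - income.mean()) / income.std()
print(normalized.round(2))
```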

Sample:  In statistics, a sample is a collection of observations or records.  In computer science and machine learning, sample often refers to a single record.

Synonyms (different words for the same thing)

Record:   The prevalent non-time-series data format is the spreadsheet model, where each column is a variable, and each row is a record.  So a row might represent a patient, for example, and the cell values are measurements on variables. Statisticians will also call the record a case, or an observation.  In computer science, the terms instance, sample, or example might be used.
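
For instance, a small spreadsheet-style table in pandas, where each row is one record (the values are invented):

```python
import pandas as pd

# Each column is a variable; each row is a record -- a "case" or "observation"
# to a statistician, an "instance", "sample", or "example" in computer science.
patients = pd.DataFrame(
    {"age": [54, 61, 47], "blood_sugar": [102, 131, 95], "smoker": [0, 1, 0]}
)
print(patients.iloc[0])   # the first record: one patient's measurements
```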

Prediction:  In statistics and machine learning, prediction is the use of a model to predict individual outcomes on the basis of known predictor variables.  The term “estimation” is also used, though its use is generally limited to numeric outcomes (as opposed to categorical or binary).  In statistics, estimation more often refers to the use of a sample statistic (say, the mean) to measure some quantity, with the goal of interpreting that measurement as representative of a larger population.
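
A short sketch of the two uses, with synthetic data purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Estimation (statistics): use a sample statistic to estimate a population quantity.
sample = rng.normal(loc=100, scale=15, size=40)
print("estimated population mean:", round(float(sample.mean()), 1))

# Prediction (machine learning): use a fitted model to predict individual outcomes
# from known predictor values.
X = rng.uniform(0, 10, size=(40, 1))
y = 3 * X[:, 0] + rng.normal(scale=1, size=40)
model = LinearRegression().fit(X, y)
print("predicted outcome for x = 5:", model.predict([[5.0]]).round(1))
```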

Predictor variable:  In computer science and machine learning this can be called an attribute, input variable or feature.  In classical statistics, the term “independent variable” is used, and in database management the term “field” is applied.  In artificial intelligence applications, models must typically start with very low-level predictor information, such as pixel values or sound wavelengths.  Here the term “feature” refers not simply to a given predictor variable, but also to the process of aggregating low-level predictors into more informative “features” (also called “higher-level features”).
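
As a rough sketch of feature engineering in this broader sense, low-level pixel values might be aggregated into a couple of higher-level features (everything below is invented for illustration):

```python
import numpy as np

# Low-level predictors: the raw pixel values of a toy grayscale image.
image = np.random.default_rng(3).integers(0, 256, size=(28, 28))
pixels = image.ravel()

# Higher-level engineered features derived from those predictors.
features = {
    "mean_brightness": pixels.mean(),
    "fraction_dark": (pixels < 50).mean(),
}
print(features)
```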

Data partitions:  In predictive modeling, models are trained on data where the outcome is known.  To assess the performance of those models, a portion of the data is set aside, and the model is used to predict values that can be compared to the known values in this set-aside data.  Sometimes, particularly where there is a lot of iteration between the set-aside data and the training data to “tune” model parameters and select the best model, a third set-aside is used solely to estimate how well the model will do with new data.  These set-asides go by different names that do not necessarily denote which function they are serving: holdout data, test data, validation data.
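
One common way to carve out these set-asides (names and proportions vary from shop to shop; the 60/20/20 split below is just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Carve off a final test set, then split what remains into training data and a
# validation set used to tune model parameters and choose among candidate models.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```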