## Confusing Terms in Data Science – A Look at Homonyms and more Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these:

## Jaquard’s coefficient

When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records.  One of them is the “yacht owner” problem.

## Rectangular data

Rectangular data are the staple of statistical and machine learning models.  Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.

## Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample.  It can occur in numerous situations, here are just a few:

## Likert Scale

A “likert scale” is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale.  For example, a response could range from “agree strongly” through “agree somewhat” and “disagree somewhat” on to “disagree strongly.”  Two key decisions the survey designer faces are

• How many gradients to allow, and

• Whether to include a neutral midpoint

## Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:

## Curbstoning

Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb.  With a pretense of being an individual selling a car on his or her own, and with no fixed location, such dealers avoid the fixed costs and regulations thatContinue reading “Curbstoning”

## Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2.  However, …

## Conditional Probability Word of the Week

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as “fraud”).  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?

## Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, “churns.”

## ROC Curve

The Receiver Operating Characteristics (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s.  For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:

## Prospective vs. Retrospective

A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you’re measuring, who you’re measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition of the problem under study does not change once theContinue reading “Prospective vs. Retrospective”

## “out-of-bag,” as in “out-of-bag error”

“Bag” refers to “bootstrap aggregating,” repeatedly drawing of bootstrap samples from a dataset and aggregating the results of statistical models applied to the bootstrap samples. (A bootstrap sample is a resample drawn with replacement.)

## BOOTSTRAP

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.

## Same thing, different terms..

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.

## CONVOLUTION and TENSOR

Today’s Words of the Week are convolution and tensor, key components of deep learning.

## HYPERPARAMETER

Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.