Word of the Week – Label Spreading

A common problem in machine learning is the “rare case” situation. In many classification problems, the class of interest (fraud, purchase by a web visitor, death of a patient) is rare enough that a data sample may not have enough instances to generate useful predictions…

Word of the Week – Incidence versus Prevalence

Epidemiological terms are top of mind now, due to the pandemic. Here are two that often confuse: incidence and prevalence. For example, I encountered the following sentence on a popular medical web site: “Knee meniscal injuries are common with an incidence of 61 cases per…

Words of the Week – Inference and Confidence

An often-overlooked basic part of learning new things is vocabulary: if you don’t fully understand the meaning of terms, you are handicapped. Worse, if you think you do understand, but that understanding is wrong, you are deprived of the ability to identify the gap in…

Word of the Week – Ruin Theory

The classic Gambler’s Ruin puzzle has an actuarial parallel:  “Ruin Theory,” the calculations that govern what an insurance company should charge in premiums to reduce the probability of “ruin” for a given insurance line.  “Ruin” means encountering claims that exhaust initial reserves plus accumulated premiums. …

Word of the Week:  Bias

In this feature, we sometimes highlight terms that can have different meanings to different parts of the data science community, or in different contexts. Today’s term is “bias.” To the lay person, and to those worried about the ethical problems sometimes posed by the deployment…

Word of the Week – Entity Extraction

In Natural Language Processing (our course on the subject starts Jan 15), entity extraction is the process of labeling chunks of text as entities (e.g. people or organizations).  Consider this phrase from the blog on close elections linked above:   “the tie was not between Jefferson…
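
For illustration, a minimal sketch using the spaCy library (an assumption - the post does not name a tool; it requires the en_core_web_sm model to be installed):

    import spacy

    # assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("the tie was not between Jefferson")
    for ent in doc.ents:
        print(ent.text, ent.label_)  # "Jefferson" would likely be tagged PERSON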

Type III Error

Type I error in statistical analysis is incorrectly rejecting the null hypothesis - being fooled by random chance into thinking something interesting is happening.  The arcane machinery of statistical inference - significance testing and confidence intervals - was erected to avoid Type I error.  Type II error…

Comments Off on Type III Error

Relative Risk Ratio and Odds Ratio

The Relative Risk Ratio and Odds Ratio are both used to measure the medical effect of a treatment or variable to which people are exposed. The effect could be beneficial (from a therapy) or harmful (from a hazard).  Risk is the number of those having…
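
In the standard 2x2 notation (not necessarily the post's own), with a = exposed with the outcome, b = exposed without, c = unexposed with, d = unexposed without:

    RR = \frac{a/(a+b)}{c/(c+d)}, \qquad OR = \frac{a/b}{c/d} = \frac{ad}{bc}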

Endpoint or Outcome (example: Covid-19 vaccine)

In a randomized experiment, the endpoint or outcome is a formal measure (statistic) of the result of the experiment.  In a randomized clinical trial preparatory to regulatory submission, there is often more than one outcome, due to the time and expense involved in conducting a…

Link Function

In generalized linear models, a link function maps a nonlinear relationship to a linear one so that a linear model can be fit (and then mapped to the original form).  For example, in logistic regression, we want to find the probability of success:  P(Y =…
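
For the logistic case, the logit link maps the probability p = P(Y = 1) to a linear function of the predictors:

    \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}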

Model Interpretability

Model interpretability refers to the ability for a human to understand and articulate the relationship between a model’s predictors and its outcome.  For linear models, including linear and logistic regression, these relationships are seen directly in the model coefficients.  For black-box models like neural nets,…

Polytomous

Polytomous, applied to variables (usually outcome variables), means multi-category (i.e. more than two categories).  Synonym:  multinomial. 

Bayesian Statistics

Bayesian statistics provides probability estimates of the true state of the world. An unremarkable statement, you might think - what else would statistics be for? But classical frequentist statistics, strictly speaking, only provide estimates of the state of a hothouse world, estimates that must be translated…

Density

As Covid-19 continues to spread, so will research on its behavior.  Models that rely mainly on time-series data will expand to cover relevant other predictors (covariates), and one such predictor will be gregariousness.  How to measure it?  In psychology there is the standard personality trait…

Parameterized

Parameterized code in computer programs (or visualizations or spreadsheets) is code where the arguments being operated on are defined once as a parameter, at the beginning, so they do not have to be repeatedly explicitly defined in the body of the code.  This allows for…
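
A minimal sketch (the file name and cutoff value here are hypothetical):

    # The parameters are defined once, up front, rather than re-typed
    # at every point of use in the body of the code.
    INPUT_FILE = "scores.csv"
    CUTOFF = 0.475

    def load_scores(path=INPUT_FILE):
        with open(path) as f:
            return [float(line) for line in f]

    def flag_high(scores, cutoff=CUTOFF):
        return [s >= cutoff for s in scores]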

Sensitivity and Specificity

We defined these terms already (see this blog), but how can you remember which is which, so you don’t have to look them up?  If you can remember the order in which to recite them - sensitivity then specificity - it’s easy.  Think “positive and negative”…

Decision Stumps

A decision stump is a decision tree with just one decision, leading to two or more leaves. For example, in this decision stump a borrower score of 0.475 or greater leads to a classification of “loan will default” while a borrower score less than 0.475…
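
The stump described above, sketched as code (the label on the second leaf is an assumption, since the excerpt is cut off):

    def classify_loan(borrower_score):
        # one decision, two leaves
        if borrower_score >= 0.475:
            return "loan will default"
        return "loan will not default"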

R0 (R-nought)

For infectious diseases, R0 (R-nought) is the unimpeded replication rate of the disease pathogen in a naive (not immune) population.  An R0 of 2 means that each person with the disease infects two others.  Some things to keep in mind:

  • An R0 of one means…

Hazard

In biostatistics, hazard, or the hazard rate, is the instantaneous rate of an event (death, failure…).  It is the probability of the event occurring in a (vanishingly) small period of time, divided by the amount of time (mathematically it is the limit of this quantity…
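
Written out, with T denoting the time of the event:

    h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}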

Standardized Death Rate

Often the death rate for a disease is fully known only for a group where the disease has been well studied.  For example, the 3711 passengers on the Diamond Princess cruise ship are, to date, the most fully studied coronavirus population.  All passengers were tested…

Regularized Model

In building statistical and machine learning models, regularization is the addition of penalty terms on the predictor coefficients to the model’s objective function, discouraging complex models that would otherwise overfit the data.  An example is ridge regression.
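
In ridge regression, for example, the penalty is the sum of squared coefficients, weighted by a tuning parameter lambda:

    \hat{\beta} = \arg\min_{\beta} \left[ \sum_i (y_i - x_i^\top \beta)^2 + \lambda \sum_j \beta_j^2 \right]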

Ridge Regression

Ridge regression is a method of penalizing coefficients in a regression model to force a more parsimonious model (one with smaller, more stable coefficients) than would be produced by an ordinary least squares model. The term “ridge” was applied by Arthur Hoerl in 1970, who saw similarities…

Factor

The term “factor” has different meanings in statistics that can be confusing because they conflict.   In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical - a factor variable is the same thing as a categorical variable.  These factor variables…

Purity

In classification, purity measures the extent to which a group of records share the same class.  It is also termed class purity or homogeneity, and sometimes impurity is measured instead.  The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where…
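
To make the formula concrete: a 50/50 group has impurity 0.5 × 0.5 = 0.25 (the maximum), while a 90/10 group has 0.9 × 0.1 = 0.09 - purer groups score lower.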

Predictor P-Values in Predictive Modeling

Not so useful: predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value - they measure the probability that a randomly shuffled model could have produced a coefficient as great as the fitted value.  They are of limited…

ROC, Lift and Gains Curves

There are various metrics for assessing the performance of a classification model.  It matters which one you use. The simplest is accuracy - the proportion of cases correctly classified.  In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric…

Kernel function

In a standard linear regression, a model is fit to a set of data (the training data); the same linear model applies to all the data.  In local regression methods, multiple models are fit to different neighborhoods of the data. A kernel function is used…
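
A common choice (an illustration - the post's own examples may differ) is the Gaussian kernel, which weights a training point x by its distance from the focal point x_0, with the bandwidth h controlling the size of the neighborhood:

    K(x, x_0) = \exp\left( -\frac{(x - x_0)^2}{2h^2} \right)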

Errors and Loss

Errors - differences between predicted values and actual values, also called residuals - are a key part of statistical models.  They form the raw material for various metrics of predictive model performance (accuracy, precision, recall, lift, etc.), and also the basis for diagnostics on descriptive…

Latin hypercube

In Monte Carlo sampling for simulation problems, random values are generated from a probability distribution deemed appropriate for a given scenario (uniform, Poisson, exponential, etc.).  In simple random sampling, each potential random value within the probability distribution has an equal chance of being selected. Just…
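
A minimal sketch with scipy (assumes scipy >= 1.7, which provides the qmc module):

    from scipy.stats import qmc

    # 5 Latin hypercube draws in 2 dimensions: each of the 5 equal-probability
    # bins on each axis is sampled exactly once
    sampler = qmc.LatinHypercube(d=2, seed=0)
    sample = sampler.random(n=5)  # array of shape (5, 2), values in [0, 1)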

Regularize

The art of statistics and data science lies, in part, in taking a real-world problem and converting it into a well-defined quantitative problem amenable to useful solution. At the technical end of things lies regularization. In data science this involves various methods of simplifying models,…

Intervals (confidence, prediction and tolerance)

All students of statistics encounter confidence intervals.  Confidence intervals tell you, roughly, the interval within which you can be, say, 95% confident that the true value of the quantity being estimated lies.  This is not the precise technical definition, but it is how people use the…

Lift, Uplift, Gains

There are various metrics for assessing how well a model does, and one favored by marketers is lift, which is particularly relevant for the portion of the records predicted to be most profitable, most likely to buy, etc. 

Probability

You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down.

Density

Density is a metric that describes how well-connected a network is.

Algorithms

We have an extensive statistical glossary and have been sending out a "word of the week" newsfeed for a number of years.  Take a look at the results.

Gittins Index

Consider the multi-arm bandit problem, where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x% of the first, where x < 100%).  The Gittins index is [...]

Cold Start Problem

There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history -

Autoregressive

Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
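
The simplest case is an AR(1) model (higher orders add further lagged terms):

    y_t = c + \phi_1 y_{t-1} + \varepsilon_t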

Tensor

A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor). 

Confusing Terms in Data Science – A Look at Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing synonyms, like these:

Confusing Terms in Data Science – A Look at Homonyms and more

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these: 

Jaccard’s coefficient

When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records.  One of them is the "yacht owner" problem.

Rectangular data

Rectangular data are the staple of statistical and machine learning models.  Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.

Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample.  It can occur in numerous situations; here are just a few:

Likert Scale

A "likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale.  For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly."  Two key decisions the survey designer faces are

  • How many gradients to allow, and

  • Whether to include a neutral midpoint

Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
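
A minimal sketch with pandas (the column name is hypothetical):

    import pandas as pd

    policies = pd.DataFrame({"policy_type": ["residential", "commercial", "automotive"]})
    dummies = pd.get_dummies(policies["policy_type"])  # one 0/1 column per category

In a regression model, one of the three is typically dropped, since its value is fully determined by the other two.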

Curbstoning

Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb.  With a pretense of being an individual selling a car on his or her own, and with no fixed…

Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2.  However, …

Conditional Probability Word of the Week

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
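
Worked through with Bayes' rule, using the numbers above (not the post's own solution):

    P(\text{fraud} \mid \text{flagged}) = \frac{0.9 \times 0.1}{0.9 \times 0.1 + 0.2 \times 0.9} = \frac{0.09}{0.27} = \frac{1}{3}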

Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s.  For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:
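
The standard construction plots the true-positive rate (sensitivity) against the false-positive rate (1 - specificity) across classification thresholds; a minimal scikit-learn sketch, on toy data:

    from sklearn.metrics import roc_curve

    y_true  = [0, 0, 1, 1]           # actual classes
    y_score = [0.1, 0.4, 0.35, 0.8]  # model scores
    fpr, tpr, thresholds = roc_curve(y_true, y_score)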

Prospective vs. Retrospective

A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…

“out-of-bag,” as in “out-of-bag error”

"Bag" refers to "bootstrap aggregating," repeatedly drawing of bootstrap samples from a dataset and aggregating the results of statistical models applied to the bootstrap samples. (A bootstrap sample is a resample drawn with replacement.)

BOOTSTRAP

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
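
In code form (a minimal numpy sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.array([3, 7, 1, 9, 4])
    # one bootstrap resample: same size as the data, drawn with replacement
    boot = rng.choice(data, size=len(data), replace=True)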

Same thing, different terms…

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.

BENFORD’S LAW

Benford's law describes an expected distribution of the first digit in many naturally-occurring datasets.
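
The law gives the probability that the first digit is d:

    P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, \ldots, 9

so a leading 1 occurs about 30% of the time, a leading 9 only about 4.6%.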

HYPERPARAMETER

Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.

SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

SPLINE

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables. 

NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

OVERFIT

As applied to statistical models, "overfit" means the model fits its training data too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:
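
A minimal numpy sketch of the same idea (the figure itself is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = x + rng.normal(0, 0.2, size=10)  # linear signal plus noise

    overfit = np.polyfit(x, y, 9)  # degree 9 through 10 points: zero error, fits the noise
    honest  = np.polyfit(x, y, 1)  # the underlying linear relationship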

Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…

Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e, where y is the dependent variable, x1 and x2…

Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…

Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…
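
A minimal scipy sketch (toy dimensions):

    from scipy.sparse import csr_matrix

    # 3 searches (rows) x 4 search terms (columns); only 3 of 12 cells are nonzero
    dense = [[0, 1, 0, 0],
             [1, 0, 0, 1],
             [0, 0, 0, 0]]
    sparse = csr_matrix(dense)  # stores only the nonzero entries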

Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…

Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…

Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

Week #42 – Kruskal-Wallis Test

The Kruskal-Wallis test is a nonparametric test of whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.
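
A minimal scipy sketch (toy data):

    from scipy.stats import kruskal

    group_a = [2.1, 3.4, 2.8]
    group_b = [3.9, 4.2, 3.7]
    group_c = [1.8, 2.2, 2.5]
    stat, p = kruskal(group_a, group_b, group_c)  # H statistic and p-value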

Week #32 – False Discovery Rate

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, not significant (a Type-I error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).

Week #23 – Netflix Contest

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available.  Individuals and teams then compete to develop the best performing model.

Week #20 – R

This week's word is actually a letter.  R is a statistical computing and programming language and environment - an open-source implementation of the S language developed at Bell Labs, of which the commercial S-PLUS program was another derivative.

Week #16 – Moving Average

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
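
A minimal pandas sketch (toy series, w = 3):

    import pandas as pd

    series = pd.Series([10, 12, 11, 13, 15, 14])
    # the forecast for time t is the mean of the 3 values ending at t-1
    forecast = series.rolling(window=3).mean().shift(1)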

Week #15 – Interaction term

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.
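
In equation form, the interaction appears as a product term, with b3 capturing the joint effect:

    y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 + e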

Week #14 – Naive forecast

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).
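
A common example is simply carrying the last observed value forward:

    \hat{y}_{t+1} = y_t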

Week #9 – Overdispersion

In discrete response models, overdispersion occurs when there is more correlation in the data than is allowed by the assumptions that the model makes.

Week #8 – Confusion matrix

In a classification model, the confusion matrix shows the counts of correct and erroneous classifications.  In a binary classification problem, the matrix consists of 4 cells.
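
A minimal scikit-learn sketch (toy labels):

    from sklearn.metrics import confusion_matrix

    actual    = [1, 0, 1, 1, 0, 0]
    predicted = [1, 0, 0, 1, 0, 1]
    cm = confusion_matrix(actual, predicted)
    # rows are actual class, columns predicted:
    # [[TN, FP],
    #  [FN, TP]]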

Week #5 – Features vs. Variables

The predictors in a predictive model are sometimes given different terms by different disciplines.  Traditional statisticians think in terms of variables.

Week # 52 – Quasi-experiment

In social science research, particularly in the qualitative literature on program evaluation, the term "quasi-experiment" refers to studies that do not involve the application of treatments via random assignment of subjects.

Week #48 – Structured vs. unstructured data

Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).

Week #39 – Censoring

Censoring in time-to-event (survival) data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

Week #32 – Predictive modeling

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).

Week #29 – Goodness-of-fit

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which
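
One common such measure (an illustration - the post may discuss others) is the chi-square statistic, with O the observed counts and E the counts expected under the theoretical distribution:

    \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}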

Week #53 – Effect size

In a study or experiment with two groups (usually control and treatment), the investigator typically has in mind the magnitude of the difference between the two groups that he or she wants to be able to detect in a hypothesis test.

Week #51 – Type 1 error

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not.

Week #49 – Data partitioning

Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).

Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual.  A simple...

Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

Week # 29 – Training data

Also called the training sample, training set, calibration sample.  The context is predictive modeling (also called supervised data mining) -  where you have data with multiple predictor variables and a single known outcome or target variable.

Churn Trigger

Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.
