Often the death rate for a disease is fully known only for a group where the disease has been well studied. For example, the 3711 passengers on the Diamond Princess cruise ship are, to date, the most fully studied coronavirus population. All passengers were tested and tracked by health authorities, and the death rate was… Continue reading “Standardized Death Rate”

# Category Archives: Word of the Week

## Regularized Model

In building statistical and machine learning models, regularization is the addition of penalty terms to predictor coefficients to discourage complex models that would otherwise overfit the data. An example is ridge regression.

## Ridge Regression

Ridge regression is a method of penalizing coefficients in a regression model to force a more parsimonious model (one with smaller, more stable coefficients) than would be produced by an ordinary least squares model. The term “ridge” was applied by Arthur Hoerl in 1970, who saw similarities to the ridges of quadratic response functions. In ordinary least… Continue reading “Ridge Regression”
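As a minimal sketch of the shrinkage idea, consider a single predictor with an unpenalized intercept. In that case the ridge slope has a closed form: the usual least squares numerator divided by the sum of squares plus the penalty. The function name `ridge_slope`, the penalty symbol `lam`, and the data are illustrative assumptions, not taken from the article.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate of the slope for one predictor.

    Minimizes sum((y - a - b*x)**2) + lam * b**2, with the intercept a
    left unpenalized. lam = 0 gives the ordinary least squares slope.
    """
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / (sxx + lam)


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_slope(x, y, lam), 3))
```

Larger values of `lam` pull the slope toward zero without ever setting it exactly to zero.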

## Factor

The term “factor” has different meanings in statistics that can be confusing because they conflict. In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical – a factor variable is the same thing as a categorical variable. These factor variables have levels, which are the same thing as categories (a… Continue reading “Factor”

## Purity

In classification, purity measures the extent to which a group of records share the same class. It is also termed class purity or homogeneity, and sometimes impurity is measured instead. The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where p = the proportion of records belonging to class 1. Continue reading “Purity”
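A small sketch of the calculation, using the common multi-class form of Gini impurity, 1 minus the sum of squared class proportions. For two classes this works out to 2p(1-p), which differs from the p(1-p) expression above only by a constant factor of 2 and so ranks groups identically. The function name `gini_impurity` is an illustrative assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a group of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure_group = [1, 1, 1, 1]    # every record in the same class
mixed_group = [1, 1, 0, 0]   # evenly split between two classes

print(gini_impurity(pure_group))
print(gini_impurity(mixed_group))
```

A perfectly pure group scores 0, and the score rises as the classes become more evenly mixed.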

## Predictor P-Values in Predictive Modeling

Not So Useful: Predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value – they measure the probability that a randomly shuffled model could have produced a coefficient as great as the fitted value. They are of limited utility in predictive modeling applications for various reasons: Software typically… Continue reading “Predictor P-Values in Predictive Modeling”

## ROC, Lift and Gains Curves

There are various metrics for assessing the performance of a classification model. It matters which one you use. The simplest is accuracy – the proportion of cases correctly classified. In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric falls short – high accuracy can be achieved by classifying… Continue reading “ROC, Lift and Gains Curves”
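One illustrative case of accuracy falling short with a rare class (not necessarily the one the full article describes): a model that never predicts “1” on data where only 2% of cases are “1”. The data and the helper names `accuracy` and `recall` are assumptions made for this sketch.

```python
def accuracy(actual, predicted):
    """Proportion of cases classified correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, positive=1):
    """Proportion of actual positives that the model identified."""
    preds_for_positives = [p for a, p in zip(actual, predicted) if a == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Made-up data: 2 cases of interest out of 100.
actual = [1] * 2 + [0] * 98
always_zero = [0] * 100  # a "model" that never predicts the rare class

print("accuracy:", accuracy(actual, always_zero))
print("recall:  ", recall(actual, always_zero))
```

The accuracy looks excellent even though the model finds none of the cases of interest, which is why metrics focused on the rare class are needed.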

## Kernel function

In a standard linear regression, a model is fit to a set of data (the training data); the same linear model applies to all the data. In local regression methods, multiple models are fit to different neighborhoods of the data. A kernel function is used to determine the contribution of the “neighborhood data” to the… Continue reading “Kernel function”

## Errors and Loss

Errors – differences between predicted values and actual values, also called residuals – are a key part of statistical models. They form the raw material for various metrics of predictive model performance (accuracy, precision, recall, lift, etc.), and also the basis for diagnostics on descriptive models. A related concept is loss, which is some function… Continue reading “Errors and Loss”

## Latin hypercube

In Monte Carlo sampling for simulation problems, random values are generated from a probability distribution deemed appropriate for a given scenario (uniform, Poisson, exponential, etc.). In simple random sampling, each potential random value within the probability distribution has an equal chance of being selected. Just due to the vagaries of random chance, clusters of similar… Continue reading “Latin hypercube”
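A minimal sketch of how Latin hypercube sampling avoids such clusters on the unit interval: each dimension is cut into as many equal-width strata as there are samples, one value is drawn from every stratum, and the strata are shuffled independently per dimension. The function name `latin_hypercube` and its parameters are illustrative assumptions.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=None):
    """Return n_samples points in [0, 1)^n_dims, one per stratum in each dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each stratum [i/n, (i+1)/n).
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


points = latin_hypercube(10, 2, seed=0)
for point in points:
    print(point)
```

To sample from a non-uniform distribution, each coordinate can be passed through that distribution's inverse cumulative distribution function.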

## Regularize

The art of statistics and data science lies, in part, in taking a real-world problem and converting it into a well-defined quantitative problem amenable to useful solution. At the technical end of things lies regularization. In data science this involves various methods of simplifying models, to minimize overfitting and better reveal underlying phenomena. Some examples… Continue reading “Regularize”

## Intervals (confidence, prediction and tolerance)

## Probability

You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down

## Density

Density is a metric that describes how well-connected a network is

## Algorithms

We have an extensive statistical glossary and have been sending out a “word of the week” newsfeed for a number of years. Take a look at the results

## Gittins Index

Consider the multi-armed bandit problem where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x% of the first, where x < 100%). The Gittins index is […]

## Cold Start Problem

There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history –

## Autoregressive

Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
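A minimal sketch of the simplest case, an AR(1) model, where the only predictor is the immediately preceding value: y[t] = c + phi * y[t-1]. The coefficients are estimated by ordinary least squares on the lagged series. The function names and the series are illustrative assumptions.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    lagged = series[:-1]   # predictor: prior values of the series
    current = series[1:]   # response: the values that followed them
    n = len(lagged)
    x_mean = sum(lagged) / n
    y_mean = sum(current) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(lagged, current))
    sxx = sum((x - x_mean) ** 2 for x in lagged)
    phi = sxy / sxx
    c = y_mean - phi * x_mean
    return c, phi


def forecast_next(series, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * series[-1]


# Made-up series that follows y[t] = 2 + 0.5 * y[t-1] exactly.
series = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_ar1(series)
print(c, phi, forecast_next(series, c, phi))
```

Higher-order AR models use several prior values as predictors in the same way.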

## Tensor

A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor).

## Confusing Terms in Data Science – A Look at Synonyms

To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing synonyms, like these: