
#### Bayesian Statistics

Bayesian statistics provides probability estimates of the true state of the world. An unremarkable statement, you might think - what else would statistics be for? But classical frequentist statistics, strictly speaking, provides only estimates of the state of a hothouse world, estimates that must be translated…

#### Density

As Covid-19 continues to spread, so will research on its behavior.  Models that rely mainly on time-series data will expand to cover other relevant predictors (covariates), and one such predictor will be gregariousness.  How to measure it?  In psychology there is the standard personality trait…

#### Parameterized

Parameterized code in computer programs (or visualizations or spreadsheets) is code where the arguments being operated on are defined once as parameters, at the beginning, so they do not have to be defined explicitly each time they are used in the body of the code.  This allows for…

#### Sensitivity and Specificity

We defined these terms already (see this blog), but how can you remember which is which, so you don’t have to look them up?  If you can remember the order in which to recite them - sensitivity, then specificity - it’s easy.  Think “positive and negative”…

#### Decision Stumps

A decision stump is a decision tree with just one decision, leading to two or more leaves. For example, in this decision stump a borrower score of 0.475 or greater leads to a classification of “loan will default” while a borrower score less than 0.475…
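
As a minimal sketch, a decision stump is a single threshold rule. Note that the label for the second leaf is an assumption here, since the excerpt above is truncated:

```python
# A decision stump: one decision, two or more leaves.  The 0.475
# threshold is from the example above; the "no default" label for the
# second leaf is assumed, not taken from the original post.

def stump_predict(score: float) -> str:
    if score >= 0.475:
        return "loan will default"
    return "loan will not default"  # assumed label for the other leaf

print(stump_predict(0.60))  # loan will default
print(stump_predict(0.30))  # loan will not default
```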

#### R0 (R-nought)

For infectious diseases, R0 (R-nought) is the unimpeded replication rate of the disease pathogen in a naive (not immune) population.  An R0 of 2 means that each person with the disease infects two others.  Some things to keep in mind: an R0 of one means…

#### Hazard

In biostatistics, hazard, or the hazard rate, is the instantaneous rate of an event (death, failure…).  It is the probability of the event occurring in a (vanishingly) small period of time, divided by the amount of time (mathematically it is the limit of this quantity…

#### Standardized Death Rate

Often the death rate for a disease is fully known only for a group where the disease has been well studied.  For example, the 3711 passengers on the Diamond Princess cruise ship are, to date, the most fully studied coronavirus population.  All passengers were tested…

#### Regularized Model

In building statistical and machine learning models, regularization is the addition of penalty terms on predictor coefficients to the model’s objective function, to discourage complex models that would otherwise overfit the data.  An example is ridge regression.

#### Ridge Regression

Ridge regression is a method of penalizing coefficients in a regression model to force a more parsimonious model (one with smaller, shrunken coefficients) than would be produced by an ordinary least squares model. The term “ridge” was applied by Arthur Hoerl in 1970, who saw similarities…
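
A rough sketch of the idea with scikit-learn, on synthetic data invented for illustration (the penalty strength `alpha` is a user choice; `alpha=0` would recover ordinary least squares):

```python
# Compare OLS coefficients with ridge coefficients on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the penalty strength

print("OLS:  ", ols.coef_.round(2))
print("Ridge:", ridge.coef_.round(2))  # shrunk toward zero
```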

#### Factor

The term “factor” has different meanings in statistics that can be confusing because they conflict.   In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical - a factor variable is the same thing as a categorical variable.  These factor variables…

#### Purity

In classification, purity measures the extent to which a group of records share the same class.  It is also termed class purity or homogeneity, and sometimes impurity is measured instead.  The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where…
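
A quick illustration of the p(1-p) form quoted above, in plain Python (the function name is ours):

```python
# Two-class Gini impurity as p(1-p), where p is the proportion of
# records in one class.  Impurity is 0 for a pure group and maximal
# at p = 0.5 (an even class split).

def gini_impurity(p: float) -> float:
    return p * (1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  impurity = {gini_impurity(p):.2f}")
```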

#### Predictor P-Values in Predictive Modeling

Not so useful: predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value - they measure the probability that a model fit to randomly shuffled data could have produced a coefficient as great as the fitted value.  They are of limited…
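
Here is a rough sketch of that shuffling interpretation for a one-predictor linear fit; the data and permutation count are invented for illustration:

```python
# How often does a fit to randomly shuffled outcomes produce a slope
# as large (in absolute value) as the one actually observed?
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.4 * x + rng.normal(size=100)  # true slope 0.4 plus noise

def slope(x, y):
    return np.polyfit(x, y, 1)[0]   # fitted slope of y on x

observed = slope(x, y)
shuffled = [slope(x, rng.permutation(y)) for _ in range(2000)]
p_value = np.mean(np.abs(shuffled) >= abs(observed))
print(f"slope {observed:.3f}, permutation p-value {p_value:.4f}")
```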

#### ROC, Lift and Gains Curves

There are various metrics for assessing the performance of a classification model.  It matters which one you use. The simplest is accuracy - the proportion of cases correctly classified.  In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric…

#### Kernel function

In a standard linear regression, a model is fit to a set of data (the training data); the same linear model applies to all the data.  In local regression methods, multiple models are fit to different neighborhoods of the data. A kernel function is used…

#### Errors and Loss

Errors - differences between predicted values and actual values, also called residuals - are a key part of statistical models.  They form the raw material for various metrics of predictive model performance (accuracy, precision, recall, lift, etc.), and also the basis for diagnostics on descriptive…

#### Latin hypercube

In Monte Carlo sampling for simulation problems, random values are generated from a probability distribution deemed appropriate for a given scenario (uniform, Poisson, exponential, etc.).  In simple random sampling, each potential random value within the probability distribution has an equal chance of being selected. Just…
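
A sketch contrasting the two schemes on the unit square, assuming SciPy's `qmc` module (SciPy 1.7 or later):

```python
# Latin hypercube sampling guarantees exactly one draw in each of n
# equal-probability strata per dimension; simple random sampling
# typically leaves some strata empty and others doubled up.
import numpy as np
from scipy.stats import qmc

n = 10
lhs = qmc.LatinHypercube(d=2, seed=0).random(n)
srs = np.random.default_rng(0).random((n, 2))

print(np.sort(np.floor(lhs[:, 0] * n).astype(int)))  # [0 1 2 ... 9]
print(np.sort(np.floor(srs[:, 0] * n).astype(int)))  # usually has gaps
```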

#### Regularize

The art of statistics and data science lies, in part, in taking a real-world problem and converting it into a well-defined quantitative problem amenable to useful solution. At the technical end of things lies regularization. In data science this involves various methods of simplifying models,…

#### Intervals (confidence, prediction and tolerance)

All students of statistics encounter confidence intervals.  Confidence intervals tell you, roughly, the interval within which you can be, say, 95% confident that the true value of some population parameter lies.  This is not the precise technical definition, but it is how people use the…

#### Lift, Uplift, Gains

There are various metrics for assessing how well a model does, and one favored by marketers is lift, which is particularly relevant for the portion of the records predicted to be most profitable, most likely to buy, etc.

#### Probability

You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down.

#### Density

Density is a metric that describes how well-connected a network is.

#### Algorithms

We have an extensive statistical glossary and have been sending out a "word of the week" newsfeed for a number of years.  Take a look at the results.

#### Gittins Index

Consider the multi-armed bandit problem, where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x% of the first, where x < 100%).  The Gittins index is…

#### Cold Start Problem

There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history…

#### Autoregressive

Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
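
A minimal illustration: fit an AR(1) coefficient by regressing a simulated series on its own prior value (in practice a time series library would be used, but the lag regression shows the idea):

```python
# Simulate y_t = 0.7 * y_{t-1} + noise, then recover the coefficient
# by regressing the series on its one-step lag.
import numpy as np

rng = np.random.default_rng(2)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()

phi = np.polyfit(y[:-1], y[1:], 1)[0]  # slope of y_t on y_{t-1}
print(f"estimated AR(1) coefficient: {phi:.3f}")  # close to 0.7
```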

#### Tensor

A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor).
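
In NumPy terms the progression is just an increasing number of array axes:

```python
# scalar > vector > matrix > tensor, all as ndarrays.
import numpy as np

scalar = np.array(3.0)         # 0 axes
vector = np.array([1.0, 2.0])  # 1 axis
matrix = np.ones((2, 3))       # 2 axes
tensor = np.ones((2, 3, 4))    # 3 axes, e.g. stacked image channels

for a in (scalar, vector, matrix, tensor):
    print(a.ndim, a.shape)
```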

#### Confusing Terms in Data Science – A Look at Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing  synonyms, like these:

#### Confusing Terms in Data Science – A Look at Homonyms and more

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these:

#### Jaccard’s coefficient

When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records.  One of them is the "yacht owner" problem.

#### Rectangular data

Rectangular data are the staple of statistical and machine learning models.  Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.

#### Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample.  It can occur in numerous situations; here are just a few:

#### Likert Scale

A "likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale.  For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly."  Two key decisions the survey designer faces are

• How many gradients to allow, and

• Whether to include a neutral midpoint

#### Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
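
A sketch using the insurance example above with pandas (`get_dummies` derives one 0/1 column per category):

```python
# Derive dummy variables from a multi-category policy-type variable.
import pandas as pd

policies = pd.DataFrame(
    {"type": ["residential", "commercial", "automotive", "residential"]}
)
dummies = pd.get_dummies(policies["type"], dtype=int)
print(dummies)  # one 0/1 indicator column per policy type
```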

#### Curbstoning

Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb.  With a pretense of being an individual selling a car on his or her own, and with no fixed…

#### Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2.  However, …

#### Conditional Probability

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
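
The question can be worked with Bayes' rule; the arithmetic below follows directly from the numbers stated in the question:

```python
# P(fraud | flagged) = P(flagged | fraud) P(fraud) / P(flagged)
p_fraud = 0.10                 # base rate: 1 in 10 claims is fraudulent
p_flag_given_fraud = 0.90      # 90% of fraudulent claims are flagged
p_flag_given_ok = 0.20         # 1 in 5 legitimate claims is mislabeled

p_flag = (p_flag_given_fraud * p_fraud
          + p_flag_given_ok * (1 - p_fraud))   # total P(flagged)
answer = p_flag_given_fraud * p_fraud / p_flag
print(f"P(fraud | flagged) = {answer:.3f}")    # 0.333 - only 1 in 3
```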

#### Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."

#### ROC Curve

The Receiver Operating Characteristic (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s.  For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:
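
A sketch of tracing the curve with scikit-learn (the labels and scores are invented for illustration):

```python
# roc_curve returns the false positive rate and true positive rate at
# each score threshold; AUC summarizes the whole curve.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 1, 0, 0, 0, 0, 0]                  # 1 = fraudulent
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]  # model scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, scores))
```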

#### Prospective vs. Retrospective

A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…

#### “out-of-bag,” as in “out-of-bag error”

"Bag" refers to "bootstrap aggregating," repeatedly drawing of bootstrap samples from a dataset and aggregating the results of statistical models applied to the bootstrap samples. (A bootstrap sample is a resample drawn with replacement.)

#### BOOTSTRAP

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
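
A minimal sketch in Python - here bootstrapping the standard error of a median, a statistic with no simple formula (the data are invented):

```python
# Draw 1000 resamples (with replacement, same size as the data) and
# look at the spread of the resampled medians.
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(size=200)

medians = [np.median(rng.choice(data, size=len(data), replace=True))
           for _ in range(1000)]
print(f"sample median: {np.median(data):.3f}")
print(f"bootstrap SE of the median: {np.std(medians):.3f}")
```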

#### Same thing, different terms..

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.

#### CONVOLUTION and TENSOR

Today's Words of the Week are convolution and tensor, key components of deep learning.

#### BENFORD’S LAW

Benford's law describes an expected distribution of the first digit in many naturally-occurring datasets.
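
The expected frequency of leading digit d is log10(1 + 1/d), which a few lines of Python reproduce:

```python
# Benford's law: digit 1 leads about 30% of the time, digit 9 under 5%.
import math

for d in range(1, 10):
    print(d, f"{math.log10(1 + 1 / d):.3f}")
```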

#### CONTINGENCY TABLES

Contingency tables are tables of counts of events or things, cross-tabulated by row and column.

#### HYPERPARAMETER

Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.

#### SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

#### SPLINE

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables.

#### NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

#### OVERFIT

As applied to statistical models, "overfit" means the model fits the training data too closely - it is fitting noise, not signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:

#### Week #24 – Logit

Logit is a nonlinear function of probability. If p is the probability of an event, then the corresponding logit is given by the formula: logit(p) = log[p / (1 - p)].  Logit is widely used to construct statistical models, for example in logistic regression.
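
The formula in code, together with its inverse (the logistic function):

```python
# logit maps a probability in (0, 1) to the whole real line; the
# logistic function maps it back.
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def logistic(x: float) -> float:
    return 1 / (1 + math.exp(-x))

print(logit(0.5))            # 0.0
print(logit(0.9))            # ~2.197
print(logistic(logit(0.9)))  # recovers 0.9
```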

#### Week #23 – Intraobserver Reliability

Intraobserver reliability measures the stability of responses obtained from the same respondent at different time points. The greater the difference between the responses, the smaller the intraobserver reliability of the survey instrument. The correlation coefficient between the responses obtained at different time points from the same respondent is often…

#### Week #22 – Independent Events

Two events A and B are said to be independent if P(A∩B) = P(A)·P(B). To put it differently, events A and B are independent if the occurrence or non-occurrence of A does not influence the occurrence or non-occurrence of B, and vice-versa. For example, if…

#### Week #21 – Residuals

Residuals are differences between the observed values and the values predicted by some model. Analysis of residuals allows you to estimate the adequacy of a model for particular data; it is widely used in regression analysis.

#### Week #20 – Concurrent Validity

The concurrent validity of survey instruments, like the tests used in psychometrics, is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard". The concurrent validity is often quantified by the correlation…

#### Week #19 – Normality

Normality is a property of a random variable that is distributed according to the normal distribution. Normality plays a central role in both theoretical and practical statistics: a great number of theoretical statistical methods rest on the assumption that the data, or test statistics derived from…

#### Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

#### Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…

#### Week #16 – Weighted Kappa

Weighted kappa is a measure of agreement for categorical data.  It is a generalization of the Kappa statistic to situations in which the categories are not equal in some respect - that is, weighted by an objective or subjective function.

#### Week #15 – Rank Correlation Coefficient

Rank correlation is a method of finding the degree of association between two variables. The calculation for the rank correlation coefficient is the same as that for the Pearson correlation coefficient, but uses the ranks of the observations rather than their numerical values. This…
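
A sketch of that equivalence with SciPy (the data are invented; note how the outlier in x barely affects the rank correlation):

```python
# Spearman's rank correlation equals Pearson correlation on the ranks.
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

x = np.array([1, 2, 3, 4, 5, 20])   # one outlier in x
y = np.array([2, 1, 4, 3, 6, 5])

rho, _ = spearmanr(x, y)
r, _ = pearsonr(rankdata(x), rankdata(y))
print(rho, r)  # identical values
```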

#### Week #14 – Manifest Variable

In latent variable models, a manifest variable (or indicator) is an observable variable - i.e. a variable that can be measured directly. A manifest variable can be continuous or categorical. The opposite concept is the latent variable.

#### Week #13 – Fisher’s Exact Test

Fisher’s exact test is, historically, the first permutation test. It is used with two samples of binary data, and tests the null hypothesis that the two samples are drawn from populations with equal but unknown proportions of "successes" (e.g. proportion of patients recovered without complications…
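
A sketch with SciPy's `fisher_exact` on an invented 2x2 table of recovery counts:

```python
# Rows are treatment groups; columns are recovered / not recovered.
from scipy.stats import fisher_exact

table = [[8, 2],   # group A: 8 recovered without complications, 2 not
         [4, 6]]   # group B: 4 recovered without complications, 6 not
oddsratio, p_value = fisher_exact(table)
print(f"odds ratio {oddsratio:.2f}, p-value {p_value:.3f}")
```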

#### Week #12 – Homoscedasticity

Homoscedasticity generally means equal variation of data, e.g. equal variance.

#### Week #11 – Posterior Probability

Posterior probability is a revised probability that takes into account new available information. For example, let there be two urns, urn A having 5 black balls and 10 red balls and urn B having 10 black balls and 5 red balls. Now if an urn…

#### Week #4 – Loss Function

A loss function specifies a penalty for an incorrect estimate from a statistical model. Typical loss functions might specify the penalty as a function of the difference between the estimate and the true value, or simply as a binary value depending on whether the estimate…
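
Two common choices sketched in Python (the example values are invented): squared error punishes large misses disproportionately, absolute error proportionally:

```python
# Squared-error and absolute-error loss for an estimate vs. the truth.

def squared_loss(estimate: float, truth: float) -> float:
    return (estimate - truth) ** 2

def absolute_loss(estimate: float, truth: float) -> float:
    return abs(estimate - truth)

for est in (9, 12, 20):   # true value is 10
    print(est, squared_loss(est, 10), absolute_loss(est, 10))
```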

#### Week #3 – Endogenous Variable

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model. In other words, endogenous variables have explicit causes within the model. The concept of endogenous variable is fundamental in path analysis and structural equation…

#### Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…

#### Week #1 – Nonstationary time series

A time series x_t is said to be nonstationary if its statistical properties depend on time. The opposite concept is a stationary time series. Most real-world time series are nonstationary. An example of a nonstationary time series is a record of readings of the…

#### Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…

#### Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…

#### Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…

#### Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…
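
In code, for a single variable (a minimal sketch of the statistics/machine learning sense):

```python
# Subtract the mean and divide by the standard deviation: the
# normalized variable has mean 0 and standard deviation 1.
import numpy as np

x = np.array([10.0, 12.0, 9.0, 15.0, 14.0])
z = (x - x.mean()) / x.std()
print(z.round(2), z.mean().round(2), z.std().round(2))
```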

#### Week #6 – Kolmogorov-Smirnov One-sample Test

The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test, and tests whether an observed dataset is consistent with a hypothesized theoretical distribution. The test involves specifying the cumulative frequency distribution which would occur given the theoretical distribution, and comparing that with the observed cumulative frequency distribution.

#### Week #5 – Cohort Data

Cohort data records multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.

#### Week #50 – Six-Sigma

Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any…

#### Week #47 – Psychometrics

Psychometrics or psychological testing is concerned with quantification (measurement) of human characteristics, behavior, performance, health, etc., as well as with design and analysis of studies based on such measurements. An example of the problems being solved in psychometrics is the measurement of intelligence via "IQ"…

#### Week #46 – Azure ML

Azure is the Microsoft Cloud Computing Platform and Services.  ML stands for Machine Learning, and is one of the services.  Like other cloud computing services, you purchase it on a metered basis - as of 2015, there was a per-prediction charge, and a compute time…

#### Week #45 – Ordered categorical data

Categorical variables are non-numeric "category" variables, e.g. color.  Ordered categorical variables are category variables that have a quantitative dimension that can be ordered but is not on a regular scale.  Doctors rate pain on a scale of 1 to 10 - a "2" has no…

#### Week #44 – Bimodal

Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers.  For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.

#### Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

#### Week #42 – Kruskal – Wallis Test

The Kruskal-Wallis test is a nonparametric test of whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.
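
A sketch with SciPy's `kruskal` on three invented samples:

```python
# Kruskal-Wallis: do the three independent samples come from
# populations with the same distribution?
from scipy.stats import kruskal

a = [6.1, 5.9, 7.2, 6.8]
b = [5.0, 5.5, 4.9, 5.3]
c = [6.0, 6.3, 5.8, 6.1]

stat, p_value = kruskal(a, b, c)
print(f"H = {stat:.2f}, p = {p_value:.3f}")
```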

#### Week #41 – Analysis of Variance (ANOVA)

A statistical technique for inferring whether three or more samples might come from populations having the same mean - specifically, whether the differences among the samples might be caused by chance variation.

#### Week #40 – Two-Tailed Test

A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the observed sample statistic is more extreme than the critical value in either direction (higher than the positive critical value or lower than the negative critical value). A two-tailed test…

#### Week #39 – Split-Halves Method

In psychometric surveys, the split-halves method is used to measure the internal consistency reliability of survey instruments, e.g. psychological tests. The idea is to split the items (questions) related to the same construct to be measured, e.g. the anxiety level, and to compare the results…

#### Week #38 – Life Tables

In survival analysis, life tables summarize lifetime data or, generally speaking, time-to-event data. Rows in a life table usually correspond to time intervals, columns to the following categories: (i) not "failed", (ii) "failed", (iii) censored (withdrawn), and the sum of the three called "the number…

#### Week #37 – Truncation

Truncation, generally speaking, means to shorten. In statistics it can mean the process of limiting consideration or analysis to data that meet certain criteria (for example, the patients still alive at a certain point). Or it can refer to a data distribution where values above…

#### Week #36 – Tukey’s HSD (Honestly Significant Differences) Test

This test is used for testing the significance of unplanned pairwise comparisons. When you do multiple significance tests, the chance of finding a "significant" difference just by chance increases. Tukey’s HSD test is one of several methods of ensuring that the chance of finding a…

#### Week #35 – Robust Filter

A robust filter is a filter that is not sensitive to input noise values with extremely large magnitude (e.g. those arising due to anomalous measurement errors). The median filter is an example of a robust filter. Linear filters are not robust - their output may…

#### Week #34 – Hypothesis Testing

Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses - the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted as H1). Hypothesis testing, in a formal logic sense, rests on the presumption of validity of the null hypothesis - that is, the null hypothesis is rejected only if the data at hand testify strongly enough against it.

#### Week #33 – Kurtosis

Kurtosis measures the "heaviness of the tails" of a distribution, compared to a normal distribution. Kurtosis is positive if the tails are "heavier" than those of a normal distribution, and negative if the tails are "lighter" than those of a normal distribution. Defined this way (as excess kurtosis), the normal distribution has kurtosis of zero.

#### Week #32 – False Discovery Rate

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, not significant (a Type-I error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).

#### Week #29 – Signal

The signal is the component of the observed data that carries useful information.

#### Week #28 – Non-parametric Regression

Non-parametric regression methods are aimed at describing a relationship between the dependent and independent variables...

#### Week #27 – Nominal scale

A nominal scale is really a list of categories into which objects can be classified.

#### Week #26 – Noise

The noise is the component of the observed data (e.g. of a time series) that is random and carries no useful information.
