Space Shuttle Explosion

In 1986, the U.S. space shuttle Challenger broke apart 73 seconds after launch. A later investigation found that the cause of the disaster was O-ring failure due to cold temperatures. The temperature at launch was 36 degrees Fahrenheit, colder than at any prior launch. The cold caused the…


Alaskan Generosity

People in Alaska are extraordinarily generous - that's what a predictive model showed, when applied to a charitable organization's donor list. A closer examination revealed a flaw - while the original data was for all 50 states, the model's training data for Alaska included donors,…


Why Analytics Projects Fail – 5 Reasons

With the news full of so many successes in the fields of analytics, machine learning and artificial intelligence, it is easy to lose sight of the high failure rate of analytics projects. McKinsey just came out with a report that only 8% of big companies…


Historical Spotlight – ISOQOL

25 years ago the International Society of Quality of Life Research was founded with a mission to advance the science of quality of life and related patient-centered outcomes in health research, care and policy. While focusing on quality of life (QOL) in healthcare may seem…


Likert Scale

A "Likert scale" is used in self-report rating surveys to allow respondents to express an opinion or assessment of something on a gradient scale.  For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly."  Two key decisions the survey designer faces are

  • How many gradients to allow, and

  • Whether to include a neutral midpoint


Football Analytics

Preparing for the Super Bowl: Your team is at midfield, you have the ball, and it's 4th down with 2 yards to go. Should you go for it? (Apologies in advance to our many readers, especially those outside the U.S., who are not aficionados of American football,…


Job Spotlight: Digital Marketer

A digital marketer handles a variety of tasks in online marketing - managing online advertising and search engine optimization (SEO), implementing tracking systems (e.g. to identify how a person came to a retailer), web development, preparing creatives, implementing tests, and, of course, analytics. There are…


Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
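The insurance example can be sketched in plain Python (the policy records below are invented; a pandas user would reach for `pd.get_dummies`):

```python
# Sketch: deriving dummy (0/1) variables from a multi-category variable.
# The category names follow the insurance example above; the records are made up.
policies = ["residential", "commercial", "automotive", "residential"]
categories = ["residential", "commercial", "automotive"]

# One dummy column per category: 1 if the case belongs to it, else 0.
dummies = {c: [1 if p == c else 0 for p in policies] for c in categories}

print(dummies["residential"])  # [1, 0, 0, 1]
```

Note that each record gets exactly one 1 across the three dummy columns.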


Things are Getting Better

In the visualization below, which line do you think represents the UN's forecast for the number of children in the world in the year 2100? Hans Rosling, in his book Factfulness, presents this chart and notes that in a sample of Norwegian teachers, only 9%…


Artificial Lawyers

Can statistical and machine learning methods replace lawyers? A host of entrepreneurs think so. Text mining and predictive modeling products are available now to predict case staffing requirements and perform automated document discovery, and natural language algorithms conduct…


Entity Resolution and Identifying Bad Guys

Earlier, we described how Jen Golbeck (who teaches Network Analysis) analyzed Facebook connections to identify fake accounts (the account holders' friends all had the same number of friends, which is highly improbable statistically). Network analysis and studying connections lie at the heart of…


Work and Heat

If you are working on New Year's Eve or New Year's Day, odds are it is from home, where you can (usually) control the temperature in the home. Which, from the standpoint of productivity, is a good thing. According to a study from Cornell, raising…



Curbstoning

Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb.  With a pretense of being an individual selling a car on his or her own, and with no fixed…


Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from one produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2.  However, …


The False Alarm Conundrum

False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch's new feature that detects atrial…


Conditional Probability

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
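The answer can be worked out with Bayes' rule using only the figures stated in the question; a minimal sketch:

```python
# Worked answer via Bayes' rule, restating the numbers given in the question.
p_fraud = 0.10             # prior: 1 in 10 claims is fraudulent
p_flag_given_fraud = 0.90  # sensitivity: 90% of frauds are flagged
p_flag_given_ok = 0.20     # false positive rate: 1 in 5 non-frauds flagged

# Total probability of a "fraud" classification:
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_ok * (1 - p_fraud)

# Probability the claim really is fraudulent, given a "fraud" classification:
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag

print(round(p_fraud_given_flag, 3))  # 0.333
```

The result, about 1/3, surprises many people: because non-fraud claims greatly outnumber fraudulent ones, most flagged claims are false alarms even with a 90% detection rate.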


Instructor Spotlight – David Kleinbaum

David Kleinbaum developed several courses, including Survival Analysis, Epidemiologic Statistics, and Designing Valid Statistical Studies. David retired a little over a year ago from Emory University, where he was a popular and effective teacher with the ability to distill and explain difficult statistical…


Book Review: ActivEpi

ActivEpi Web, by David Kleinbaum, is the text used in two courses (Epidemiology Statistics and Designing Valid Studies), but it is really a rich multimedia web-based presentation of epidemiological statistics, serving the role of a unique textbook format for an introductory course in the…



Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."


Survival Analysis

Convinced that he, like his father, would die in his 40's, Winston Churchill lived his early life in a frenetic hurry. He had participated in four wars on three continents by his mid-20's, served in multiple ministerial positions by his 30's, and published 12 books…


How Google Determines Which Ads you See

A classic machine learning task is to predict something's class, usually binary - pictures as dogs or cats, insurance claims as fraud or not, etc. Often the goal is not a final classification, but an estimate of the probability of belonging to a class (propensity),…


Job Spotlight: Data Scientist

Data science is one of a host of similar terms. Artificial intelligence has been around since the 1960's and data mining for at least a couple of decades. Machine learning came out of the computer science community, and analytics, data analytics, and predictive analytics came…


ROC Curve

The Receiver Operating Characteristics (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s.  For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:
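The two quantities are the true positive rate (sensitivity) and the false positive rate (1 - specificity), computed at each possible classification threshold. A minimal sketch with invented scores and labels:

```python
# Sketch: the (FPR, TPR) points an ROC curve plots, one per threshold.
# Scores and labels below are invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]   # 1 = fraud, 0 = not

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        # Classify as "fraud" every case scoring at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Sweeping the threshold from strict to lenient traces the curve from (0, 0) toward (1, 1); the closer the curve hugs the upper-left corner, the better the model separates the classes.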



Deming’s Funnel Problem

W. Edwards Deming's funnel problem is one of statistics' greatest hits. Deming was a noted statistician who took the statistical process control methods of Shewhart and expanded them into a holistic approach to manufacturing quality. Initially, his ideas were coolly received in the US and…


Industry Spotlight: the Auto Industry

The auto industry serves as a perfect exemplar of three key eras of statistics and data science in service of industry. Total Quality Management (TQM): First in Japan, and later in the U.S., the auto industry became an enthusiastic adherent of the Total Quality Management…


Analytics Professionals – Must They Be Good Communicators?

Most job ads in the technical arena list communication among the sought-after skills; it consistently outranks many programming and analytical skills. Is it for real, or is it just thrown in there by the HR Department on general principle? The founder of a leading analytics…


Prospective vs. Retrospective

A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…


The Evolution of Clinical Trials

Boiling oil versus egg yolks: One early clinical trial was accidental. In the 16th century, a common treatment for wounded soldiers was to pour boiling oil on their wounds. In 1537, the surgeon Ambroise Paré, attending French soldiers, ran out of oil one evening. He…


GE Regresses to the Mean

Thirty years ago, GE became the brightest star in the firmament of statistical ideas in business when it adopted Six Sigma methods of quality improvement. Those methods had been introduced by Motorola, but Jack Welch's embrace of the same methods at GE, a diverse manufacturing…


Examples of Bad Forecasting

In a couple of days, the Wall Street Journal will come out with its November survey of economists' forecasts. It's a particularly sensitive time, with elections in a few days and President Trump attacking the Federal Reserve for raising interest rates. It's a good time to…


Historical Spotlight: Risk Simulation – Since 1946

Simulation - a Venerable History: One of the most consequential and valuable analytical tools in business is simulation, which helps us make decisions in the face of uncertainty, such as these: An airline knows, on average, what proportion of ticketed passengers show up for a…


“out-of-bag,” as in “out-of-bag error”

"Bag" refers to "bootstrap aggregating": repeatedly drawing bootstrap samples from a dataset and aggregating the results of statistical models applied to those samples. (A bootstrap sample is a resample drawn with replacement.)
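A consequence of sampling with replacement is that each bootstrap sample leaves out, on average, about 1/e ≈ 36.8% of the records; those "out-of-bag" records can then serve as a built-in test set for the model fit to that sample. A minimal simulation sketch (the dataset size is arbitrary):

```python
# Sketch: which records are "out of bag" for one bootstrap sample.
import random

random.seed(1)
n = 10_000
# Draw n indices with replacement; the distinct ones are "in the bag".
in_bag = {random.randrange(n) for _ in range(n)}
out_of_bag = [i for i in range(n) if i not in in_bag]

print(len(out_of_bag) / n)  # roughly 0.368, i.e. about 1/e
```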



Bootstrap

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
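A minimal sketch of the procedure, using invented data and the sample mean as the statistic of interest:

```python
# Sketch: bootstrapping the mean - resample with replacement many times
# and look at the spread of the resampled statistic. Data are invented.
import random

random.seed(0)
data = [4, 8, 6, 5, 3, 7, 9, 5, 6, 4]

boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]   # with replacement
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
# A rough 90% interval for the mean, read off the bootstrap distribution:
print(boot_means[50], boot_means[949])
```

The spread of the 1,000 resampled means approximates the sampling variability of the mean, with no formula for the standard error required.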


Same thing, different terms..

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.



100 years of variance

It is 100 years since R. A. Fisher introduced the concept of "variance" (in his 1918 paper "The Correlation Between Relatives on the Supposition of Mendelian Inheritance"). There is much that statistics has given us in the century that followed. Randomized clinical trials, and the means…


Early Data Scientists

Casting back long before the advent of Deep Learning for the "founding fathers" of data science, at first glance you would rule out antecedents who long predate the computer and data revolutions of the last quarter century. But some consider John Tukey (right), the Princeton statistician…


Python for Analytics

Python started out as a general-purpose language when it was created in 1991 by Guido van Rossum. It was embraced early on by Google founders Sergey Brin and Larry Page ("Python where we can, C++ where we must" was reputedly their mantra). In 2006,…


Course Spotlight: Deep Learning

Deep learning is essentially "neural networks on steroids" and it lies at the core of the most intriguing and powerful applications of artificial intelligence. Facial recognition (which you encounter daily in Facebook and other social media) harnesses many levels of data science tools, including algorithms…


Course Spotlight: Structural Equation Modelling (SEM)

SEM stands for "structural equation modeling," and we are fortunate to have Prof. Randall Schumacker teaching this subject. Randy created the Structural Equation Modeling (SEM) journal in 1994 and the Structural Equation Modeling Special Interest Group (SIG) at the American Educational Research Association…


Benford’s Law Applies to Online Social Networks

Fake social media accounts and Russian meddling in US elections have been in the news lately, with Mark Zuckerberg (Facebook founder) testifying this week before the US Congress. Dr. Jen Golbeck, who teaches Network Analysis, published an ingenious way to determine whether a…


The Real Facebook Controversy

Cambridge Analytica's wholesale scraping of Facebook user data is big news now, and people are shocked that personal data is being shared and traded on a massive scale on the internet. But the real issue with social media is not the harm to individual users whose…


Masters Programs versus an Online Certificate in Data Science

We just attended the analytics conference of INFORMS (the Institute for Operations Research and the Management Sciences) this week in Baltimore, where they held a special meeting for directors of academic analytics programs to better align what universities are producing with what industry is seeking.…


Course Spotlight: Likert scale assessment surveys

Do you work with multiple choice tests, or Likert scale assessment surveys? Rasch methods help you construct linear measures from these forms of scored observations and analyze the results from such surveys and tests. "Practical Rasch Measurement - Core Topics" In this course, you will…


Course Spotlight: Customer Analytics in R

"The customer is always right" was the motto Selfridge's department store coined in 1909. "We'll tell the customer what they want" was Madison Avenue's mantra starting in the 1950's. Now data scientists like Karolis Urbonas help companies like Amazon (where he works in Europe as…


Course Spotlight: Spatial Statistics Using R

Have you ever needed to analyze data with a spatial component? Geographic clusters of disease, crimes, animals, plants, events? Or describing the spatial variation of something, and perhaps correlating it with some other predictor? Assessing whether the geographic distribution of something departs from randomness? Location data…


“Money and Brains” and “Furs and Station Wagons”

"Money and Brains" and "Furs and Station Wagons" were evocative customer shorthands that the marketing company Claritas came up with over a half century ago. These names, which facilitated the work of marketers and sales people, were shorthand descriptions of segments of customers identified through…


Course Spotlight: Text Mining

The term text mining is used with two different meanings in computational statistics: using predictive modeling to label many documents (e.g. legal docs might be "relevant" or "not relevant") - this is what we call text mining - and using grammar and syntax to parse the…



Benford's Law

Benford's law describes an expected distribution of the first digit in many naturally-occurring datasets.
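Concretely, the law says a leading digit d occurs with probability log10(1 + 1/d); a sketch of the expected distribution:

```python
# Sketch: the first-digit frequencies predicted by Benford's law.
from math import log10

benford = {d: log10(1 + 1 / d) for d in range(1, 10)}
for d, p in benford.items():
    print(d, f"{p:.3f}")
# The digit 1 leads about 30.1% of the time; the digit 9 only about 4.6%.
```

Comparing observed first-digit frequencies against these expected ones is the basis of Benford-style fraud and anomaly checks.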



Contingency Table

Contingency tables are tables of counts of events or things, cross-tabulated by row and column.
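A minimal sketch with invented categorical data, using only the Python standard library:

```python
# Sketch: cross-tabulated counts from two categorical variables.
# The variables (region, outcome) are invented for illustration.
from collections import Counter

region  = ["north", "north", "south", "south", "south", "north"]
outcome = ["yes",   "no",    "yes",   "yes",   "no",    "yes"]

# Each (row, column) cell holds the count of cases with that combination.
table = Counter(zip(region, outcome))

print(table[("north", "yes")])  # 2
print(table[("south", "no")])   # 1
```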



Hyperparameter

Hyperparameter is a term used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, where it refers to parameters of the prior distribution.



Sample

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.




Spline

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables.



NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.



Overfit

As applied to statistical models, "overfit" means the model fits the training data too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:
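The phenomenon can also be sketched numerically: a polynomial forced through every training point has zero training error but extrapolates badly. The data below are invented (roughly a straight line plus noise):

```python
# Sketch of overfitting: an interpolating polynomial has zero training
# error, yet predicts wildly for a new x just beyond the data.
def lagrange(xs, ys, x):
    """Evaluate the polynomial passing exactly through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.3, 1.8, 3.2, 3.9, 5.1]   # roughly y = x, plus noise

train_error = max(abs(lagrange(xs, ys, x) - y) for x, y in zip(xs, ys))
new_prediction = lagrange(xs, ys, 6)  # extrapolate one step past the data

print(train_error)     # essentially 0: the curve hits every training point
print(new_prediction)  # far above the ~6 a straight-line fit would give
```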


Quotes about Data Science

“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former CEO, Hewlett-Packard Co., speech given at Oracle OpenWorld. “Data is the new science. Big data holds the answers.” – Pat Gelsinger, CEO, EMC, Big Bets on Big Data, Forbes. “Hiding within those…


Week #24 – Logit

Logit is a nonlinear function of probability. If p is the probability of an event, then the corresponding logit is given by the formula: logit(p) = log(p / (1 - p)). The logit is widely used to construct statistical models, for example in logistic regression.
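In code, the logit and its inverse (the logistic function) are one-liners; a minimal sketch:

```python
# Sketch: the logit maps a probability in (0, 1) to the whole real line;
# its inverse, the logistic function, maps it back.
from math import log, exp

def logit(p):
    return log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + exp(-x))

print(logit(0.5))                       # 0.0 - a 50/50 event has logit zero
print(round(inv_logit(logit(0.8)), 6))  # 0.8 - the round trip recovers p
```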


Week #23 – Intraobserver Reliability

Intraobserver reliability indicates the stability of responses obtained from the same respondent at different time points. The greater the difference between the responses, the smaller the intraobserver reliability of the survey instrument. The correlation coefficient between the responses obtained at different time points from the same respondent is often…


Week #22 – Independent Events

Two events A and B are said to be independent if P(A ∩ B) = P(A)·P(B). To put it differently, events A and B are independent if the occurrence or non-occurrence of A does not influence the occurrence or non-occurrence of B, and vice versa. For example, if…


Week #21 – Residuals

Residuals are differences between the observed values and the values predicted by some model. Analysis of residuals allows you to estimate the adequacy of a model for particular data; it is widely used in regression analysis. 


Week #20 – Concurrent Validity

The concurrent validity of survey instruments, like the tests used in psychometrics, is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard". Concurrent validity is often quantified by the correlation…


Week #19 – Normality

Normality is a property of a random variable that is distributed according to the normal distribution. Normality plays a central role in both theoretical and practical statistics: a great number of theoretical statistical methods rest on the assumption that the data, or test statistics derived from…


Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.


Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…


Week #16 – Weighted Kappa

Weighted kappa is a measure of agreement for categorical data. It is a generalization of the kappa statistic to situations in which the categories are not equal in some respect - that is, they are weighted by an objective or subjective function.


Historical Spotlight: Eugenics – journey to the dark side at the dawn of statistics

April 27 marks the 80th anniversary of the death of Karl Pearson, who contributed to statistics the correlation coefficient, principal components, the (increasingly-maligned) p-value, and much more. Pearson was one of a trio of founding fathers of modern statistics, the others being Francis Galton and…


Week #15 – Rank Correlation Coefficient

Rank correlation is a method of finding the degree of association between two variables. The calculation for the rank correlation coefficient is the same as that for the Pearson correlation coefficient, but it is carried out using the ranks of the observations and not their numerical values. This…


Week #14 – Manifest Variable

In latent variable models, a manifest variable (or indicator) is an observable variable - i.e. a variable that can be measured directly. A manifest variable can be continuous or categorical. The opposite concept is the latent variable.


Week #13 – Fisher's Exact Test

Fisher's exact test is, historically, the first permutation test. It is used with two samples of binary data, and tests the null hypothesis that the two samples are drawn from populations with equal but unknown proportions of "successes" (e.g. proportion of patients recovered without complications…


Week #11 – Posterior Probability

Posterior probability is a revised probability that takes into account new available information. For example, let there be two urns, urn A having 5 black balls and 10 red balls and urn B having 10 black balls and 5 red balls. Now if an urn…


Week #4 – Loss Function

A loss function specifies a penalty for an incorrect estimate from a statistical model. Typical loss functions might specify the penalty as a function of the difference between the estimate and the true value, or simply as a binary value depending on whether the estimate…


Week #3 – Endogenous Variable

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model. In other words, endogenous variables have explicit causes within the model. The concept of endogenous variable is fundamental in path analysis and structural equation…


Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…


Week #1 – Nonstationary time series

A time series x_t is said to be nonstationary if its statistical properties depend on time. The opposite concept is a stationary time series. Most real-world time series are nonstationary. An example of a nonstationary time series is a record of readings of the…


Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…


Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…


Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…


Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…


Week #6 – Kolmogorov-Smirnov One-sample Test

The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test, and tests whether an observed dataset is consistent with an hypothesized theoretical distribution. The test involves specifying the cumulative frequency distribution which would occur given the theoretical distribution and comparing that with the observed cumulative frequency distribution.
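A minimal sketch of the statistic itself - the largest vertical gap between the empirical cumulative distribution and the hypothesized one - using invented data tested against a uniform distribution on [0, 1]:

```python
# Sketch: the Kolmogorov-Smirnov one-sample statistic, D = the largest
# gap between the ECDF of the data and a hypothesized CDF.
def ks_statistic(data, cdf):
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The ECDF steps from i/n to (i+1)/n at x; check the gap on both sides.
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d

# Invented sample, tested against Uniform(0, 1), whose CDF is F(x) = x.
sample = [0.05, 0.2, 0.25, 0.4, 0.5, 0.6, 0.7, 0.85, 0.9, 0.95]
print(ks_statistic(sample, lambda x: x))
```

A large D relative to the appropriate critical value leads to rejecting the hypothesized distribution.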


Week #5 – Cohort Data

Cohort data records multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.


Week #50 – Six-Sigma

Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any…


Week #47 – Psychometrics

Psychometrics or psychological testing is concerned with quantification (measurement) of human characteristics, behavior, performance, health, etc., as well as with design and analysis of studies based on such measurements. An example of the problems being solved in psychometrics is the measurement of intelligence via "IQ"…


Week #46 – Azure ML

Azure is the Microsoft Cloud Computing Platform and Services.  ML stands for Machine Learning, and is one of the services.  Like other cloud computing services, you purchase it on a metered basis - as of 2015, there was a per-prediction charge, and a compute time…


Week #45 – Ordered categorical data

Categorical variables are non-numeric "category" variables, e.g. color.  Ordered categorical variables are category variables that have a quantitative dimension that can be ordered but is not on a regular scale.  Doctors rate pain on a scale of 1 to 10 - a "2" has no…


Week #44 – Bimodal

Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers.  For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.  


Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.
