100 years of variance

It is 100 years since R. A. Fisher introduced the concept of "variance" (in his 1918 paper "The Correlation Between Relatives on the Supposition of Mendelian Inheritance"). There is much that statistics has given us in the century that followed. Randomized clinical trials, and the means…


Early Data Scientists

Casting back long before the advent of deep learning for the "founding fathers" of data science, at first glance you would rule out antecedents who long predate the computer and data revolutions of the last quarter century. But some consider John Tukey, the Princeton statistician…


Python for Analytics

Python started out as a general-purpose language when it was created in 1991 by Guido van Rossum. It was embraced early on by Google founders Sergey Brin and Larry Page ("Python where we can, C++ where we must" was reputedly their mantra). In 2006,…


Course Spotlight: Deep Learning

Deep learning is essentially "neural networks on steroids" and it lies at the core of the most intriguing and powerful applications of artificial intelligence. Facial recognition (which you encounter daily in Facebook and other social media) harnesses many levels of data science tools, including algorithms…


Course Spotlight: Structural Equation Modelling (SEM)

SEM stands for "structural equation modeling," and we are fortunate to have Prof. Randall Schumacker teaching this subject at Statistics.com. Randy created the Structural Equation Modeling (SEM) journal in 1994 and the Structural Equation Modeling Special Interest Group (SIG) at the American Educational Research Association…


Benford’s Law Applies to Online Social Networks

Fake social media accounts and Russian meddling in US elections have been in the news lately, with Mark Zuckerberg (Facebook founder) testifying this week before the US Congress. Dr. Jen Golbeck, who teaches Network Analysis at Statistics.com, published an ingenious way to determine whether a…


The Real Facebook Controversy

Cambridge Analytica's wholesale scraping of Facebook user data is big news now, and people are shocked that personal data is being shared and traded on a massive scale on the internet. But the real issue with social media is not the harm to individual users whose…


Masters Programs versus an Online Certificate in Data Science from Statistics.com

We just attended the INFORMS (the Institute for Operations Research and the Management Sciences) analytics conference this week in Baltimore, which included a special meeting for directors of academic analytics programs to better align what universities are producing with what industry is seeking.…


Course Spotlight: Likert scale assessment surveys

Do you work with multiple choice tests, or Likert scale assessment surveys? Rasch methods help you construct linear measures from these forms of scored observations and analyze the results from such surveys and tests. In the course "Practical Rasch Measurement - Core Topics," you will…


Course Spotlight: Customer Analytics in R

"The customer is always right" was the motto Selfridge's department store coined in 1909. "We'll tell the customer what they want" was Madison Avenue's mantra starting in the 1950's. Now data scientists like Karolis Urbonas help companies like Amazon (where he works in Europe as…


Course Spotlight: Spatial Statistics Using R

Have you ever needed to analyze data with a spatial component? Geographic clusters of disease, crimes, animals, plants, events? Or describing the spatial variation of something, and perhaps correlating it with some other predictor? Assessing whether the geographic distribution of something departs from randomness? Location data…


“Money and Brains” and “Furs and Station Wagons”

"Money and Brains" and "Furs and Station Wagons" were evocative customer shorthands that the marketing company Claritas came up with over a half century ago. These names, which facilitated the work of marketers and sales people, were shorthand descriptions of segments of customers identified through…


Course Spotlight: Text Mining

The term text mining is used with two different meanings in computational statistics: Using predictive modeling to label many documents (e.g. legal docs might be "relevant" or "not relevant") - this is what we call text mining. Using grammar and syntax to parse the…



Benford's Law

Benford's law describes an expected distribution of the first digit in many naturally-occurring datasets.
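Benford's expected first-digit frequencies are simple to compute, and comparing them against the tallied first digits of a dataset is the usual check. A minimal Python sketch (helper names are hypothetical):

```python
import math
from collections import Counter

def benford_expected(d):
    # Benford's law: P(first digit = d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)

def first_digit_counts(values):
    # Tally the leading nonzero digit of each value
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values]
    return Counter(digits)
```

For example, benford_expected(1) is about 0.301, so roughly 30% of first digits in a Benford-distributed dataset should be 1.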



Contingency Tables

Contingency tables are tables of counts of events or things, cross-tabulated by row and column.



Hyperparameter

"Hyperparameter" is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, where it refers to parameters of the prior distribution.



Sample

Why sample? A while ago, "sample" would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.




Spline

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables.



NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.



Overfit

As applied to statistical models, "overfit" means the model fits the training data too closely - fitting noise, not signal. For example, a complex polynomial curve may fit a dataset with no error, but you would not want to rely on it to predict accurately for new data.


Quotes about Data Science

“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former CEO, Hewlett-Packard Co., speech given at Oracle OpenWorld. “Data is the new science. Big data holds the answers.” – Pat Gelsinger, CEO, EMC, "Big Bets on Big Data," Forbes. “Hiding within those…


Week #24 – Logit

Logit is a nonlinear function of probability. If p is the probability of an event, then the corresponding logit is given by the formula: logit(p) = log(p / (1 - p)). Logit is widely used to construct statistical models, for example in logistic regression.
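The formula translates directly into a one-line Python function (using the natural log, as is conventional):

```python
import math

def logit(p):
    # logit(p) = log(p / (1 - p)), defined for 0 < p < 1
    return math.log(p / (1 - p))
```

Note that logit(0.5) = 0, and the function is antisymmetric about p = 0.5: logit(1 - p) = -logit(p).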


Week #23 – Intraobserver Reliability

Intraobserver reliability indicates how stable responses from the same respondent are at different time points. The greater the difference between the responses, the smaller the intraobserver reliability of the survey instrument. The correlation coefficient between the responses obtained at different time points from the same respondent is often…


Week #22 – Independent Events

Two events A and B are said to be independent if P(A∩B) = P(A)·P(B). To put it differently, events A and B are independent if the occurrence or non-occurrence of A does not influence the occurrence or non-occurrence of B, and vice versa. For example, if…
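The definition can be checked directly by enumerating a small sample space. A Python sketch using two fair dice (the events are chosen purely for illustration):

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))    # all 36 outcomes of two dice
A = {s for s in space if s[0] == 1}             # first die shows 1
B = {s for s in space if s[0] + s[1] == 7}      # the two dice sum to 7

def prob(event):
    # Probability under the uniform distribution on the sample space
    return Fraction(len(event), len(space))

print(prob(A & B) == prob(A) * prob(B))  # True: A and B are independent
```

Here P(A) = P(B) = 1/6 and P(A∩B) = 1/36 = (1/6)·(1/6), so the product rule holds exactly.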


Week #21 – Residuals

Residuals are differences between the observed values and the values predicted by some model. Analysis of residuals allows you to estimate the adequacy of a model for particular data; it is widely used in regression analysis. 
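In code, residuals are just the pairwise differences between observed and predicted values; a minimal Python sketch:

```python
def residuals(observed, predicted):
    # Residual = observed value minus model-predicted value
    return [y - yhat for y, yhat in zip(observed, predicted)]
```

A model that fits well leaves residuals scattered around zero with no visible pattern.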


Week #20 – Concurrent Validity

The concurrent validity of survey instruments, like the tests used in psychometrics, is a measure of agreement between the results obtained by the given survey instrument and the results obtained for the same population by another instrument acknowledged as the "gold standard". The concurrent validity is often quantified by the correlation…


Week #19 – Normality

Normality is a property of a random variable that is distributed according to the normal distribution. Normality plays a central role in both theoretical and practical statistics: a great number of theoretical statistical methods rest on the assumption that the data, or test statistics derived from…


Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.


Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…


Week #16 – Weighted Kappa

Weighted kappa is a measure of agreement for categorical data. It is a generalization of the kappa statistic to situations in which the categories are not equal in some respect - that is, weighted by an objective or subjective function.


Historical Spotlight: Eugenics – journey to the dark side at the dawn of statistics

April 27 marks the 80th anniversary of the death of Karl Pearson, who contributed to statistics the correlation coefficient, principal components, the (increasingly-maligned) p-value, and much more. Pearson was one of a trio of founding fathers of modern statistics, the others being Francis Galton and…


Week #15 – Rank Correlation Coefficient

Rank correlation is a method of finding the degree of association between two variables. The calculation for the rank correlation coefficient is the same as that for the Pearson correlation coefficient, but it is calculated using the ranks of the observations rather than their numerical values. This…


Week #14 – Manifest Variable

In latent variable models, a manifest variable (or indicator) is an observable variable - i.e. a variable that can be measured directly. A manifest variable can be continuous or categorical. The opposite concept is the latent variable.


Week #13 – Fisher's Exact Test

Fisher's exact test is, historically, the first permutation test. It is used with two samples of binary data, and tests the null hypothesis that the two samples are drawn from populations with equal but unknown proportions of "successes" (e.g. proportion of patients recovered without complications…
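With the table margins held fixed, the counts follow a hypergeometric distribution, so the one-sided p-value can be computed exactly by summing probabilities of tables at least as extreme as the observed one. A minimal Python sketch (function name hypothetical):

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    # One-sided p-value for the 2x2 table [[a, b], [c, d]]:
    # sum hypergeometric probabilities of tables at least as
    # extreme (first cell >= a) with the same margins.
    n = a + b + c + d
    def p_table(x):
        # Probability of x "successes" in the first row, margins fixed
        return comb(a + b, x) * comb(c + d, (a + c) - x) / comb(n, a + c)
    x_max = min(a + b, a + c)
    return sum(p_table(x) for x in range(a, x_max + 1))
```

For Fisher's famous lady-tasting-tea table [[3, 1], [1, 3]], this gives p = 17/70 ≈ 0.243.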


Week #11 – Posterior Probability

Posterior probability is a revised probability that takes into account new available information. For example, let there be two urns, urn A having 5 black balls and 10 red balls and urn B having 10 black balls and 5 red balls. Now if an urn…
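The urn example can be worked through with Bayes' rule. A Python sketch, assuming (since the excerpt is cut off) that the new information is a black ball drawn from the randomly chosen urn:

```python
from fractions import Fraction

prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}          # urn chosen at random
likelihood = {"A": Fraction(5, 15), "B": Fraction(10, 15)}  # P(black ball | urn)

# Bayes' rule: posterior is proportional to prior x likelihood
evidence = sum(prior[u] * likelihood[u] for u in prior)
posterior = {u: prior[u] * likelihood[u] / evidence for u in prior}
```

Drawing a black ball shifts belief toward urn B (which holds more black balls): the posterior is 1/3 for urn A and 2/3 for urn B.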


Week #4 – Loss Function

A loss function specifies a penalty for an incorrect estimate from a statistical model. Typical loss functions might specify the penalty as a function of the difference between the estimate and the true value, or simply as a binary value depending on whether the estimate…


Week #3 – Endogenous Variable:

Endogenous variables in causal modeling are the variables with causal links (arrows) leading to them from other variables in the model. In other words, endogenous variables have explicit causes within the model. The concept of endogenous variable is fundamental in path analysis and structural equation…


Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…


Week #1 – Nonstationary time series

A time series x_t is said to be nonstationary if its statistical properties depend on time. The opposite concept is a stationary time series. Most real-world time series are nonstationary. An example of a nonstationary time series is a record of readings of the…


Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…


Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…


Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…


Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…
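The statistical/ML sense of the term amounts to computing z-scores; a minimal Python sketch:

```python
import statistics

def normalize(xs):
    # z-score normalization: subtract the mean, divide by the sample std dev
    mu = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]
```

After normalization the values have mean 0 and standard deviation 1, putting variables measured on different scales onto a common footing.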


Week #6 – Kolmogorov-Smirnov One-sample Test

The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test: it tests whether an observed dataset is consistent with a hypothesized theoretical distribution. The test involves specifying the cumulative frequency distribution that would occur given the theoretical distribution and comparing it with the observed cumulative frequency distribution.
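The test statistic D is the largest vertical gap between the empirical and theoretical CDFs. A bare-bones Python sketch for a standard-normal null hypothesis (helper names hypothetical; the p-value step, via the Kolmogorov distribution, is omitted):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal CDF computed via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic(sample, cdf=normal_cdf):
    # D = sup |empirical CDF - theoretical CDF|, checked just
    # before and just after each jump of the empirical CDF
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d
```

Small D is consistent with the hypothesized distribution; large D is evidence against it.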


Week #5 – Cohort Data

Cohort data records multiple observations over time for a set of individuals or units tied together by some event (say, born in the same year). See also longitudinal data and panel data.


Week #50 – Six-Sigma

Six sigma means literally six standard deviations. The phrase refers to the limits drawn on statistical process control charts used to plot statistics from samples taken regularly from a production process. Consider the process mean. A process is deemed to be "in control" at any…


Week #47 – Psychometrics

Psychometrics or psychological testing is concerned with quantification (measurement) of human characteristics, behavior, performance, health, etc., as well as with design and analysis of studies based on such measurements. An example of the problems being solved in psychometrics is the measurement of intelligence via "IQ"…


Week #46 – Azure ML

Azure is the Microsoft Cloud Computing Platform and Services.  ML stands for Machine Learning, and is one of the services.  Like other cloud computing services, you purchase it on a metered basis - as of 2015, there was a per-prediction charge, and a compute time…


Week #45 – Ordered categorical data

Categorical variables are non-numeric "category" variables, e.g. color.  Ordered categorical variables are category variables that have a quantitative dimension that can be ordered but is not on a regular scale.  Doctors rate pain on a scale of 1 to 10 - a "2" has no…


Week #44 – Bimodal

Bimodal literally means "two modes" and is typically used to describe distributions of values that have two centers.  For example, the distribution of heights in a sample of adults might have two peaks, one for women and one for men.  


Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.


Week #42 – Kruskal-Wallis Test

The Kruskal-Wallis test is a nonparametric test for determining whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.


Week #41 – Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical technique for inferring whether three or more samples might come from populations having the same mean - specifically, whether the differences among the samples might be caused by chance variation.


Week #40 – Two-Tailed Test

A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the observed sample statistic is more extreme than the critical value in either direction (higher than the positive critical value or lower than the negative critical value). A two-tailed test…


Week #39 – Split-Halves Method

In psychometric surveys, the split-halves method is used to measure the internal consistency reliability of survey instruments, e.g. psychological tests. The idea is to split the items (questions) related to the same construct to be measured, e.g. the anxiety level, and to compare the results…


Week #38 – Life Tables

In survival analysis, life tables summarize lifetime data or, generally speaking, time-to-event data. Rows in a life table usually correspond to time intervals, columns to the following categories: (i) not "failed", (ii) "failed", (iii) censored (withdrawn), and the sum of the three called "the number…


Week #37 – Truncation

Truncation, generally speaking, means to shorten. In statistics it can mean the process of limiting consideration or analysis to data that meet certain criteria (for example, the patients still alive at a certain point). Or it can refer to a data distribution where values above…


Week #36 – Tukey's HSD (Honestly Significant Differences) Test

This test is used for testing the significance of unplanned pairwise comparisons. When you do multiple significance tests, the chance of finding a "significant" difference just by chance increases. Tukey's HSD test is one of several methods of ensuring that the chance of finding a…


Week #35 – Robust Filter

A robust filter is a filter that is not sensitive to input noise values with extremely large magnitude (e.g. those arising due to anomalous measurement errors). The median filter is an example of a robust filter. Linear filters are not robust - their output may…


Week #34 – Hypothesis Testing

Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses - the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted H1). Hypothesis testing, in a formal logic sense, rests on the presumption of validity of the null hypothesis - that is, the null hypothesis is rejected only if the data at hand testify strongly enough against it.


Week #33 – Kurtosis

Kurtosis measures the "heaviness of the tails" of a distribution (compared to a normal distribution). Kurtosis is positive if the tails are "heavier" than for a normal distribution, and negative if the tails are "lighter" than for a normal distribution. The normal distribution has kurtosis of zero.
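In sample terms, kurtosis in this "excess" form (normal distribution at zero) is the fourth standardized moment minus 3. A plain-Python sketch:

```python
def excess_kurtosis(xs):
    # Fourth standardized moment minus 3 (so a normal distribution scores 0)
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3
```

A two-point sample like [-1, 1, -1, 1] has the lightest possible tails and scores -2; heavy-tailed data score above 0.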


Week #32 – False Discovery Rate

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that, in reality, reflect no true effect (Type I errors). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).


Week #24 – Edge

In a network analysis context, "edge" refers to a link or connection between two entities in a network.


Week #23 – Netflix Contest

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available.  Individuals and teams then compete to develop the best performing model.


Week #22 – Splines

The linear model is ubiquitous in classical statistics, yet real-life data rarely follow a purely linear pattern.


Week #20 – R

This week's word is actually a letter.  R is a statistical computing and programming language and environment, an open-source implementation of the S language from Bell Labs (the commercial S-PLUS program is another offshoot of S).


Week #19 – Prediction vs. Explanation

With the advent of Big Data and data mining, statistical methods like regression and CART have been repurposed to use as tools in predictive modeling.


Week #17 – A-B Test

An A-B test is a classic statistical design in which individuals or subjects are randomly split into two groups and some intervention or treatment is applied.


Be Smarter Than Your Devices: Learn About Big Data

When Apple CEO Tim Cook finally unveiled his company's new Apple Watch in a widely-publicized rollout earlier this month, most of the press coverage centered on its cost ($349 to start) and whether it would be as popular among consumers as the iPod or iMac. Nitin Indurkhya saw…


Week #16 – Moving Average

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
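That forecasting rule translates into a short Python sketch (function name hypothetical):

```python
def moving_average_forecast(series, w):
    # Forecast for time t = mean of the w observations ending at t-1
    return [sum(series[t - w:t]) / w for t in range(w, len(series) + 1)]

print(moving_average_forecast([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5]
```

The last value in the returned list is the one-step-ahead forecast for the period just beyond the observed series.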


Week #15 – Interaction term

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.


Week #14 – Naive forecast

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).


Week #13 – RMSE

RMSE is root mean squared error.  In predicting a numerical outcome with a statistical model, predicted values rarely match actual outcomes exactly.
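RMSE summarizes those mismatches in a single number, in the same units as the outcome variable; a minimal Python sketch:

```python
import math

def rmse(actual, predicted):
    # Square the errors, average them, then take the square root
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))
```

Lower RMSE means predictions that are, on average, closer to the actual outcomes.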


Week #12 – Label

A label is a category into which a record falls, usually in the context of predictive modeling.  Label, class and category are different names for discrete values of a target (outcome) variable.


Week #11 – Spark

Spark is a second generation computing environment that sits on top of a Hadoop system, supporting the workflows that leverage a distributed file system.


Week #10 – Bandits

Bandits refers to a class of algorithms in which users or subjects make repeated choices among, or decisions in reaction to, multiple alternatives.


Week #9 – Overdispersion

In discrete response models, overdispersion occurs when there is more correlation in the data than is allowed by the assumptions that the model makes.


Week #8 – Confusion matrix

In a classification model, the confusion matrix shows the counts of correct and erroneous classifications.  In a binary classification problem, the matrix consists of 4 cells.
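For a binary problem, those 4 cells are the true/false positives and true/false negatives; a compact Python sketch that tallies them:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    # Count each (actual, predicted) label pair
    return Counter(zip(actual, predicted))

cm = confusion_matrix([1, 0, 1, 1], [1, 0, 0, 1])
# cm[(1, 1)] true positives, cm[(0, 0)] true negatives,
# cm[(0, 1)] false positives, cm[(1, 0)] false negatives
```

Accuracy, sensitivity, specificity, and the like are all simple ratios of these cell counts.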


Week #7 – Multiple looks

In a classic statistical experiment, treatment(s) and placebo are applied to randomly assigned subjects, and, at the end of the experiment, outcomes are compared.


Week #6 – Pruning the tree

Classification and regression trees, applied to data with known values for an outcome variable, derive models with rules like "If taxable income <$80,000, if no Schedule C income, if standard deduction taken, then no-audit."


Week #5 – Features vs. Variables

The predictors in a predictive model are sometimes given different terms by different disciplines.  Traditional statisticians think in terms of variables.


Week #4 – Logistic Regression

In logistic regression, we seek to estimate the relationship between predictor variables Xi and a binary response variable.  Specifically, we want to estimate the probability p that the response variable will be a 0 or a 1.
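The estimated probability comes from applying the logistic (sigmoid) function to a linear combination of the predictors. A one-predictor Python sketch, assuming the coefficients b0 and b1 have already been estimated:

```python
import math

def predict_proba(x, b0, b1):
    # Logistic function: p = 1 / (1 + exp(-(b0 + b1*x))), always in (0, 1)
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(predict_proba(0.0, 0.0, 1.0))  # 0.5
```

Because the output is squeezed into (0, 1), it can be read directly as the probability that the response is 1.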


Week #3 – Prior and posterior

Bayesian statistics typically incorporates new information (e.g. from a diagnostic test, or a recently drawn sample) to answer a question of the form "What is the probability that..."
