#### Puzzle – Gambler’s Ruin

Which is better - wealth or ability?  Fred Mosteller posed this question in his classic 1965 small compendium Fifty Challenging Problems in Probability, in the context of the Gambler’s Ruin puzzle.  Two players, M and N, engage in a game in which \$1 is transferred…

#### Dec 14: Statistics in Practice

In this week’s Briefing, we take a look at different strands of “purity” in AI. Our course spotlight is Jan 15 - Feb 12: Introduction to Data Literacy. It's for you or anyone you know who needs to get more numerate! See you or them in…

#### Oct 6: Statistics in Practice

In our Briefing this week, we take a look at unemployment insurance fraud and a statistical tool for catching the crooks. Our course spotlight is on: Oct 23 - Nov 20:  Spatial Statistics See you in class! Peter Bruce Founder, Author, and Senior Scientist Unemployment…

#### Famous Errors in Statistics

“A little knowledge is a dangerous thing,” said Alexander Pope in 1711; he could have been speaking of the use of statistics by experts in all fields. In this article, we look at three consequential mistakes in the field of statistics. Two of them are famous, the third required a deep dive into the corporate annual reports of

Several decades ago, the dominant therapies for lung cancer were radiation, which offered better short-term survival rates, and surgery, which offered better long-term rates. A thought experiment was conducted in which surgeons were randomly assigned to one of two groups and asked whether they would choose surgery. Group 1 was told: The one-month survival rate is 90%. Group 2 was told: There is 10% mortality in the first month. Yes, the two statements say the same thing. What did the two physician groups choose?

#### Sept 2: Statistics in Practice

This week, our topic is Data Engineering, and we feature a guest blog by Will Goodrum, a data scientist at Elder Research. Our course spotlight is Oct 2 -30: Categorical Data Analysis See you in class! Peter Bruce Founder, Author, and Senior Scientist Four Common…

#### Conversations with Data Scientists about R and Python

Dyed-in-the-wool software developers can get quite passionate about the relative virtues of one programming language or another, their debates sometimes threatening to transport you back to middle-school arguments about the greatest ballplayers of all time.  Though their computer passions find other outlets as well, data…

#### Apr 7: Statistics in Practice

In this week’s Brief, we look in greater detail at Elder Research, Inc., which recently acquired Statistics.com.  If your organization is like most organizations, your data science initiatives may lack the direction and support they need to succeed - having a data science team does…

#### Feb 10: Statistics in Practice

Tomorrow is the New Hampshire political primary in the US, and this week’s Brief looks at the statistical concept of lift.  Our spotlight is on: Feb 28 - Mar 27:   Persuasion Analytics and Targeting See you in class! - Peter Bruce, Founder Lift and…

#### Aug 2: Statistics in Practice

In Part 1 of this week’s Brief, we looked at political analytics; in Part 2 we extend that look to commercial domains. Our course spotlight is Persuasion Analytics, taught by Ken Strasma, who pioneered the use of statistical modeling to microtarget voters in the 2004…

#### Probability

You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down

#### Density

Density is a metric that describes how well-connected a network is
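As a rough illustration, density is commonly computed as the number of edges present divided by the number of edges possible. The sketch below assumes an undirected network with no self-loops; the function name and the example edges are hypothetical.

```python
def network_density(n_nodes, edges):
    """Share of possible undirected edges that are actually present."""
    possible = n_nodes * (n_nodes - 1) / 2
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return len(unique) / possible if possible else 0.0

# Hypothetical 4-node network with 3 of the 6 possible edges
edges = [("A", "B"), ("B", "C"), ("C", "D")]
density = network_density(4, edges)
print(density)
```

A value near 1 means almost every pair of nodes is directly connected; a value near 0 means very few are.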

#### Algorithms

We have an extensive statistical glossary and have been sending out a "word of the week" newsfeed for a number of years.  Take a look at the results

#### Gittins Index

Consider the multi-armed bandit problem where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x% of the first, where x < 100%).  The Gittins index is [...]

#### Cold Start Problem

There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history -

#### Autoregressive

Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
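To make "prior values as predictors" concrete, here is a small sketch that turns a series into (predictors, target) rows for an AR model with p lags. The function name and the toy series are hypothetical, and fitting the coefficients is left out.

```python
def lagged_rows(series, p):
    """Pair each value with the p values that immediately precede it."""
    return [(series[t - p:t], series[t]) for t in range(p, len(series))]

series = [1, 2, 3, 4, 5]  # hypothetical time series
rows = lagged_rows(series, 2)
for predictors, target in rows:
    print(predictors, "->", target)
```

Each row could then be fed to an ordinary regression routine, with the lagged values playing the role of the independent variables.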

#### Tensor

A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor).

#### Confusing Terms in Data Science – A Look at Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing  synonyms, like these:

#### Confusing Terms in Data Science – A Look at Homonyms and more Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these:

#### Jaccard’s coefficient

When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records.  One of them is the "yacht owner" problem.
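The sketch below compares Jaccard's coefficient with simple matching on two binary records. It assumes the "yacht owner" problem refers to shared absences (two people who both own no yacht) inflating similarity; the records and function names are hypothetical.

```python
def jaccard_coefficient(a, b):
    """Shared 1s divided by positions where at least one record has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0

def simple_matching(a, b):
    """Share of positions where the two records agree, 0-0 matches included."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

# Hypothetical ownership records: 1 = owns the item, 0 = does not
rec1 = [1, 0, 0, 0, 0, 0, 0, 1]
rec2 = [0, 0, 0, 0, 0, 0, 0, 1]

print(simple_matching(rec1, rec2))      # counts the six shared 0s as agreement
print(jaccard_coefficient(rec1, rec2))  # ignores the shared 0s
```

Because Jaccard's coefficient drops the 0-0 matches, two records look similar only when they share attributes that are actually present.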

#### Rectangular data

Rectangular data are the staple of statistical and machine learning models.  Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.

#### Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample.  It can occur in numerous situations; here are just a few:

#### Likert Scale

A "Likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale.  For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly."  Two key decisions the survey designer faces are

• How many gradients to allow, and

• Whether to include a neutral midpoint

#### Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
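A minimal sketch of the insurance example, assuming the three categories named above. The column names (`is_residential` and so on) and the sample policies are hypothetical.

```python
def make_dummies(values, categories):
    """One 0/1 indicator per category for each record."""
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

categories = ["residential", "commercial", "automotive"]
policy_types = ["residential", "commercial", "automotive", "commercial"]  # hypothetical

dummies = make_dummies(policy_types, categories)
for policy, row in zip(policy_types, dummies):
    print(policy, row)
```

Each record gets a 1 in exactly one of the three dummy variables and a 0 in the other two.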

#### Things are Getting Better

In the visualization below, which line do you think represents the UN's forecast for the number of children in the world in the year 2100? Hans Rosling, in his book Factfulness, presents this chart and notes that in a sample of Norwegian teachers, only 9%…

#### Conditional Probability Word of the Week

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
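One way to work the question is with Bayes' rule, using only the three rates given in the question. The variable names below are my own.

```python
p_fraud = 0.10        # one in ten claims is fraudulent
sensitivity = 0.90    # share of fraudulent claims the system flags
specificity = 0.80    # share of non-fraud claims the system clears

# Two ways a claim can end up labeled "fraud"
p_flag_and_fraud = p_fraud * sensitivity              # true positives
p_flag_and_legit = (1 - p_fraud) * (1 - specificity)  # false positives

p_fraud_given_flag = p_flag_and_fraud / (p_flag_and_fraud + p_flag_and_legit)
print(round(p_fraud_given_flag, 3))
```

In counts: out of 1,000 claims, 100 are fraudulent and 90 of those are flagged, while 180 of the 900 legitimate claims are flagged too. Only 90 of the 270 flagged claims, or one in three, are really fraudulent.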

#### Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."

#### “out-of-bag,” as in “out-of-bag error”

"Bag" refers to "bootstrap aggregating": repeatedly drawing bootstrap samples from a dataset and aggregating the results of statistical models applied to the bootstrap samples. (A bootstrap sample is a resample drawn with replacement.)

#### BOOTSTRAP

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
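A small sketch of the idea, using only Python's standard library and hypothetical data. It draws one resample with replacement, lists the records that resample happened to omit, and then repeats the draw many times to see how much the sample mean varies. The names and the choice of 1,000 repetitions are illustrative assumptions.

```python
import random
import statistics

def bootstrap_indices(n, rng):
    """Draw n row indices at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]  # hypothetical observations
rng = random.Random(42)  # fixed seed so the run is repeatable

# One bootstrap resample, plus the records it left out
idx = bootstrap_indices(len(data), rng)
drawn = set(idx)
resample = [data[i] for i in idx]
left_out = [data[i] for i in range(len(data)) if i not in drawn]

# Repeat many times to see how the mean varies from resample to resample
boot_means = []
for _ in range(1000):
    idx_b = bootstrap_indices(len(data), rng)
    boot_means.append(statistics.mean([data[i] for i in idx_b]))
boot_se = statistics.stdev(boot_means)

print(resample, left_out, round(boot_se, 3))
```

Because sampling is with replacement, some records usually appear more than once in a resample and others not at all.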

#### Same thing, different terms..

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.

#### CONVOLUTION and TENSOR

Today's Words of the Week are convolution and tensor, key components of deep learning.

#### CONTINGENCY TABLES

Contingency tables are tables of counts of events or things, cross-tabulated by row and column.

#### HYPERPARAMETER

Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.

#### SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

#### SPLINE

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables.

#### NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

#### OVERFIT

As applied to statistical models, "overfit" means the model fits the training data too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:

#### Week #18 – n

In statistics, "n" denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

#### Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…

#### Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…

#### Week #10 – Arm

In an experiment, an arm is a treatment protocol - for example, drug A, or placebo.   In medical trials, an arm corresponds to a patient group receiving a specified therapy.  The term is also relevant for bandit algorithms for web testing, where an arm consists…

#### Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued.  An example might be a binary matrix used to power web searches - columns representing search terms and rows representing searches,…

#### Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records.  It is often, but not always, randomly drawn.  In matrix form, the rows are records…

#### Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation.  When there are…
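The statistical sense described here (subtract the mean, divide by the standard deviation) can be sketched as follows. The data are hypothetical, and using the sample standard deviation rather than the population form is an assumption on my part.

```python
import statistics

def normalize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; statistics.pstdev gives the population form
    if sd == 0:
        raise ValueError("cannot normalize a constant variable")
    return [(v - mean) / sd for v in values]

incomes = [30_000, 45_000, 60_000, 75_000, 90_000]  # hypothetical variable
z_scores = normalize(incomes)
print([round(z, 3) for z in z_scores])
```

After this step the variable has mean 0 and standard deviation 1, which puts variables measured in different units on a common scale.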

#### Week #43 – HDFS

HDFS is the Hadoop Distributed File System.  It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

#### Week #42 – Kruskal – Wallis Test

The Kruskal-Wallis test is a nonparametric test for finding if three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.

#### Week #32 – False Discovery Rate

A "discovery" is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, false positives (Type I errors). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).

#### Week #23 – Netflix Contest

The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available.  Individuals and teams then compete to develop the best performing model.

#### Week #20 – R

This week's word is actually a letter.  R is a statistical computing language and environment, an open-source implementation of the S language from Bell Labs; S-PLUS is a commercial implementation of the same language.

#### Week #16 – Moving Average

In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
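The definition translates directly into code: the forecast for time t averages the w values that end at t-1. This is a minimal sketch with a hypothetical series; returning `None` where fewer than w prior values exist is my own convention.

```python
def moving_average_forecast(series, w):
    """forecasts[t] is the mean of series[t-w:t]; None when fewer than w values precede t."""
    forecasts = []
    for t in range(len(series) + 1):  # the last entry is a one-step-ahead forecast
        if t < w:
            forecasts.append(None)
        else:
            forecasts.append(sum(series[t - w:t]) / w)
    return forecasts

series = [10, 12, 14, 16, 18]  # hypothetical observations for t = 0..4
forecasts = moving_average_forecast(series, 3)
print(forecasts)
```

With w = 3, the forecast for t = 3 is the average of the values at t = 0, 1, and 2, so the value observed at t = 3 itself is never used in its own forecast.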

#### Week #15 – Interaction term

In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.

#### Week #14 – Naive forecast

A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).

#### Week #9 – Overdispersion

In discrete response models, overdispersion occurs when there is more correlation in the data than is allowed by the assumptions that the model makes.

#### Week #8 – Confusion matrix

In a classification model, the confusion matrix shows the counts of correct and erroneous classifications.  In a binary classification problem, the matrix consists of 4 cells.
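For the binary case, the four cells can be tallied as shown below. The labels (1 = positive class, 0 = negative class), the TP/FN/FP/TN cell names, and the example data are assumptions for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) combination for 0/1 labels."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # actual 1, predicted 1
        "FN": counts[(1, 0)],  # actual 1, predicted 0
        "FP": counts[(0, 1)],  # actual 0, predicted 1
        "TN": counts[(0, 0)],  # actual 0, predicted 0
    }

actual = [1, 1, 1, 0, 0, 0, 0, 1]     # hypothetical true classes
predicted = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical model output

cells = confusion_matrix(actual, predicted)
print(cells)
```

The two correct cells (TP and TN) and the two erroneous cells (FN and FP) always sum to the number of records classified.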

#### Week #5 – Features vs. Variables

The predictors in a predictive model are sometimes given different terms by different disciplines.  Traditional statisticians think in terms of variables.

#### Week #48 – Structured vs. unstructured data

Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).

#### Week #39 – Censoring

Censoring in time-series data occurs when some event causes subjects to cease producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.

#### Week #32 – Predictive modeling

Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).

#### Week #29 – Goodness-of-fit

Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which

#### Week #23 – Adjacency Matrix

An adjacency matrix describes the relationships in a network. Nodes are listed in the top..
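As a small sketch, the function below builds an adjacency matrix for an undirected network, with nodes indexing both rows and columns and a 1 marking a link. The node names, edge list, and 0/1 encoding are illustrative assumptions.

```python
def adjacency_matrix(nodes, edges):
    """Square 0/1 matrix; entry [i][j] is 1 when nodes i and j are linked."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mirror the entry
    return matrix

nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]  # hypothetical links

matrix = adjacency_matrix(nodes, edges)
for node, row in zip(nodes, matrix):
    print(node, row)
```

For an undirected network the matrix is symmetric, so each link appears twice, once on each side of the diagonal.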

#### Week #51 – Type 1 error

In a test of significance (also called a hypothesis test), Type I error is the error of rejecting the null hypothesis when it is true -- of saying an effect or event is statistically significant when it is not.

#### Week #49 – Data partitioning

Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set (used to fit the model), the validation set (used to compare models), and the test set (used to predict performance on new data).
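A minimal sketch of a three-way partition: shuffle the records once, then slice. The 60/20/20 proportions, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle, then split into non-overlapping training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

records = list(range(100))  # hypothetical record IDs
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Shuffling before slicing keeps any ordering in the original file (by date or by customer, for example) from leaking into one partition.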

#### Week #43 – Longitudinal data

Longitudinal data records multiple observations over time for a set of individuals or units. A typical..

#### Week #42 – Cross-sectional data

Cross-sectional data refer to observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual.  A simple...

#### Week #32 – CHAID

CHAID stands for Chi-squared Automatic Interaction Detector. It is a method for building classification trees and regression trees from a training sample comprising already-classified objects.

#### Week # 29 – Training data

Also called the training sample, training set, calibration sample.  The context is predictive modeling (also called supervised data mining) -  where you have data with multiple predictor variables and a single known outcome or target variable.

#### Week #18 – Centroid

The centroid is a measure of center in multi-dimensional space.
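In the usual definition, the centroid of a set of points is the vector of per-dimension means. A short sketch with hypothetical two-dimensional points:

```python
def centroid(points):
    """Mean of each coordinate across all points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical points: the corners of a 4-by-3 rectangle
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
center = centroid(points)
print(center)
```

The same function works in any number of dimensions, since `zip(*points)` groups the values one coordinate at a time.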

#### 2013 – The International Year of Statistics

Promoting better understanding of statistics throughout the world.

#### Congratulations to Michelle Everson!

New Editor of Journal of Statistics Education

#### Airline passenger screening can be random

Read Peter's Letter to the Editor in Saturday's Washington Post.

#### Churn Trigger

Last year's popular story out of the Predictive Analytics World conference series was Andrew Pole's presentation of Target's methodology for predicting which customers were pregnant.

#### Randomized Trials on online learning

Evidence shows that there is no significant difference between taking an online introductory statistics course and a traditional in-person class.

Facebook began trading around 11:30 this morning, and I spent 8 minutes

#### Congratulations to Thomas Lumley!

Newly elected American Statistical Association (ASA) Fellow, and recognized for his outstanding professional contributions to and leadership in the field of statistical science.

#### Immigration

Arizona's immigration law goes before the Supreme Court this week...

#### Revisiting Catastrophe Modeling Assistant

I saw this job posting a while ago, and, in my next life,

#### Julian Simon birthday

February 12 was the 80th anniversary of the birth of Julian Simon, an early pioneer in resampling methods.

#### Statistics for Future Presidents

Statistics for Future Presidents - Steve Pierson, Director of Science Policy at the ASA, wrote an interesting blog post wondering how statistics for future presidents (or policymakers more generally) would compare with the recommended statistical skills/concepts for others. Take a look and let him know!

#### Congratulations to David Unwin on a New Edited Volume

Teaching Geographic Information Science and Technology in Higher Education, 2012 (Wiley)

#### The Data Scientist

The stories of the prospective Facebook IPO and prior IPOs from LinkedIn, Pandora, and Groupon all involve "data scientists."  Read an interview with Monica Rogati, Senior Data Scientist at LinkedIn, to see the connection.

#### Congratulations to Michelle Everson for winning the 2011 Waller Education Award.

Dr. Michelle Everson is recognized for her outstanding contributions to and innovation in the teaching of elementary statistics.

#### Popular Mistakes in Data Mining

John Elder's presentations on common data mining mistakes are a must-see if you have any experience or plans in the data mining arena.

#### Coffee causes cancer?

"Any claim coming from an observational study is most likely to be wrong." Thus begins "Deming, data and observational studies," just published in "Significance Magazine" (Sept. 2011).

#### The sacrifice bunt

I was watching a Washington Nationals game on TV a couple of days ago, and the concept of "expected value" ...

#### Epidemiologist joke

A neurosurgeon, pathologist and epidemiologist are each told to examine a can of sardines on a table in a closed room, and present a report.

#### What do teenagers want?

What do teenagers want? More importantly for the music industry, what music will they buy?

#### The Power of Round

Advertisers shy away from round numbers, believing that \$99 appears significantly cheaper than \$100...

Did the NCAA get the March Madness rankings right? Check out SportsMeasures.com

#### Bees on the attack

What does Matt Asher's article "Attack of the Hair Trigger Bees" have to do with global warming? Matt Asher runs statisticsblog.com ...

#### The First Gallup Poll

The first Gallup Poll was published in October, 1935. In America Speaks,

#### Catastrophe Modeling Assistant

Thinking about careers that use statistics? The job title "catastrophe modeling assistant" caught my eye recently in a job announcement. ...

#### Random Monkeys

One of my gifts this holiday season was "The Drunkard's Walk: How Randomness Rules Our Lives,"