Feb 24: Statistics in Practice

In this week’s Brief, we look at social categories, and the role that statistics and data science have played in social engineering - 100 years ago and today. Our course spotlight is April 3 - May 1: Categorical Data Analysis. See you in class! -…

The Normal Share of Paupers

In 2009, China began regional pilot programs that extended credit scoring to a broader purpose - scoring a person’s “social credit.” A hundred years earlier, at the height of the eugenics craze, the famous statistician Francis Galton undertook to repurpose statistical concepts in service of social…

Purity

In classification, purity measures the extent to which a group of records share the same class.  It is also termed class purity or homogeneity, and sometimes impurity is measured instead.  The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where…
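As a minimal sketch of the two-class calculation mentioned above, the following Python computes p(1-p) for a group of records. The function names and the example groups are hypothetical, chosen only for illustration. Note that many references define Gini impurity as 1 minus the sum of squared class proportions, which for two classes equals 2p(1-p); that differs from the form quoted here only by a constant factor of 2.

```python
def proportion_of_ones(labels):
    """Share of records in the group whose class is 1."""
    if not labels:
        raise ValueError("group must contain at least one record")
    return sum(1 for y in labels if y == 1) / len(labels)


def gini_two_class(p):
    """Two-class Gini impurity in the p(1-p) form quoted in the text.

    p is the proportion of records in one of the two classes.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return p * (1.0 - p)


# Hypothetical groups of class labels (illustration only).
pure_group = [1, 1, 1, 1]    # every record shares the same class
mixed_group = [1, 0, 1, 0]   # evenly split between the two classes

print(gini_two_class(proportion_of_ones(pure_group)))   # 0.0
print(gini_two_class(proportion_of_ones(mixed_group)))  # 0.25
```

A perfectly pure group scores 0, and the score peaks when the group is split evenly between the two classes.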

Predictor P-Values in Predictive Modeling

Not So Useful: Predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value - they measure the probability that a randomly shuffled model could have produced a coefficient as great as the fitted value. They are of limited…

UpLift and Persuasion

The goal of any direct mail campaign, or other messaging effort, is to persuade somebody to do something.  In the business world, it is usually to buy something. In the political world, it is usually to vote for someone (or, if you think you know…

Feb 17: Statistics in Practice

Last week we looked at several metrics for assessing the performance of classification models - accuracy, receiver operating characteristics (ROC) curves, and lift (gains).  In this week’s Brief we move beyond lift and cover uplift. Our course spotlight again is: Feb 28 - Mar 27:…

ROC, Lift and Gains Curves

There are various metrics for assessing the performance of a classification model.  It matters which one you use. The simplest is accuracy - the proportion of cases correctly classified.  In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric…
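To make the definition of accuracy concrete, here is a small sketch in Python. The data are invented for illustration (2 cases of class 1 among 100 records), and the "classifier" is a hypothetical one that predicts 0 for every case. It is not taken from the article; it only shows how the proportion of correctly classified cases behaves when the class of interest is rare.

```python
def accuracy(actual, predicted):
    """Proportion of cases whose predicted class matches the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one case")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Invented data: the outcome of interest ("1") is rare, 2 cases in 100.
actual = [1] * 2 + [0] * 98

# Hypothetical naive rule: classify every case as 0.
always_zero = [0] * 100

print(accuracy(actual, always_zero))  # 0.98
```

The naive rule scores 98% accuracy while identifying none of the 1's.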

Feb 10: Statistics in Practice

Tomorrow is the New Hampshire political primary in the US, and this week’s Brief looks at the statistical concept of lift.  Our spotlight is on: Feb 28 - Mar 27:   Persuasion Analytics and Targeting See you in class! - Peter Bruce, Founder Lift and…

Lift and Persuasion

Predicting the probability that something or someone will belong to a certain category (classification problems) is perhaps the oldest type of problem in analytics. Consider the category “repays loan.” Equifax, the oldest of the agencies that provide credit scores, was founded in 1899 as the…

Going Beyond the Canary Trap

In 2008, Elon Musk was concerned about leaks of sensitive information at Tesla Motors.  To catch the leaker, he prepared multiple unique versions of a new nondisclosure agreement he asked senior officers to sign.  Whichever version got leaked would reveal the leak source. This is…

Statistics.com Acquired by Elder Research

In last week’s Brief I described how The Institute’s courses, and its Mastery, Certificate and Degree programs would continue without interruption, following our acquisition by Elder Research, Inc.  Now I’d like to talk about how the Institute’s students stand to gain from the expertise and…

Feb 3: Statistics in Practice

In this week’s blog, we discuss our recent acquisition by Elder Research Inc. We also look at the “Canary Trap” and its connection to text mining. Our course spotlight is on Jan 31 to Feb 28: Text Mining using Python (still open for registrations, first…

Choosing the Right Analytics Problem

The “streetlight effect”: A man is looking for his keys under a streetlight. Policeman: “Where did you lose them?” Man: “In the alley, near the door to the bar.” Policeman: “Why are you looking here?” Man: “The light’s better.” This is related to the more…

Book Review: Mining Your Own Business by Gerhard Pilcher and Jeff Deal

This is a short book, “Mining Your Own Business: A Primer for Executives on Understanding and Employing Data Mining and Predictive Analytics,” befitting its intended audience - managers and executives with responsibility for data science and analytics projects. It outlines the requirements for success - not technical model…

Jan 29: Statistics in Practice + Announcement

This week we discuss the importance of choosing the right analytics problem, with a guest blog from Elder Research, Inc., a data science and analytics consulting and training company, with whom we have just joined forces.   Our course spotlight is on: Feb 14 - Mar 13:  Design…

Ethical Dilemmas in Data Science

Know those ads that follow you around the web, as a result of tracking cookies?  Many see them as an invasion of privacy, and EU rules made them subject to user consent.  Google recently announced that Chrome will eventually stop supporting these cookies.  A win…

Not Glamorous, But Lucrative

What do stormy days, weekend evenings, and the last day of the month have in common?  They are all good times to negotiate a good price for a new car. Inclement days yield less customer traffic in auto showrooms, which is good for the buyer. …

Betting and Statistics

Betting has had a long and close relationship with the science of probability and statistics. In the mid-1600s, the French intellectual and gambler Antoine Gombaud, who called himself Chevalier de Méré, enlisted the help of the mathematician Blaise Pascal to solve several puzzles involving dice…

Of Note: Operations Research (O/R) For Sewage

Older urban sewer systems are not sealed, dedicated route networks leading to sewage treatment plants.  Rather, to save money when they were built decades ago, in some places they shared pipes with storm water drainage systems that lead to creeks, rivers and bays.  As a…

Errors and Loss

Errors - differences between predicted values and actual values, also called residuals - are a key part of statistical models.  They form the raw material for various metrics of predictive model performance (accuracy, precision, recall, lift, etc.), and also the basis for diagnostics on descriptive…

Unforeseen Consequences in Data Science

After the massive Exxon Valdez oil spill, states passed laws boosting the liability of tanker companies for future spills. The result was not as intended: fly-by-night companies, whose bankruptcy would not be consequential, took over the trade. In this blog…

Intervals (confidence, prediction and tolerance)

All students of statistics encounter confidence intervals. Confidence intervals tell you, roughly, the interval within which you can be, say, 95% confident that the true value of some population parameter lies. This is not the precise technical definition, but it is how people use the…

Social Network Analysis (SNA) in Medicine

In hospitals, “sentinel events” are events that carry with them a significant risk of unexpected death or harm.  It is estimated that ⅔ of such sentinel events result from communications failures during the handoff of a patient from one provider to another (e.g. during a…

Industry Spotlight: Baseball (Sports) Statistics

The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Sabermetric Society in 1971 (the same year that Satchel Paige entered the Hall of Fame).  Analytics has come a long way in…

Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects. From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2. However, …

ROC Curve

The Receiver Operating Characteristics (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s: for example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:

 
