Mar 24: Statistics in Practice

In this week’s Brief, we look again at the statistics of Coronavirus.  We also spotlight our Health Analytics Mastery - a 3-course series in which you can choose from among Biostatistics 1 and 2, Designing Valid Statistical Studies, Epidemiologic Statistics, Introduction to Statistical Issues…

Covid-19 Parameters

There are many moving parts in modeling the spread of an epidemic, a subject that has lately attracted the attention of great numbers of statistically-oriented non-epidemiologists (like me).  I’ve put together a “lay statistician’s guide” to some of the important parameters and factors (and I…

Preliminary Paper

Here is a preliminary paper that suggests that RNA extraction kits, one of the main bottlenecks to Covid-19 testing in the US, can be skipped altogether and the next part of the assay (RT-qPCR) still works.  If confirmed, this result would have a major impact…

Mar 18: Statistics in Practice

In this week’s Brief, we look at the coronavirus, and the problem of estimating prevalence and mortality.  Our course spotlight is Nov 8 - Dec 6:  Epidemiologic Statistics (we're adding a spring session - email us to be notified when registration opens at ourcourses@statistics.com) See…

Standardized Death Rate

Often the death rate for a disease is fully known only for a group where the disease has been well studied.  For example, the 3711 passengers on the Diamond Princess cruise ship are, to date, the most fully studied coronavirus population.  All passengers were tested…

Coronavirus: To Test or Not to Test

In recent years, under the influence of statisticians, the medical profession has dialed back on screening tests.  With relatively rare conditions, widespread testing yields many false positives and doctor visits, whose collective cost can outweigh benefits.  Coronavirus advice follows this line - testing is limited…

Mar 16: Statistics in Practice

In this week’s Brief, we look at combining models.  Our course spotlight is April 17 - May 1:  Maximum Likelihood Estimation (MLE) You’ve probably seen lots of references to MLE in other contexts - this quick 2-week course (only $299) is your chance to study…

Regularized Model

In building statistical and machine learning models, regularization is the addition of penalty terms on the size of predictor coefficients to the fitting criterion, to discourage complex models that would otherwise overfit the data.  An example is ridge regression.
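
To make the penalty idea concrete, here is a minimal Python sketch (not from the original post) for the simplest possible case: one centered predictor and no intercept. Minimizing the sum of squared errors plus a penalty times the squared slope gives a closed-form slope, and a larger penalty pulls that slope toward zero. The function name ridge_slope and the toy data are illustrative assumptions.

```python
def ridge_slope(x, y, penalty):
    """Slope b minimizing sum((y - b*x)**2) + penalty * b**2 (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the derivative to zero gives b = sxy / (sxx + penalty).
    return sxy / (sxx + penalty)


# Toy data, already centered on zero (made up for illustration).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

ols = ridge_slope(x, y, penalty=0.0)      # ordinary least squares slope
shrunk = ridge_slope(x, y, penalty=10.0)  # penalized slope, closer to zero
print(ols, shrunk)
```

With a penalty of zero the function reduces to ordinary least squares; the penalized slope is always smaller in magnitude.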

Ensemble Learning

In his book, The Wisdom of Crowds, James Surowiecki recounts how Francis Galton, a prominent statistician from the 19th century, attended an event at a country fair in England where the object was to guess the weight of an ox.   Individual contestants were relatively well…

Mar 9: Statistics in Practice

In this week’s Brief, we look at ways to determine optimal sample size.  Our course spotlight is April 10 - May 8:  Sample Size and Power Determination See you in class! - Peter Bruce Founder, Author, and Senior Scientist Big Sample, Unreliable Result The 1948…

Ridge Regression

Ridge regression is a method of penalizing coefficients in a regression model to force a simpler model (one with smaller coefficients, although no predictors are dropped outright) than would be produced by an ordinary least squares model. The term “ridge” was applied by Arthur Hoerl in 1970, who saw similarities…

Big Sample, Unreliable Result

Which would you rather have?  A large sample that is biased, or a representative sample that is small?  The American Statistical Association committee that reviewed the 1948 Kinsey report on male sexual behavior, based on interviews with over 5000 men, left no doubt of their…

Mar 2: Statistics in Practice

In this week’s Brief, we look at hierarchical and mixed models.  Our course spotlight is April 10 - May 8:  Generalized Linear Models April 24 - May 22:  Mixed and Hierarchical Linear Models See you in class! - Peter Bruce Founder, Author, and Senior Scientist…

Factor

The term “factor” has different meanings in statistics that can be confusing because they conflict.   In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical - a factor variable is the same thing as a categorical variable.  These factor variables…

Mixed Models – When to Use

Companies now have a lot of data on their customers at an individual level.  Suppose you are tasked with forecasting customer spending at a grocery chain, and you want to understand how customer attributes, local economic factors, and store issues affect customer spending. You could…

Feb 24: Statistics in Practice

In this week’s Brief, we look at social categories, and the role that statistics and data science have played in social engineering - 100 years ago and today.  Our course spotlight is April 3 - May 1:  Categorical Data Analysis See you in class! -…

The Normal Share of Paupers

In 2009, China began regional pilot programs that repurposed credit scores for a broader use - scoring a person’s “social credit.”  100 years earlier, at the height of the eugenics craze, the famous statistician Francis Galton undertook to repurpose statistical concepts in service of social…

Purity

In classification, purity measures the extent to which a group of records share the same class.  It is also termed class purity or homogeneity, and sometimes impurity is measured instead.  The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where…
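
To make the calculation concrete, here is a minimal Python sketch (not from the original post). It uses the general multi-class form of Gini impurity, one minus the sum of squared class proportions; for two classes this works out to 2p(1-p), which has the same shape as the p(1-p) quoted above and differs only by a constant factor of two. The function name gini_impurity is an illustrative choice.

```python
from collections import Counter


def gini_impurity(labels):
    """One minus the sum of squared class proportions in a group of records."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure_group = [1, 1, 1, 1]    # every record in the same class
mixed_group = [0, 0, 1, 1]   # evenly split between two classes
print(gini_impurity(pure_group), gini_impurity(mixed_group))
```

A perfectly pure group scores zero, and an evenly split two-class group scores the two-class maximum.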

Predictor P-Values in Predictive Modeling

Not So Useful: Predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value - they measure the probability that a randomly shuffled model could have produced a coefficient as great as the fitted value.  They are of limited…

UpLift and Persuasion

The goal of any direct mail campaign, or other messaging effort, is to persuade somebody to do something.  In the business world, it is usually to buy something. In the political world, it is usually to vote for someone (or, if you think you know…

Feb 17: Statistics in Practice

Last week we looked at several metrics for assessing the performance of classification models - accuracy, receiver operating characteristics (ROC) curves, and lift (gains).  In this week’s Brief we move beyond lift and cover uplift. Our course spotlight again is: Feb 28 - Mar 27:…

ROC, Lift and Gains Curves

There are various metrics for assessing the performance of a classification model.  It matters which one you use. The simplest is accuracy - the proportion of cases correctly classified.  In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric…

Feb 10: Statistics in Practice

Tomorrow is the New Hampshire political primary in the US, and this week’s Brief looks at the statistical concept of lift.  Our spotlight is on: Feb 28 - Mar 27:   Persuasion Analytics and Targeting See you in class! - Peter Bruce, Founder Lift and…

Lift and Persuasion

Predicting the probability that something or someone will belong to a certain category (classification problems) is perhaps the oldest type of problem in analytics.  Consider the category “repays loan.” Equifax, the oldest of the agencies that provide credit scores, was founded in 1899 as the…

Going Beyond the Canary Trap

In 2008, Elon Musk was concerned about leaks of sensitive information at Tesla Motors.  To catch the leaker, he prepared multiple unique versions of a new nondisclosure agreement he asked senior officers to sign.  Whichever version got leaked would reveal the leak source. This is…

Statistics.com Acquired by Elder Research

In last week’s Brief I described how The Institute’s courses, and its Mastery, Certificate and Degree programs would continue without interruption, following our acquisition by Elder Research, Inc.  Now I’d like to talk about how the Institute’s students stand to gain from the expertise and…

Feb 3: Statistics in Practice

In this week’s blog, we discuss our recent acquisition by Elder Research Inc. We also look at the “Canary Trap” and its connection to text mining. Our course spotlight is on Jan 31 to Feb 28: Text Mining using Python (still open for registrations, first…

Choosing the Right Analytics Problem

The “streetlight effect:”  A man is looking for his keys under a streetlight.   Policeman:  “Where did you lose them?”   Man:  “In the alley, near the door to the bar.”   Policeman:  “Why are you looking here?”   Man:  “The light’s better.”   This is related to the more…

Jan 29: Statistics in Practice + Announcement

This week we discuss the importance of choosing the right analytics problem, with a guest blog from Elder Research, Inc., a data science and analytics consulting and training company, with whom we have just joined forces.   Our course spotlight is on: Feb 14 - Mar 13:  Design…

Jan 20: Statistics in Practice

This week’s Brief takes a look at ethical dilemmas in data science.  Our course spotlight is on  Feb 21 - Mar 20:  Network Analysis See you in class! - Peter Bruce, Founder and President The Institute for Statistics Education at Statistics.com Ethical Dilemmas in Data…

Ethical Dilemmas in Data Science

Know those ads that follow you around the web, as a result of tracking cookies?  Many see them as an invasion of privacy, and EU rules made them subject to user consent.  Google recently announced that Chrome will eventually stop supporting these cookies.  A win…

Kernel function

In a standard linear regression, a model is fit to a set of data (the training data); the same linear model applies to all the data.  In local regression methods, multiple models are fit to different neighborhoods of the data. A kernel function is used…

Jan 13: Statistics in Practice

In this Brief, we look at prosaic, but lucrative applications of predictive analytics and forecasting to the automotive industry.  Our spotlight is on our 3-course Predictive Analytics Mastery Series. Start this week with: Jan. 10 - Feb 7:   Predictive Analytics 1 See you in…

Industry Spotlight: Clinical Trials

 “Complete Your Clinical Trial With Our File Data” Clinical trials that support new drug development can cost over a billion dollars.  A new industry has popped up - data collectors and aggregators that provide digital data from their files as evidence in pharmaceutical clinical trials.…

Not Glamorous, But Lucrative

What do stormy days, weekend evenings, and the last day of the month have in common?  They are all good times to negotiate a good price for a new car. Inclement days yield less customer traffic in auto showrooms, which is good for the buyer. …

Jan 6: Statistics in Practice

Happy New Year! We are grateful for your continued support and appreciate your interest in learning more about statistics, analytics, and data science. In this new year, think of your learning as an investment both in the future of your company and your career. Below are courses, certificates, and…

Dec 30: Statistics in Practice

In this Brief, we take a look at the use of simulations as a tool to help sales people with a complex sale (high value, multiple aspects to consider).  Our spotlight is on the 3-course Mastery Series in Optimization Research, which starts January 10 with:…

Simulating the Complex Sale

Every 30 minutes a new business book is published; many of them purport to teach effective selling.  Most of them make sense, but solid quantitative analysis is rarely on the front burner. This is strange, because effective selling requires demonstrating value.  Sales professionals are taught…

Historical Spotlight: Bell Labs and Statistics

95 years ago, Bell Labs was founded as a joint project of AT&T and Western Electric.  Its primary mission was R&D for its parents’ fast-growing telecommunications businesses.  Since that time, Bell Labs has become a fabled American research institution, but has also suffered the vicissitudes of trying…

Analytics Meets the Cardboard Box

“Do you have a bag?” or “Would you like a bag?” have become common parts of the brick-and-mortar retail transaction.  Reusable bags, or simply doing without, have reduced the flow of plastic and paper into recycling.   E-commerce is a different matter.  I just unpacked a…

Dec 16: Statistics in Practice

In 2005, the cardboard box was inducted into the National Toy Hall of Fame (along with Candy Land). In our brief this week we consider whether analytics has anything to say about cardboard boxes. Our course spotlight is on: Jan 3 - 31:  R Programming…

Problem of the Week: A betting puzzle

QUESTION: A gambler playing against the “house” in a game like roulette or slots adopts the rule “Play until you win a certain amount, then stop.”  Will this ensure against player losses? What will be its effect on the house’s profit? ANSWER: Some look at this…

Dec 6: Statistics in Practice

This week we look at the casino business - in particular, the odds on slots. In our course spotlight, we start looking at some of the great stuff starting at the beginning of the new year. In January, you can get started with basic statistics or biostatistics,…

Google Zooms Out on Microtargeting

Google recently announced that it would further limit its election ads to (1) audience targeting based on age, gender, and general location (postal code level) and (2) context targeting (i.e. showing ads based on the content being viewed).  Up to this point, the application of predictive modeling to…

Betting and Statistics

Betting has had a long and close relationship with the science of probability and statistics.  In the mid-1600’s, the French intellectual and gambler Antoine Gombaud, who called himself Chevalier de Méré, enlisted the help of the mathematician Blaise Pascal to solve several puzzles involving dice…

Operations Research (O/R) For Sewage

Older urban sewer systems are not sealed, dedicated route networks leading to sewage treatment plants.  Rather, to save money when they were built decades ago, in some places they shared pipes with storm water drainage systems that lead to creeks, rivers and bays.  As a…

Nov 25: Statistics in Practice

In this week’s Brief, we take a look at the history of betting and how it is entwined with probabilistic decision-making. Probabilistic decision-making is also the focus of our 3-course Optimization Mastery, which covers linear programming, integer programming, simulation and other operations research (O/R) techniques. Start…

Errors and Loss

Errors - differences between predicted values and actual values, also called residuals - are a key part of statistical models.  They form the raw material for various metrics of predictive model performance (accuracy, precision, recall, lift, etc.), and also the basis for diagnostics on descriptive…

Student Spotlight: Peter Mulready

Peter Mulready is an independent consultant, who worked previously as a system architect at Boehringer Ingelheim, one of the world's largest pharmaceutical companies. Peter got his degree in biology, but his focus shifted to managing and optimizing the use of data in drug discovery research. …

e-cigarettes

Last week, the Trump administration announced a forthcoming ban on e-cigarettes, following news stories of a spate of deaths from vaping.  The Wall Street Journal, on Friday the 13th, published both an editorial and an op-ed piece suggesting that any harm from e-cigarettes is minor…

“Islands in Search of Continents”

“Islands in Search of Continents” is the subtitle of an article by Michael Clarke and Iain Chalmers in the Journal of the American Medical Association (1998; 280: 280-282).  It refers to the fact that many studies are conducted and reported in isolation from other studies on the…

Superusers

“Superusers” of medical services are the small fraction of patients that account for huge consumption of medical services.  An article published August 14, 2019 in JAMA Surgery (online) reports on the application of machine learning methods to Medicare data on 1,049,160 Medicare patients who underwent surgery,…

Aug 2: Statistics in Practice

In Part 1 of this week’s Brief, we looked at political analytics; in Part 2 we extend that look to commercial domains. Our course spotlight is Persuasion Analytics, taught by Ken Strasma, who pioneered the use of statistical modeling to microtarget voters in the 2004…

Probability

You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down

Algorithms

We have an extensive statistical glossary and have been sending out a "word of the week" newsfeed for a number of years.  Take a look at the results

Gittins Index

Consider the multi-armed bandit problem where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x% of the first, where x < 100%).  The Gittins index is [...]

Autoregressive

Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
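
As a small illustration (not from the original glossary entry), the Python sketch below fits the simplest such model, an AR(1), by ordinary least squares: each value is regressed on the value just before it. The function name fit_ar1 and the noise-free toy series are assumptions made for the example; real series would include a random error term.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = intercept + phi * y[t-1]."""
    prev = series[:-1]   # predictor: the lagged values
    curr = series[1:]    # outcome: the values one step later
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    intercept = mean_curr - phi * mean_prev
    return intercept, phi


# Toy series generated without noise from y[t] = 1 + 0.5 * y[t-1].
series = [0.0]
for _ in range(10):
    series.append(1.0 + 0.5 * series[-1])

intercept, phi = fit_ar1(series)
next_value = intercept + phi * series[-1]   # one-step-ahead forecast
print(intercept, phi, next_value)
```

Because the toy series has no noise, the fit recovers the generating intercept and coefficient almost exactly.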

Rectangular data

Rectangular data are the staple of statistical and machine learning models.  Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.

Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample.  It can occur in numerous situations; here are just a few:

Likert Scale

A "Likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale.  For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly."  Two key decisions the survey designer faces are

  • How many gradients to allow, and

  • Whether to include a neutral midpoint

Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
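
A minimal Python sketch of the insurance example (not from the original entry) follows. The helper name make_dummies and the is_... column names are hypothetical, chosen only for illustration. Note that in a regression with an intercept, one of the three dummies is usually dropped, since it is implied by the other two.

```python
def make_dummies(values, categories):
    """Turn each category value into a row of 0/1 indicator variables."""
    return [
        {f"is_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


policies = ["residential", "commercial", "automotive", "residential"]
categories = ["residential", "commercial", "automotive"]

rows = make_dummies(policies, categories)
for policy, row in zip(policies, rows):
    print(policy, row)
```

Each row has exactly one dummy set to 1, matching the category the policy belongs to.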

Things are Getting Better

In the visualization below, which line do you think represents the UN's forecast for the number of children in the world in the year 2100? Hans Rosling, in his book Factfulness, presents this chart and notes that in a sample of Norwegian teachers, only 9%…

Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2.  However, …

Conditional Probability Word of the Week

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
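
One way to work the question is with Bayes' rule, sketched here in Python (this worked calculation is an addition, not part of the original post). The function and argument names are illustrative; "sensitivity" stands for the 90% detection rate on fraudulent claims and "specificity" for the 80% correct rate on non-fraud claims.

```python
def posterior_fraud(prior, sensitivity, specificity):
    """P(fraud | flagged as fraud) by Bayes' rule."""
    true_positives = prior * sensitivity                # fraud and flagged
    false_positives = (1 - prior) * (1 - specificity)   # not fraud but flagged
    return true_positives / (true_positives + false_positives)


p = posterior_fraud(prior=0.10, sensitivity=0.90, specificity=0.80)
print(round(p, 3))
```

Out of every 100 claims, about 9 fraudulent ones are flagged along with about 18 legitimate ones, so roughly one flagged claim in three is actually fraudulent.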

Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."

ROC Curve

The Receiver Operating Characteristics (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s: for example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s).  It plots two quantities:

BOOTSTRAP

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
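
Here is a minimal Python sketch of the idea (an illustrative addition, using only the standard library). Each resample has the same size as the original data and is drawn with replacement, so some values appear more than once and others not at all. Repeating the draw many times and recording the mean each time gives a rough picture of how much the mean varies. The function names, the seed, and the toy data are assumptions.

```python
import random


def bootstrap_resample(data, rng):
    """Draw len(data) values from data, randomly and with replacement."""
    return [rng.choice(data) for _ in data]


def bootstrap_means(data, n_resamples, seed=0):
    """Mean of each of n_resamples bootstrap resamples."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    means = []
    for _ in range(n_resamples):
        sample = bootstrap_resample(data, rng)
        means.append(sum(sample) / len(sample))
    return means


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]   # toy data, made up for illustration
means = bootstrap_means(data, n_resamples=1000)
print(min(means), max(means))
```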

HYPERPARAMETER

Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.

SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

SPLINE

The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables.

NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

OVERFIT

As applied to statistical models, "overfit" means the model fits the training data too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data:

Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task.  Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms.  More typically, the corpus is a body of documents for…

Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2…
