Oct 19: Data Literacy – The Chainsaw Case

“In the age of Big Data we often believe that our predictions about the future are better than ever before. But ...in the real world, we often get better results by using simple rules and considering less information.” (From a review of Gerd Gigerenzer’s Risk Savvy)…

Data Literacy – The Chainsaw Case

A famous business school case by Harvard Professor Michael Porter on forecasting chainsaw sales dramatically illustrated the limits of statistical models when common business sense and clear-eyed thinking are missing. In the chainsaw case, students were asked to forecast the future U.S. demand for chainsaws,…

Word of the Week – Drift

In deployed machine learning pipelines, “drift” refers to changes in the model environment that cause model performance to degrade over time.  Drift might result from data quality changes, for example, increasing amounts of missing values in the input data.  Or a company might alter the…

Book Review – Noise

Who would have thought that an entire book devoted to the bias-variance tradeoff would make it to the NY Times business best seller list? The book is the recently-published Noise, by Daniel Kahneman, Olivier Sibony and Cass R. Sunstein, and, as of the beginning of…

Word of the Week – Label Spreading

A common problem in machine learning is the “rare case” situation. In many classification problems, the class of interest (fraud, purchase by a web visitor, death of a patient) is rare enough that a data sample may not have enough instances to generate useful predictions.…

July 22: Odds and Betting

This week we look at odds and betting; our course spotlights are July 23 - Aug 20: SQL - Responsible Data Science July 30 - Aug 27: SQL - Biostatistics 1 - For Medical Science and Public Health See you in class! - Peter Bruce, Founder of The Institute for Statistics…

Why Statisticians Like Odds

In your introductory statistics class, probability took center stage. Odds were for gamblers. But it turns out odds play an important role in statistics, too. The relationship between the two is simple. To estimate the probability that event “A” will happen, we divide the number…
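
The probability-odds relationship mentioned in this excerpt can be sketched in a few lines of Python. This is a minimal illustration rather than code from the article: the function names are hypothetical, and it assumes the standard definition of odds in favor of an event as p / (1 - p).

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability p (0 <= p < 1) to odds in favor of the event."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in the interval [0, 1)")
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Convert non-negative odds in favor of an event back to a probability."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


# Illustrative value only: an event with probability 0.75 has odds of 3 (3 to 1).
p = 0.75
odds = probability_to_odds(p)
print(f"probability {p} -> odds {odds}")
print(f"odds {odds} -> probability {odds_to_probability(odds)}")
```

Keeping the two conversions as separate functions makes the round trip easy to check.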

May 5: Deceptive Data Leaks

This week we discuss data leaks, which can render apparently well-performing machine learning models useless. Our spotlight is on the Thomas Edison State University Data Science Analytics Master’s Program, which was developed in partnership with Statistics.com. See you in class! - Peter Bruce, Founder of The Institute…

Controlling Leaks

Good psychics have a knack for getting their audience to reveal, unwittingly, information that can be turned around and used in a prediction.  Statisticians and data scientists fall prey to a related phenomenon, leakage, when they allow into their models highly predictive features that would…

March 9: Statistics and Data Science in Practice

This week we spotlight our introductory statistics courses, and take a look at supposed “need to know” statistical concepts for data science. See you in class! - Peter Bruce, Founder of The Institute for Statistics Education at Statistics.com ................................... News You Need to Know What’s happening…

Feb 23: Statistics and Data Science in Practice

  This week we look at a backdoor method to make predictions come true. Our student spotlight is on Staci Taylor, Assistant Prof. of Nursing at Southeastern Louisiana State Univ., and our featured programs are: Analytics for Data Science Biostatistics See you in class! -…

Word of the Week – Ruin Theory

The classic Gambler’s Ruin puzzle has an actuarial parallel:  “Ruin Theory,” the calculations that govern what an insurance company should charge in premiums to reduce the probability of “ruin” for a given insurance line.  “Ruin” means encountering claims that exhaust initial reserves plus accumulated premiums. …

Puzzle – Gambler’s Ruin

Which is better - wealth or ability?  Fred Mosteller posed this question in his classic 1965 small compendium Fifty Challenging Problems in Probability, in the context of the Gambler’s Ruin puzzle.  Two players, M and N, engage in a game in which $1 is transferred…

Word of the Week:  Bias

In this feature, we sometimes highlight terms that can have different meanings to different parts of the data science community, or in different contexts. Today’s term is “bias.” To the lay person, and to those worried about the ethical problems sometimes posed by the deployment…

Why AI Projects Fail: Type III Error

We encountered “Type III error” when it turned out that most people answering our Puzzle question were, in fact, answering a different question from the one that was asked. Type III error is answering the wrong question, and it is a big factor in the…

Pandemic Puzzle – Redux

Late last year, we offered this puzzle, to which no answer was provided at the time: McKinsey recently came out with a study of how a shift to remote learning has affected math test scores of students in elementary school. The impact of school closures…

Dec 14: Statistics in Practice

In this week’s Briefing, we take a look at different strands of “purity” in AI. Our course spotlight is Jan 15 - Feb 12: Introduction to Data Literacy It's for you or anyone you know who needs to get more numerate! See you or them in…

From Kaggle to Cancel: The Culture of AI

“Extremism in the defense of liberty is no vice. Moderation in the pursuit of justice is no virtue.” So said Barry Goldwater, running for U.S. President in 1964. At the time, the voters rejected his pitch for purity, and his opponent, Lyndon Johnson, won a…

Word of the Week – Entity Extraction

In Natural Language Processing (our course on the subject starts Jan 15), entity extraction is the process of labeling chunks of text as entities (e.g. people or organizations).  Consider this phrase from the blog on close elections linked above:   “the tie was not between Jefferson…

How Much Power Do Voters Have?

The recent U.S. election was one of the most controversial and closest ever, and turnout percentage may be the highest in a century. Still, 37% of the voting age population did not vote. The traditional explanation for why people don’t vote is that they feel…

Oct 20: Statistics in Practice

In our Briefing this week, we revisit a topic we looked at a while ago, the epidemiology of gang activity in El Salvador, and look at the impact of Covid. Our spotlight is on: Oct 30 - Nov 27:  Sample Size and Power Determination   See…

Student Spotlight: Suma Krishnaprasad

Meet Suma Krishnaprasad, Data Scientist, Cleveland Clinic. Suma Krishnaprasad was the first data scientist hired at Cleveland Clinic Abu Dhabi. She works to develop and deploy predictive models, and also to provide statistical design and analysis for dozens of clinical trials and other research projects. She…

Gangs and Covid

Dr. Carlos Carcach is Professor & Director of the Center for Public Policy at the Escuela Superior de Economía y Negocios (ESEN) in Santa Tecla, El Salvador, and coordinator of ESEN's post-graduate program in predictive analytics, which offers online instruction in partnership with Statistics.com, using…

Student Spotlight: Jessica Sproviero

Meet Jessica Sproviero, Assistant Vice President at Merrill Financial Services. Jessica Sproviero has been working for several years in finance (asst. VP at Merrill Financial Services) while pursuing a degree at Thomas Edison State University (TESU).  She entered the data science portion of the program (provided…

Oct 6: Statistics in Practice

In our Briefing this week, we take a look at unemployment insurance fraud and a statistical tool for catching the crooks. Our course spotlight is on: Oct 23 - Nov 20:  Spatial Statistics See you in class! Peter Bruce Founder, Author, and Senior Scientist Unemployment…

Unemployment Insurance Fraud – Catching the Crooks

The worldwide Covid recession has led to a dramatic increase in unemployment and, hence, Unemployment Insurance (UI) claims. The figure below, from the U.S. Dept. of Labor (via https://www.npr.org/2020/03/26/821580191/unemployment-claims-expected-to-shatter-records), shows new claims on a weekly basis. Compare the March Covid-related peak on the right to…

Sept 30: Statistics in Practice

In our Briefing this week, we take a look at the role of statistics and analytics in war, from WWII to the present. Our curriculum spotlight is on our Rasch and IRT Mastery - key skills for those involved in designing, developing, and analyzing tests…

Historical Spotlight: John Tukey

The statistician John Tukey is regarded by some as the father, or at least one of the fathers, of data science.  Before Tukey, statistics meant inference (p-values, ANOVA, etc.) and models. Tukey brought to the discipline a whole new perspective: exploring the data to see…

Statistics at War

World War 2 gave the statistics profession its big growth spurt. Statistical methods such as correlation, regression, ANOVA, and significance testing were all worked out previously, but it was the war which brought large numbers of people to the field as a profession. They didn’t…

Sept 24: Statistics in Practice

This week we take a look at the interesting statistical problem of false positives, which naturally arise when you do lots of diagnostic tests or hypothesis tests.  Our course spotlight deals with another aspect of multiple statistical studies - how to combine them into a…

False Positive Rate – It’s Not What You Might Think

Famous Errors in Statistics

“A little knowledge is a dangerous thing,” said Alexander Pope in 1711; he could have been speaking of the use of statistics by experts in all fields. In this article, we look at three consequential mistakes in the field of statistics. Two of them are famous, the third required a deep dive into the corporate annual reports of…

Puzzle: Surgery or Radiation

Several decades ago, the dominant therapies for lung cancer were radiation, which offered better short-term survival rates, and surgery, which offered better long-term rates. A thought experiment was conducted in which physicians were randomly assigned to one of two groups and asked whether they would choose surgery. Group 1 was told: The one-month survival rate is 90%. Group 2 was told: There is 10% mortality in the first month. Yes, the two statements say the same thing. What did the two physician groups choose?

Sept 10: Statistics in Practice

This week we look at the second most popular percentage in statistics: 80%. Our course spotlight is on: Oct 30 –Nov 27: Sample Size and Power Determination See you in class! Peter Bruce Founder, Author, and Senior Scientist The Popular 80% Researchers and analysts are…

Type III Error

Type I error in statistical analysis is incorrectly rejecting the null hypothesis - being fooled by random chance into thinking something interesting is happening.  The arcane machinery of statistical inference - significance testing and confidence intervals - was erected to avoid Type I error.  Type II error…

The Popular 80%

Researchers and analysts are familiar with the famous 5% benchmark in statistics, the typical probability threshold at which a result becomes statistically significant.  (The probability in question is the probability that a result as interesting as the real-life result will happen in the null model.) …

Sept 2: Statistics in Practice

This week, our topic is Data Engineering, and we feature a guest blog by Will Goodrum, a data scientist at Elder Research. Our course spotlight is Oct 2 -30: Categorical Data Analysis See you in class! Peter Bruce Founder, Author, and Senior Scientist Four Common…

Four Common Pitfalls in Data Engineering

By Will Goodrum* Note: A version of this article was first published on the Elder Research blog. Your company has made it a strategic priority to become more data-driven. Good! A major anticipated component of this transition is to implement new data technology (e.g., a…

Relative Risk Ratio and Odds Ratio

The Relative Risk Ratio and Odds Ratio are both used to measure the medical effect of a treatment or variable to which people are exposed. The effect could be beneficial (from a therapy) or harmful (from a hazard).  Risk is the number of those having…
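
As a rough sketch of how the two measures differ, the Python below computes both from a 2x2 table. The function name and the counts are hypothetical, chosen only for illustration, and the code assumes the usual definitions: risk is the share of a group that has the outcome, and odds is the number with the outcome divided by the number without it.

```python
def risk_and_odds_ratios(exposed_events: int, exposed_total: int,
                         control_events: int, control_total: int) -> tuple[float, float]:
    """Return (relative risk, odds ratio) for exposed vs. control groups."""
    exposed_non_events = exposed_total - exposed_events
    control_non_events = control_total - control_events
    if min(exposed_events, control_events, exposed_non_events, control_non_events) <= 0:
        raise ValueError("every cell of the 2x2 table must be positive for this sketch")

    # Risk: events divided by everyone in the group.
    relative_risk = (exposed_events / exposed_total) / (control_events / control_total)
    # Odds: events divided by non-events; the ratio simplifies to a cross-product.
    odds_ratio = (exposed_events * control_non_events) / (exposed_non_events * control_events)
    return relative_risk, odds_ratio


# Hypothetical data: 20 of 100 exposed people and 10 of 100 controls have the outcome.
rr, odds_ratio = risk_and_odds_ratios(20, 100, 10, 100)
print(f"relative risk = {rr:.2f}, odds ratio = {odds_ratio:.2f}")
```

With these made-up counts the odds ratio is somewhat larger than the relative risk; the two measures move closer together as the outcome becomes rarer.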

Aug 25: Statistics in Practice

Vaccines for Covid are in the news, and this week we focus on the clinical trial process that validates vaccines as safe and effective.  Our spotlight is on our 10-course Biostatistics Certificate Program. You can get started with Jan 3-31:  Biostatistics 1 For Medical Science…

Aug 19: Statistics in Practice

Last week we looked at a notable failure of the statistic “AUC”; this week we dive deeper.  Our curriculum spotlight is our 10-course Analytics for Data Science certificate program: compare cost and coverage to ANY Master’s program!   You can get started with  Sep 11 – Oct…

AUC: A Fatally Flawed Model Metric

By John Elder, Founder and Chair of Elder Research, Inc.  Last week, in Recidivism, and the Failure of AUC, we saw how the use of “Area Under the Curve” (AUC) concealed bias against African-American defendants in a model predicting recidivism, that is, which defendants would re-offend. …

Recidivism, and the Failure of AUC

On average, 40% - 50% of convicted criminals in the U.S. go on to commit another crime (“recidivate”) after they are released.  For nearly 20 years, court systems have used statistical and machine learning algorithms to predict the probability of recidivism, and to guide sentencing…

Aug 4: Statistics in Practice

In this week’s brief, we feature a data-detective story: The Case of the Faulty Generator.  Our spotlight is on our Analytics for Data Science certificate program*. See you in class! *Earn a Bachelor’s Degree in Data Science and Analytics concurrently at Thomas Edison State University.…

Link Function

In generalized linear models, a link function maps a nonlinear relationship to a linear one so that a linear model can be fit (and then mapped to the original form).  For example, in logistic regression, we want to find the probability of success:  P(Y =…
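
For the logistic regression case, the link is conventionally the logit (log-odds). The sketch below shows the mapping in both directions. The function names and the coefficients `b0` and `b1` are made up for illustration and are not taken from the article.

```python
import math


def logit(p: float) -> float:
    """Link function: map a probability in (0, 1) to the log-odds scale."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(eta: float) -> float:
    """Inverse link: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical fitted coefficients for a single predictor x.
b0, b1 = -1.5, 0.8
x = 2.0
eta = b0 + b1 * x            # the linear model is fit on the log-odds scale
p = inverse_logit(eta)       # mapped back to the original probability scale
print(f"linear predictor = {eta:.3f}, probability = {p:.3f}")
```

The linear model lives entirely on the `eta` scale; the inverse link is applied only when a probability is needed.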

Where Outliers are Central

In casual statistical analysis, you sometimes hear references to outliers, along with the suggestion that they should be ignored or dropped from the analysis.  Quite the contrary: often it is the outliers that convey useful information.  They may represent errors in data collection, e.g. a…

July 28: Statistics in Practice

In this week’s brief we discuss outliers and anomalies, the unusual cases and events that often end up being the focus of attention. Our course spotlight is Nov 6 - Dec 4: Anomaly Detection If you’re interested in this topic, you should also consider the…

Small Ball: When a Downgrade is an Upgrade

In this mature age of digital marketing, companies have developed finely honed engines of automated and targeted promotion that factor in individual preferences and behavior.  The idea is to add small increments to revenue and profit. The system evolved in a stable era of economic…

Three Myths in Data Science

Myth 1:  It’s All About Prediction “Who cares whether we understand the model - as long as it predicts well!” This was one of the seeming benefits of the era of big data and predictive modeling, and it set data science apart from traditional statistics.  …

July 7: Statistics in Practice

As Independence Day inaugurates the official summer political season in the U.S. (a season that, in reality, no longer ends), we discuss in this week’s brief uplift models; our course spotlight is on Aug 21 - Sep 18: Persuasion Analytics and Targeting See you in…

Random Chance or Not?

On July 4, 1826, U.S. Independence Day, John Adams and Thomas Jefferson, the second and third presidents of the U.S., died within hours of each other.  Adams and Jefferson personified opposing factions in U.S. politics, with Adams favoring a strong central government and…

Model Interpretability

Model interpretability refers to the ability of a human to understand and articulate the relationship between a model’s predictors and its outcome.  For linear models, including linear and logistic regression, these relationships are seen directly in the model coefficients.  For black-box models like neural nets,…

Instructor Spotlight: Ken Strasma

Ken Strasma is a pioneer in the field of predictive analytics in high-stakes Presidential campaigns, serving as the National Targeting Director for President Obama’s historic 2008 campaign and for John Kerry’s 2004 presidential campaign. He produced the predictive analytics models used by the campaigns, and helped popularize…

Predicting “Do Not Disturbs”

In his book Predictive Analytics, Eric Siegel tells the story of marketing efforts at Telenor, a Norwegian telecom, to reduce churn (customers leaving for another carrier). Sophisticated analytics were used to guide the campaigns, but the managers gradually discovered that some campaigns were backfiring:  they…

June 30: Statistics in Practice

In this week’s Brief, the second in our series on statistical thinking, we discuss WWII convoys; our course spotlight is  July 10 - Aug 7: Spatial Statistics for GIS Using R  See you in class! Peter Bruce Founder, Author, and Senior Scientist Statistical Thinking 2  Safety…

Polytomous

Polytomous, applied to variables (usually outcome variables), means multi-category (i.e. more than two categories).  Synonym:  multinomial. 

June 23: Statistics in Practice

In this week’s Brief, the first in a Statistical Thinking series, we look at how people think about rare events. Our spotlight is on: July 3 - 31: Introductory Statistics (another session starts July 31) See you in class! Peter Bruce Founder, Author, and Senior…

Student Spotlight: Angelina Salinas

Meet Angelina Salinas, Data Analyst at Almacenes SIMAN Angelina Salinas started working for the retail store Almacenes Siman as a purchasing planner and, a couple of years later, got interested in data science and started to learn R. Shortly afterwards, the business intelligence group at…

Historical Spotlight: Iris Dataset

Can you identify this wildflower, photographed in a Massachusetts field?  And also identify its significance in the history of statistics?  This is the Blue Flag Iris, also called the Versicolor Iris, and it is one of three Iris species that make up the famous (in statistics) Iris…

June 16: Statistics in Practice

In this week’s brief we feature a guest blog on Ethical Data Science; our course spotlight is: July 17 – Aug 14: Logistic Regression See you in class! Peter Bruce Founder, Author, and Senior Scientist Ethical Data Science As data science has evolved into AI,…

Instructor Spotlight: Joseph Hilbe

Joseph Hilbe, a prolific author in the field of statistical modeling, taught a number of Statistics.com courses right up until his death, in March of 2017.  Hilbe was elected as a Fellow of the American Statistical Association; his expertise was in statistical modeling. He did…

Ethical Data Science

Guest Blog - Grant Fleming, Data Scientist, Elder Research Progress in data science is largely driven by the ever-improving predictive performance of increasingly complex black-box models. However, these predictive gains have come at the expense of losing the ability to interpret the relationships derived between…

June 12: Statistics in Practice

In this Brief, we visit the issue of “statistical arbitrage” in financial markets, and spotlight two courses: June 12 - July 10:  Financial Risk Modeling (today) July 10 - Aug 7:  Spatial Statistics for GIS Using R See you in class! P.S.  Our newest course,…

Statistical Arbitrage

An economics professor and an engineering professor were walking across campus.  The engineering professor spots something lying in the grass - “Look- here’s a $20 bill!”  The economist doesn’t bother to look.  “It can’t be - somebody would have picked it up.” This old joke…

June 2: Statistics in Practice

Fear of catching Covid-19 dominates the world, so this week we briefly review how humans think about probabilities, in the context of Covid-19.  Prior beliefs figure heavily in probability calculations, so our course spotlight is on:  July 3 - 31:  Introduction to Bayesian Statistics  See you…

Bayesian Statistics

Bayesian statistics provides probability estimates of the true state of the world. An unremarkable statement, you might think - what else would statistics be for? But classical frequentist statistics, strictly speaking, only provide estimates of the state of a hothouse world, estimates that must be translated…

Student Spotlight: Paul Olsztyn

Meet Paul Olsztyn, Senior Data Scientist at NovoDynamics. Paul Olsztyn designs and implements databases at NovoDynamics, a company that creates and deploys large scale data systems for corporations.  As his company responded to customer needs for more predictive analytics by building greater capacity in this…

May 26: Statistics in Practice

This week we return to Coronavirus data to look at new analyses that use mobile phone data to estimate the effects of social distancing restrictions, a vital question now as we see the world falling into “lockdown recession.”  Speaking of economic matters, our course spotlight…

Density

As Covid-19 continues to spread, so will research on its behavior.  Models that rely mainly on time-series data will expand to cover relevant other predictors (covariates), and one such predictor will be gregariousness.  How to measure it?  In psychology there is the standard personality trait…

May 19: Statistics in Practice

This week we take a look at evolutionary algorithms (it was 150 years ago that Charles Darwin first used the term “evolution” in his writings).  Our course spotlight is: July 17 - Aug 14:  Optimization with Linear Programming See you in class! - Peter Bruce…

Instructor Spotlight: Wayne Folta

Wayne Folta is a Lead Data Scientist with Elder Research, a leading data science consulting company and the parent of Statistics.com.  Wayne’s current ongoing project involves the extraction, analysis and redaction of text.  For example, a healthcare organization might need to release records, stripped of…

Parameterized

Parameterized code in computer programs (or visualizations or spreadsheets) is code where the arguments being operated on are defined once as a parameter, at the beginning, so they do not have to be repeatedly explicitly defined in the body of the code.  This allows for…

Evolutionary Algorithms

It was 150 years ago when Darwin first used the term “evolution” in his writing (in his book The Descent of Man).  Two months ago, in The Normal Share of Paupers, I briefly discussed the unfortunate eugenics baggage that the discipline of statistics inherited from…

Student Spotlight: Timothy Young

Meet Timothy Young, a Contract Administrator for the County of Los Angeles Timothy recently started the Data Science Analytics Bachelor’s Degree program that Statistics.com offers in conjunction with Thomas Edison State University (TESU) and has already been able to put his learning to work.  At…

May 12: Statistics in Practice

In this Brief, we dive into the terms “sensitivity” and “specificity” and their relatives.  In our course spotlight, clinical trials is the topic.  Now there’s a site just for the 800+ clinical trials associated with Covid-19 (treatments and vaccines).  Is it time for you to…

Sensitivity and Specificity

We defined these terms already (see this blog), but how can you remember which is which, so you don’t have to look them up?  If you can remember the order in which to recite them - sensitivity then specificity, it’s easy.  Think “positive and negative”…
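
The "positive and negative" pairing can be made concrete with a small Python sketch. The function names and counts below are hypothetical. The code assumes the standard definitions: sensitivity is the share of actual positives that are correctly flagged, and specificity is the share of actual negatives that are correctly cleared.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of actual positives that the test correctly identifies."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Share of actual negatives that the test correctly identifies."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical test results: 100 people with the condition, 1000 without it.
tp, fn = 90, 10      # among the 100 actual positives
tn, fp = 950, 50     # among the 1000 actual negatives

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

Each function looks at only one row of the confusion matrix: sensitivity at the actual positives, specificity at the actual negatives.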

May 5: Statistics in Practice

In this week’s Brief, we look deeper into the question of whether Covid-19 is a senior citizen disease.  Our course spotlight is twofold: Start in May or June:  Mastery in Statistical Modeling (3 courses) June 12 to July 10  Analyzing and Modeling Covid-19 Data See…

Decision Stumps

A decision stump is a decision tree with just one decision, leading to two or more leaves. For example, in this decision stump a borrower score of 0.475 or greater leads to a classification of “loan will default” while a borrower score less than 0.475…
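
A stump like this one amounts to a single comparison. The sketch below uses the 0.475 threshold and the “loan will default” label from the excerpt. The function name is hypothetical, and because the excerpt is cut off before naming the other leaf, the second label is only a placeholder assumption.

```python
DEFAULT_LABEL = "loan will default"
OTHER_LABEL = "other class (placeholder, not named in the excerpt)"


def classify_borrower(score: float, threshold: float = 0.475) -> str:
    """One-split decision stump: a single test on the borrower score."""
    if score >= threshold:
        return DEFAULT_LABEL
    return OTHER_LABEL


# Hypothetical borrower scores, for illustration only.
for score in (0.30, 0.475, 0.80):
    print(f"score {score:.3f} -> {classify_borrower(score)}")
```

A score exactly at the threshold falls on the “loan will default” side, matching the “0.475 or greater” wording.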

Miasma

As more information arrives about the Coronavirus, researchers point more and more to airborne particles and aerosols as the mechanism of spread. Photographic images of a sneeze, such as this one from Lydia Bourouiba at MIT (source here), have been seen by many. It turns…

R0 (R-nought)

For infectious diseases, R0 (R-nought) is the unimpeded replication rate of the disease pathogen in a naive (not immune) population.  An R0 of 2 means that each person with the disease infects two others.  Some things to keep in mind:    An R0 of one means…
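
To see what “each person infects two others” implies, here is a deliberately simplified Python sketch. The function name is hypothetical. It assumes a single initial case, a fully naive population, and no depletion of susceptible people, so it illustrates the arithmetic only and is not an epidemic model.

```python
def new_cases_by_generation(r0: float, generations: int) -> list[float]:
    """New infections in each generation, starting from one case (generation 0).

    Simplifying assumption: every case infects exactly r0 others, unimpeded.
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return [r0 ** g for g in range(generations + 1)]


# With R0 = 2, each generation of new cases is twice the size of the previous one.
cases = new_cases_by_generation(2, 4)
print(cases)
print("cumulative cases:", sum(cases))
```

In practice, immunity and interventions reduce the effective rate over time, which this sketch ignores.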
