#### Controlling Leaks

#### March 9: Statistics and Data Science in Practice

This week we spotlight our introductory statistics courses, and take a look at supposed "need to know" statistical concepts for data science.

#### Feb 23: Statistics and Data Science in Practice

This week we look at a backdoor method to make predictions come true. Our student spotlight is on Staci Taylor, Assistant Prof. of Nursing at Southeastern Louisiana State Univ.

#### Word of the Week – Ruin Theory

The classic Gambler’s Ruin puzzle has an actuarial parallel:  “Ruin Theory,” the calculations that govern what an insurance company should charge in premiums to reduce the probability of “ruin” for a given insurance line.  “Ruin” means encountering claims that exhaust initial reserves plus accumulated premiums. …

#### Puzzle – Gambler’s Ruin

Which is better - wealth or ability?  Fred Mosteller posed this question in his classic 1965 small compendium Fifty Challenging Problems in Probability, in the context of the Gambler’s Ruin puzzle.  Two players, M and N, engage in a game in which \$1 is transferred…
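
The wealth-versus-ability trade-off can be explored with the standard closed-form result for Gambler's Ruin. The sketch below is illustrative only: the teaser above is truncated, so the function name `win_probability`, the stakes, and the per-round win probability are assumptions, not the figures from Mosteller's problem.

```python
def win_probability(stake: int, total: int, p: float) -> float:
    """Chance that a player holding `stake` dollars ends up with all `total`
    dollars, when $1 changes hands each round and the player wins a round
    with probability p (rounds assumed independent)."""
    if not 0 < stake < total:
        raise ValueError("stake must be strictly between 0 and total")
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        # Fair game: the chance of winning is simply the share of total wealth.
        return stake / total
    r = q / p
    return (1 - r**stake) / (1 - r**total)


# Hypothetical comparison: a skilled but poorer player against a richer one.
skilled_but_poor = win_probability(stake=1, total=3, p=2 / 3)
```

Varying `stake` and `p` separately shows how much extra skill is needed to offset a smaller bankroll.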

#### As an Aspiring Data Scientist, What Do I Really Need to Know About Statistics?

As the popularity of data science has grown, so too has advice on how to get jobs in data science.  A common form of advice is a list of sample questions you might be asked at your job interview (see here and here for examples). …

#### Making Predictions Self-Fulfilling Prophecies

The Three Stooges comedy sketch “Ants in the Pantry,” (1935), features the Stooges as exterminators whose target customers don’t have enough pests.  The Stooges solve the problem by bringing their own pests (ants and mice) on sales calls, and surreptitiously releasing them. Having created the…

#### Student Spotlight – Staci Taylor

“I wouldn’t have said I loved statistics before I started at Statistics.com.  But now I do.  I really do.” Staci Taylor is an Assistant Professor of Nursing at Southeastern Louisiana State University.  She was asked to teach statistics, and signed up with Statistics.com to refresh…

#### Word of the Week:  Bias

In this feature, we sometimes highlight terms that can have different meanings to different parts of the data science community, or in different contexts. Today’s term is “bias.” To the lay person, and to those worried about the ethical problems sometimes posed by the deployment…

#### Why AI Projects Fail: Type III Error

We encountered “Type III error” when it turned out that most people answering our Puzzle question were, in fact, answering a different question from the one that was asked. Type III error is answering the wrong question, and it is a big factor in the…

#### Pandemic Puzzle – Redux

Late last year, we offered this puzzle, to which no answer was provided at the time: McKinsey recently came out with a study of how a shift to remote learning has affected math test scores of students in elementary school. The impact of school closures…

#### Dec 14: Statistics in Practice

In this week's Briefing, we take a look at different strands of "purity" in AI.

#### PUZZLE OF THE WEEK – School in the Pandemic

This week we offer a puzzle to which I have no answer.  McKinsey recently came out with a study of how a shift to remote learning has affected test scores of students in elementary school. The impact of school closures has been large and negative…

#### From Kaggle to Cancel: The Culture of AI

“Extremism in the defense of liberty is no vice. Moderation in the pursuit of justice is no virtue.” So said Barry Goldwater, running for U.S. President in 1964. At the time, the voters rejected his pitch for purity, and his opponent, Lyndon Johnson, won a…

#### Word of the Week – Entity Extraction

In Natural Language Processing (our course on the subject starts Jan 15), entity extraction is the process of labeling chunks of text as entities (e.g. people or organizations).  Consider this phrase from the blog on close elections linked above:   “the tie was not between Jefferson…

#### How Much Power Do Voters Have?

The recent U.S. election was one of the most controversial and closest ever, and turnout percentage may be the highest in a century. Still, 37% of the voting age population did not vote. The traditional explanation for why people don’t vote is that they feel…

#### Oct 20: Statistics in Practice

In our Briefing this week, we revisit a topic we looked at a while ago, the epidemiology of gang activity in El Salvador, and look at the impact of Covid.

Meet Suma Krishnaprasad, Data Scientist, Cleveland Clinic. Suma Krishnaprasad was the first data scientist hired at Cleveland Clinic Abu Dhabi. She works to develop and deploy predictive models, and also to provide statistical design and analysis for dozens of clinical trials and other research projects. She…

#### Gangs and Covid

Dr. Carlos Carcach is Professor & Director of the Center for Public Policy at the Escuela Superior de Economía y Negocios (ESEN) in Santa Tecla, El Salvador, and coordinator of ESEN's post-graduate program in predictive analytics, which offers online instruction in partnership with Statistics.com, using…

#### Student Spotlight: Jessica Sproviero

Meet Jessica Sproviero, Assistant Vice President at Merrill Financial Services. Jessica Sproviero has been working for several years in finance (asst. VP at Merrill Financial Services) while pursuing a degree at Thomas Edison State University (TESU).  She entered the data science portion of the program (provided…

#### Oct 6: Statistics in Practice

In our Briefing this week, we take a look at unemployment insurance fraud and a statistical tool for catching the crooks.

#### Unemployment Insurance Fraud – Catching the Crooks

The worldwide Covid recession has led to a dramatic increase in unemployment and, hence, Unemployment Insurance (UI) claims. The figure below, from the U.S. Dept. of Labor (via https://www.npr.org/2020/03/26/821580191/unemployment-claims-expected-to-shatter-records), shows new claims on a weekly basis. Compare the March Covid-related peak on the right to…

#### Sept 30: Statistics in Practice

In our Briefing this week, we take a look at the role of statistics and analytics in war, from WWII to the present. Our curriculum spotlight is on our Rasch and IRT Mastery - key skills for those involved in designing, developing, and analyzing tests.

#### Historical Spotlight: John Tukey

The statistician John Tukey is regarded by some as the father, or at least one of the fathers, of data science.  Before Tukey, statistics meant inference (p-values, ANOVA, etc.) and models. Tukey brought to the discipline a whole new perspective: exploring the data to see…

#### Statistics at War

World War 2 gave the statistics profession its big growth spurt. Statistical methods such as correlation, regression, ANOVA, and significance testing were all worked out previously, but it was the war which brought large numbers of people to the field as a profession. They didn’t…

#### Sept 24: Statistics in Practice

This week we take a look at the interesting statistical problem of false positives, which naturally arise when you do lots of diagnostic tests or hypothesis tests.

#### False Positive Rate – It’s Not What You Might Think

“A little knowledge is a dangerous thing,” said Alexander Pope in 1711; he could have been speaking of the use of statistics by experts in all fields. In this article, we look at three consequential mistakes in the field of statistics. Two of them are famous, the third required a deep dive into the corporate annual reports of

#### Famous Errors in Statistics

Several decades ago, the dominant therapies for lung cancer were radiation, which offered better short-term survival rates, and surgery, which offered better long-term rates. A thought experiment was conducted in which surgeons were randomly assigned to one of two groups and asked whether they would choose surgery. Group 1 was told: The one-month survival rate is 90%. Group 2 was told: There is 10% mortality in the first month. Yes, the two statements say the same thing. What did the two physician groups choose?

#### Sept 10: Statistics in Practice

This week we look at the second most popular percentage in statistics: 80%.

#### Type III Error

Type I error in statistical analysis is incorrectly rejecting the null hypothesis - being fooled by random chance into thinking something interesting is happening.  The arcane machinery of statistical inference - significance testing and confidence intervals - was erected to avoid Type I error.  Type II error…

#### The Popular 80%

Researchers and analysts are familiar with the famous 5% benchmark in statistics, the typical probability threshold at which a result becomes statistically significant.  (The probability in question is the probability that a result as interesting as the real-life result will happen in the null model.) …

#### Sept 2: Statistics in Practice

This week, our topic is Data Engineering, and we feature a guest blog by Will Goodrum, a data scientist at Elder Research.

#### Four Common Pitfalls in Data Engineering

By Will Goodrum* Note: A version of this article was first published on the Elder Research blog. Your company has made it a strategic priority to become more data-driven. Good! A major anticipated component of this transition is to implement new data technology (e.g., a…

#### Relative Risk Ratio and Odds Ratio

The Relative Risk Ratio and Odds Ratio are both used to measure the medical effect of a treatment or variable to which people are exposed. The effect could be beneficial (from a therapy) or harmful (from a hazard).  Risk is the number of those having…
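
A minimal sketch of how the two measures differ, computed from a 2x2 table. It assumes the usual definitions (risk = events / group size, odds = events / non-events). The function name and the counts are hypothetical and chosen only for illustration.

```python
def risk_and_odds_ratios(exposed_events, exposed_total, control_events, control_total):
    """Return (relative risk, odds ratio) for an exposed vs. a control group."""
    # Risk: proportion of each group that experiences the outcome.
    risk_exposed = exposed_events / exposed_total
    risk_control = control_events / control_total
    # Odds: those with the outcome divided by those without it.
    odds_exposed = exposed_events / (exposed_total - exposed_events)
    odds_control = control_events / (control_total - control_events)
    return risk_exposed / risk_control, odds_exposed / odds_control


# Hypothetical data: 20 of 100 exposed and 10 of 100 unexposed have the outcome.
relative_risk, odds_ratio = risk_and_odds_ratios(20, 100, 10, 100)
```

With these made-up counts the relative risk is 2.0 while the odds ratio is 2.25. The two measures move closer together as the outcome becomes rarer.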

#### Aug 25: Statistics in Practice

Vaccines for Covid are in the news, and this week we focus on the clinical trial process that validates vaccines as safe and effective.

#### Of Note: An outlier that lies in the middle of the data

An outlier or anomaly is typically defined as a case that is markedly distant or different from the bulk of the data.  Our July 28 blog on outliers and anomaly detection reported on one unusual case in which the outlier might lie fully within the…

#### Clinical Trial Process that Validates Vaccines as Safe and Effective

As of this writing, there are about 40 Coronavirus vaccines in the clinical trial process, plus another 135 in preclinical development. Russia has jumped the gun and “approved” a vaccine that has just begun Phase 3 trials, and, likewise, China has approved a pre-Phase 3…

#### Aug 19: Statistics in Practice

Last week we looked at a notable failure of the statistic "AUC"; this week we dive deeper.

#### AUC: A Fatally Flawed Model Metric

By John Elder, Founder and Chair of Elder Research, Inc.  Last week, in Recidivism, and the Failure of AUC, we saw how the use of “Area Under the Curve” (AUC) concealed bias against African-American defendants in a model predicting recidivism, that is, which defendants would re-offend. …

#### Recidivism, and the Failure of AUC

On average, 40% - 50% of convicted criminals in the U.S. go on to commit another crime (“recidivate”) after they are released.  For nearly 20 years, court systems have used statistical and machine learning algorithms to predict the probability of recidivism, and to guide sentencing…

#### Endpoint or Outcome (example: Covid-19 vaccine)

In a randomized experiment, the endpoint or outcome is a formal measure (statistic) of the result of the experiment.  In a randomized clinical trial preparatory to regulatory submission, there is often more than one outcome, due to the time and expense involved in conducting a…

#### Aug 4: Statistics in Practice

In this week's brief, we feature a data-detective story: The Case of the Faulty Generator.

In generalized linear models, a link function maps a nonlinear relationship to a linear one so that a linear model can be fit (and then mapped to the original form).  For example, in logistic regression, we want to find the probability of success:  P(Y =…
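
For logistic regression the usual link is the logit. The sketch below shows the mapping in both directions. The coefficients `b0` and `b1` are hypothetical, not fitted to any data.

```python
import math


def logit(p: float) -> float:
    """Link function: map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def inverse_logit(eta: float) -> float:
    """Inverse link: map a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients for a single predictor x.
b0, b1 = -1.5, 0.8
x = 2.0
linear_predictor = b0 + b1 * x              # the scale on which the linear model is fit
probability = inverse_logit(linear_predictor)  # mapped back to the original form
```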

#### Sira-Kvina Hydro Power –The Case of the Faulty Generator

Prepared by Peter Bruce, Mark Smith and Ramon Perez, this case study was originally published at elderresearch.com.   In early 2020, Sira-Kvina Kraftselskap, a large producer of hydroelectric power in Norway, suffered a breakdown of one of its major generators. Company technicians went through established diagnostics…

#### Where Outliers are Central

In casual statistical analysis, you sometimes hear references to outliers, along with the suggestion that they should be ignored or dropped from the analysis.  Quite the contrary: often it is the outliers that convey useful information.  They may represent errors in data collection, e.g. a…

#### July 28: Statistics in Practice

In this week's brief we discuss outliers and anomalies, the unusual cases and events that often end up being the focus of attention.

In this mature age of digital marketing, companies have developed finely honed engines of automated and targeted promotion that factor in individual preferences and behavior.  The idea is to add small increments to revenue and profit. The system evolved in a stable era of economic…

#### Three Myths in Data Science

Myth 1:  It’s All About Prediction “Who cares whether we understand the model - as long as it predicts well!” This was one of the seeming benefits of the era of big data and predictive modeling, and it set data science apart from traditional statistics.  …

#### July 21: Statistics in Practice

In this week's brief, a continuation of our "Statistical Thinking" series, we reflect on three "myths" in data science and statistics.

#### July 7: Statistics in Practice

As Independence Day inaugurates the official summer political season in the U.S. (a season that, in reality, no longer ends), this week's brief discusses uplift models.

#### Random Chance or Not?

On July 4, 1826, U.S. Independence Day, John Adams and Thomas Jefferson, the second and third presidents of the U.S., died within hours of each other.  Adams and Jefferson personified opposing factions in U.S. politics, with Adams favoring a strong central government and…

#### Model Interpretability

Model interpretability refers to the ability of a human to understand and articulate the relationship between a model’s predictors and its outcome.  For linear models, including linear and logistic regression, these relationships are seen directly in the model coefficients.  For black-box models like neural nets,…

#### Instructor Spotlight: Ken Strasma

Ken Strasma is a pioneer in the field of predictive analytics in high-stakes Presidential campaigns, serving as the National Targeting Director for President Obama’s historic 2008 campaign and for John Kerry’s 2004 presidential campaign. He produced the predictive analytics models used by the campaigns, and helped popularize…

#### Predicting “Do Not Disturbs”

In his book Predictive Analytics, Eric Siegel tells the story of marketing efforts at Telenor, a Norwegian telecom, to reduce churn (customers leaving for another carrier). Sophisticated analytics were used to guide the campaigns, but the managers gradually discovered that some campaigns were backfiring:  they…

#### June 30: Statistics in Practice

In this week's Brief, the second in our series on statistical thinking, we discuss WWII convoys.

#### Safety in Numbers – Calculating Probabilities for Convoys

Statistical Thinking 2: Safety in Numbers – Calculating Probabilities for Convoys  Early 1942 was a critical period for the Allies in WWII.  Russia was on its heels, with German armies at the gates of Moscow and preparing an offensive in southern Russia.  Alone among the…

#### Polytomous

Polytomous, applied to variables (usually outcome variables), means multi-category (i.e. more than two categories).  Synonym:  multinomial.

#### June 23: Statistics in Practice

In this week's Brief, the first in a Statistical Thinking series, we look at how people think about rare events.

#### Student Spotlight: Angelina Salinas

Meet Angelina Salinas, Data Analyst at Almacenes SIMAN. Angelina Salinas started working for the retail store Almacenes Siman as a purchasing planner and, a couple of years later, got interested in data science and started to learn R. Shortly afterwards, the business intelligence group at…

#### Historical Spotlight: Iris Dataset

Can you identify this wildflower, photographed in a Massachusetts field?  And also identify its significance in the history of statistics?  This is the Blue Flag Iris, also called the Versicolor Iris, and it is one of three Iris species that make up the famous (in statistics) Iris…

#### Rare Event Syndrome

Statistical Thinking 1   Several years ago, an NPR reporter wanted a comment from me for his story about an unusual event: a woman had won a state lottery jackpot for a second time. Winning once was low enough odds, but winning twice?   The reporter found…

#### June 16: Statistics in Practice

In this week's brief we feature a guest blog on Ethical Data Science.

#### Instructor Spotlight: Joseph Hilbe

Joseph Hilbe, a prolific author in the field of statistical modeling, taught a number of Statistics.com courses right up until his death, in March of 2017.  Hilbe was elected as a Fellow of the American Statistical Association; his expertise was in statistical modeling. He did…

#### Ethical Data Science

Guest Blog - Grant Fleming, Data Scientist, Elder Research. Progress in data science is largely driven by the ever-improving predictive performance of increasingly complex black-box models. However, these predictive gains have come at the expense of losing the ability to interpret the relationships derived between…

#### June 12: Statistics in Practice

In this Brief, we visit the issue of "statistical arbitrage" in financial markets.

#### Statistical Arbitrage

An economics professor and an engineering professor were walking across campus.  The engineering professor spots something lying in the grass - “Look- here’s a \$20 bill!”  The economist doesn’t bother to look.  “It can’t be - somebody would have picked it up.” This old joke…

#### June 2: Statistics in Practice

Fear of catching Covid-19 dominates the world, so this week we briefly review how humans think about probabilities, in the context of Covid-19.

#### Bayesian Statistics

Bayesian statistics provides probability estimates of the true state of the world. An unremarkable statement, you might think - what else would statistics be for? But classical frequentist statistics, strictly speaking, only provide estimates of the state of a hothouse world, estimates that must be translated…

#### When Probabilities Sum to More than One

In 1998, Craig Fox and Amos Tversky reported on a survey in which U.S. basketball fans were asked to judge the probability that each of 8 teams might win the championship.  Students of statistics can probably guess the outcome - the probabilities for all the…

#### Student Spotlight: Paul Olsztyn

Meet Paul Olsztyn, Senior Data Scientist at NovoDynamics. Paul Olsztyn designs and implements databases at NovoDynamics, a company that creates and deploys large scale data systems for corporations.  As his company responded to customer needs for more predictive analytics by building greater capacity in this…

#### May 26: Statistics in Practice

This week we return to Coronavirus data to look at new analyses that use mobile phone data to estimate the effects of social distancing restrictions, a vital question now as we see the world falling into "lockdown recession."

#### Density

As Covid-19 continues to spread, so will research on its behavior.  Models that rely mainly on time-series data will expand to cover relevant other predictors (covariates), and one such predictor will be gregariousness.  How to measure it?  In psychology there is the standard personality trait…

#### Tracking Your Wanderings, for the Public Good

A recent development in the modeling of Covid-19 data has been the use of mobile phone location data, now available from Google, to estimate the degree to which social distancing restrictions have been implemented, and the effect they have had.   One interesting analysis comes from…

#### May 19: Statistics in Practice

This week we take a look at evolutionary algorithms (it was 150 years ago that Charles Darwin first used the term "evolution" in his writings).

#### Instructor Spotlight: Wayne Folta

Wayne Folta is a Lead Data Scientist with Elder Research, a leading data science consulting company and the parent of Statistics.com.  Wayne’s current ongoing project involves the extraction, analysis and redaction of text.  For example, a healthcare organization might need to release records, stripped of…

#### Parameterized

Parameterized code in computer programs (or visualizations or spreadsheets) is code where the arguments being operated on are defined once as a parameter, at the beginning, so they do not have to be repeatedly explicitly defined in the body of the code.  This allows for…
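
A small illustration of the idea, with made-up names and values: the two parameters are defined once at the top, and the body of the code refers to them by name rather than repeating the literal numbers.

```python
# Parameters: defined once, at the beginning (values are illustrative).
TAX_RATE = 0.07
DISCOUNT = 0.10


def final_price(list_price: float) -> float:
    """Apply the discount, then the tax, using the parameters above."""
    discounted = list_price * (1 - DISCOUNT)
    return round(discounted * (1 + TAX_RATE), 2)


# Changing TAX_RATE or DISCOUNT in one place updates every calculation below.
prices = [final_price(p) for p in (20.0, 50.0)]
```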

#### Evolutionary Algorithms

It was 150 years ago when Darwin first used the term “evolution” in his writing (in his book The Descent of Man).  Two months ago, in The Normal Share of Paupers, I briefly discussed the unfortunate eugenics baggage that the discipline of statistics inherited from…

#### Student Spotlight: Timothy Young

Meet Timothy Young, a Contract Administrator for the County of Los Angeles. Timothy recently started the Data Science Analytics Bachelor’s Degree program that Statistics.com offers in conjunction with Thomas Edison State University (TESU) and has already been able to put his learning to work.  At…

#### May 12: Statistics in Practice

In this Brief, we dive into the terms "sensitivity" and "specificity" and their relatives.  In our course spotlight, clinical trials is the topic.

#### Sensitivity and Specificity

We defined these terms already (see this blog), but how can you remember which is which, so you don’t have to look them up?  If you can remember the order in which to recite them - sensitivity then specificity, it’s easy.  Think “positive and negative”…

#### May 5: Statistics in Practice

In this week's Brief, we look deeper into the question of whether Covid-19 is a senior citizen disease.

#### COVID-19: Sensitivity, Specificity, and More

Covid-19 has brought statistical concepts and terms into the news as never before. One confusing tangle is the array of terms surrounding diagnostic test results.  The most basic is accuracy - what percent of test results are correct.  This is not necessarily the most important…
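
As a rough sketch, the three most common terms can be computed from the four cells of a confusion matrix. The function name and the counts below are hypothetical and do not describe any real Covid-19 test.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all results that are correct
        "sensitivity": tp / (tp + fn),   # share of true positives that test positive
        "specificity": tn / (tn + fp),   # share of true negatives that test negative
    }


# Hypothetical results for 1,000 people, 100 of whom actually have the disease.
metrics = diagnostic_metrics(tp=90, fp=50, tn=850, fn=10)
```

In this made-up example accuracy is 94%, yet 50 of the 140 positive results are false, which is one reason accuracy alone is not necessarily the most important number.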

#### Decision Stumps

A decision stump is a decision tree with just one decision, leading to two or more leaves. For example, in this decision stump a borrower score of 0.475 or greater leads to a classification of “loan will default” while a borrower score less than 0.475…
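
The stump described above amounts to a single threshold rule. In this sketch the 0.475 cutoff and the "default" label come from the text. The label used for scores below the cutoff is an assumption, since the teaser is truncated.

```python
THRESHOLD = 0.475  # cutoff quoted in the text


def stump_predict(borrower_score: float) -> str:
    """One decision, two leaves."""
    if borrower_score >= THRESHOLD:
        return "default"
    return "no default"  # assumed label; the source text is cut off here


predictions = [stump_predict(s) for s in (0.30, 0.475, 0.80)]
```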

#### Miasma

As more information arrives about the Coronavirus, researchers point more and more to airborne particles and aerosols as the mechanism of spread. Photographic images of a sneeze, such as this one from Lydia Bourouiba at MIT (source here), have been seen by many. It turns…

#### R0 (R-nought)

For infectious diseases, R0 (R-nought) is the unimpeded replication rate of the disease pathogen in a naive (not immune) population.  An R0 of 2 means that each person with the disease infects two others.  Some things to keep in mind:    An R0 of one means…
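
A deliberately simplified sketch of what "unimpeded" replication implies: if every case infects R0 others, new infections multiply by R0 each generation. The function name is hypothetical, and the model ignores immunity, interventions, and the shrinking pool of susceptible people.

```python
def new_infections_by_generation(r0, generations, initial_cases=1):
    """New cases in each generation of spread, starting from generation 0."""
    return [initial_cases * r0**g for g in range(generations + 1)]


# With R0 = 2, each generation of new cases is twice the size of the previous one.
doubling = new_infections_by_generation(r0=2, generations=4)
# With R0 = 1, the number of new cases per generation stays level.
steady = new_infections_by_generation(r0=1, generations=4)
```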

#### Apr 28: Statistics in Practice

Models of virus growth are in the news, and this week we take a closer look at the modeling of epidemics.

#### Conversations with Data Scientists about R and Python

Dyed-in-the-wool software developers can get quite passionate about the relative virtues of one programming language or another, their debates sometimes threatening to transport you back to middle-school arguments about the greatest ballplayers of all time.  Though their computer passions find other outlets as well, data…

#### Apr 21: Statistics in Practice

In this week's Brief we take a look at Python vs. R, and feature some conversations with data scientists.

#### Apr 14: Statistics in Practice

In this week's Brief, we explore what data on the flu can tell us about Covid-19 counter-measures.

#### John Snow

John Snow is popularly regarded as the founder of the field of epidemiology, with his famous study of cholera in London.  Snow plotted cholera cases for a neighborhood served by two wells, and found that nearly all clustered around one of the wells, the Broad…

#### Apr 7: Statistics in Practice

In this week's Brief, we look in greater detail at Elder Research, Inc., which recently acquired Statistics.com.

#### Observation and Quote from John Elder, IV

"The hype around Artificial Intelligence, Machine Learning, and Data Science is enormous, so it’s tempting to be skeptical of the return on investment (ROI) claimed. Still, most of the results are real. Organizations may suspect there is value in their data assets but not be…

#### Elder Research Capabilities

In late December, Statistics.com was acquired by Elder Research, Inc. Many of you have asked for more detail, so here’s an introduction to the folks at Elder Research and some stories of what they do.  There are 100+ employees at Elder Research, led by John…

#### Apr 2: Statistics in Practice – Special Epi Course

In this special Brief we step back and look at various estimates of the projected death toll from the coronavirus.

#### Coronavirus Death Toll

There are tens of thousands of epidemiologists the world over, and we are beginning to see a bumper crop of forecasts for the ultimate 2020 death toll from Covid-19.  It’s a grim but important forecasting task. Most citizens would support draconian measures to prevent deaths…

#### Mar 31: Statistics in Practice

In this week's Brief, we look at p-values.

#### P-Values – Are They Needed?

Five years ago last month, the psychology journal Basic and Applied Social Psychology instigated a major debate in statistical circles when it said it would remove p-value citations from papers it published.  A year later, the American Statistical Association (ASA) released a statement on p-values…

#### The Depression Gene

The risks of large-scale testing, and the potential for false discovery, can be seen in the “discovery” of the genetic basis for anxiety and depression.  Specifically, serotonin transporter gene 5-HTTLPR. Color Genomics sells a genetic testing product that supposedly can predict which anti-depressant drug works…

#### Hazard

In biostatistics, hazard, or the hazard rate, is the instantaneous rate of an event (death, failure…).  It is the probability of the event occurring in a (vanishingly) small period of time, divided by the amount of time (mathematically it is the limit of this quantity…
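
In symbols, the limit described above is usually written as follows. Here T denotes the time of the event, and the condition T ≥ t (the event has not yet occurred by time t) is part of the standard definition rather than something stated in the truncated text.

```latex
h(t) = \lim_{\Delta t \to 0}
       \frac{P\left(t \le T < t + \Delta t \mid T \ge t\right)}{\Delta t}
```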