Sept 24: Statistics in Practice

This week we take a look at the interesting statistical problem of false positives, which naturally arise when you do lots of diagnostic tests or hypothesis tests.  Our course spotlight deals with another aspect of multiple statistical studies - how to combine them into a…

False Positive Rate – It’s Not What You Might Think

Famous Errors in Statistics

“A little knowledge is a dangerous thing,” said Alexander Pope in 1711; he could have been speaking of the use of statistics by experts in all fields. In this article, we look at three consequential mistakes in the field of statistics. Two of them are famous, the third required a deep dive into the corporate annual reports of

Puzzle: Surgery or Radiation

Several decades ago, the dominant therapies for lung cancer were radiation, which offered better short-term survival rates, and surgery, which offered better long-term rates. A thought experiment was conducted in which surgeons were randomly assigned to one of two groups and asked whether they would choose surgery. Group 1 was told: The one-month survival rate is 90%. Group 2 was told: There is 10% mortality in the first month. Yes, the two statements say the same thing. What did the two physician groups choose?

Sept 10: Statistics in Practice

This week we look at the second most popular percentage in statistics: 80%. Our course spotlight is on: Oct 30 – Nov 27: Sample Size and Power Determination See you in class! Peter Bruce Founder, Author, and Senior Scientist The Popular 80% Researchers and analysts are…

Type III Error

Type I error in statistical analysis is incorrectly rejecting the null hypothesis - being fooled by random chance into thinking something interesting is happening.  The arcane machinery of statistical inference - significance testing and confidence intervals - was erected to avoid Type I error.  Type II error…

The Popular 80%

Researchers and analysts are familiar with the famous 5% benchmark in statistics, the typical probability threshold at which a result becomes statistically significant.  (The probability in question is the probability that a result as interesting as the real-life result will happen in the null model.) …
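
As a rough illustration of the parenthetical above, the sketch below computes the probability of a result at least as extreme as the observed one under a null model. The coin-flip scenario (20 flips, 16 heads) and the function name are hypothetical and not taken from the article.

```python
from math import comb

def upper_tail_p_value(n_flips, observed_heads, p_heads=0.5):
    """Probability of seeing observed_heads or more in n_flips under the null model."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(observed_heads, n_flips + 1)
    )

# Hypothetical data: 16 heads in 20 flips of a supposedly fair coin.
p_value = upper_tail_p_value(20, 16)

# Compare against the conventional 5% benchmark.
is_significant = p_value < 0.05
print(f"p-value = {p_value:.4f}, significant at 5%: {is_significant}")
```

The null model here is a fair coin; a different null model would need a different calculation.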

Sept 2: Statistics in Practice

This week, our topic is Data Engineering, and we feature a guest blog by Will Goodrum, a data scientist at Elder Research. Our course spotlight is Oct 2 -30: Categorical Data Analysis See you in class! Peter Bruce Founder, Author, and Senior Scientist Four Common…

Four Common Pitfalls in Data Engineering

By Will Goodrum* Your company has made it a strategic priority to become more data-driven. Good! A major anticipated component of this transition is to implement new data technology (e.g., a data lake). Resources are thrown at identifying source systems and pulling information into a…

Relative Risk Ratio and Odds Ratio

The Relative Risk Ratio and Odds Ratio are both used to measure the medical effect of a treatment or variable to which people are exposed. The effect could be beneficial (from a therapy) or harmful (from a hazard).  Risk is the number of those having…
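
A minimal sketch of the two measures, using the standard textbook definitions (risk = events / group size, odds = events / non-events). The counts are hypothetical and the function names are illustrative only.

```python
def risk(events, total):
    """Proportion of the group that experienced the event."""
    return events / total

def odds(events, total):
    """Events divided by non-events."""
    return events / (total - events)

def relative_risk(events_exposed, n_exposed, events_control, n_control):
    return risk(events_exposed, n_exposed) / risk(events_control, n_control)

def odds_ratio(events_exposed, n_exposed, events_control, n_control):
    return odds(events_exposed, n_exposed) / odds(events_control, n_control)

# Hypothetical study: 20 of 100 exposed people and 10 of 100 unexposed people have the outcome.
rr = relative_risk(20, 100, 10, 100)
oratio = odds_ratio(20, 100, 10, 100)
print(f"relative risk = {rr:.2f}, odds ratio = {oratio:.2f}")
```

With these made-up counts the two measures differ noticeably; they tend to be close only when the outcome is rare in both groups.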

Aug 25: Statistics in Practice

Vaccines for Covid are in the news, and this week we focus on the clinical trial process that validates vaccines as safe and effective.  Our spotlight is on our 10-course Biostatistics Certificate Program. You can get started with Jan 3-31:  Biostatistics 1 For Medical Science…

Of Note: An outlier that lies in the middle of the data

An outlier or anomaly is typically defined as a case that is markedly distant or different from the bulk of the data.  Our July 28 blog on outliers and anomaly detection reported on one unusual case in which the outlier might lie fully within the…

Clinical Trial Process that Validates Vaccines as Safe and Effective

As of this writing, there are about 40 Coronavirus vaccines in the clinical trial process, plus another 135 in preclinical development. Russia has jumped the gun and “approved” a vaccine that has just begun Phase 3 trials, and, likewise, China has approved a pre-Phase 3…

Aug 19: Statistics in Practice

Last week we looked at a notable failure of the statistic “AUC”; this week we dive deeper.  Our curriculum spotlight is our 10-course Analytics for Data Science certificate program: compare cost and coverage to ANY Master’s program!   You can get started with  Sep 11 – Oct…

AUC: A Fatally Flawed Model Metric

By John Elder, Founder and Chair of Elder Research, Inc.  Last week, in Recidivism, and the Failure of AUC, we saw how the use of “Area Under the Curve” (AUC) concealed bias against African-American defendants in a model predicting recidivism, that is, which defendants would re-offend. …

Recidivism, and the Failure of AUC

On average, 40% - 50% of convicted criminals in the U.S. go on to commit another crime (“recidivate”) after they are released.  For nearly 20 years, court systems have used statistical and machine learning algorithms to predict the probability of recidivism, and to guide sentencing…

Endpoint or Outcome (example: Covid-19 vaccine)

In a randomized experiment, the endpoint or outcome is a formal measure (statistic) of the result of the experiment.  In a randomized clinical trial preparatory to regulatory submission, there is often more than one outcome, due to the time and expense involved in conducting a…

Aug 4: Statistics in Practice

In this week’s brief, we feature a data-detective story: The Case of the Faulty Generator.  Our spotlight is on our Analytics for Data Science certificate program*. See you in class! *Earn a Bachelor’s Degree in Data Science and Analytics concurrently at Thomas Edison State University.…

Link Function

In generalized linear models, a link function maps a nonlinear relationship to a linear one so that a linear model can be fit (and then mapped to the original form).  For example, in logistic regression, we want to find the probability of success:  P(Y =…
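
For the logistic regression case, the usual link is the logit (log-odds). The sketch below is a small illustration of that mapping and its inverse, not code from the article; the function names are chosen here for clarity.

```python
import math

def logit(p):
    """Link function: map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Inverse link: map a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-x))

# A probability of 0.5 corresponds to log-odds of 0.
print(logit(0.5))

# Round trip: probability -> linear scale -> probability.
print(inverse_logit(logit(0.8)))
```

The linear model is fit on the logit scale, and `inverse_logit` maps its output back to the original probability scale.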

Sira-Kvina Hydro Power – The Case of the Faulty Generator

Prepared by Peter Bruce, Mark Smith and Ramon Perez, this case study was originally published at elderresearch.com.   In early 2020, Sira-Kvina Kraftselskap, a large producer of hydroelectric power in Norway, suffered a breakdown of one of its major generators. Company technicians went through established diagnostics…

Where Outliers are Central

In casual statistical analysis, you sometimes hear references to outliers, along with the suggestion that they should be ignored or dropped from the analysis.  Quite the contrary: often it is the outliers that convey useful information.  They may represent errors in data collection, e.g. a…

July 28: Statistics in Practice

In this week’s brief we discuss outliers and anomalies, the unusual cases and events that often end up being the focus of attention. Our course spotlight is Nov 6 - Dec 4: Anomaly Detection If you’re interested in this topic, you should also consider the…

Small Ball: When a Downgrade is an Upgrade

In this mature age of digital marketing, companies have developed finely honed engines of automated and targeted promotion that factor in individual preferences and behavior.  The idea is to add small increments to revenue and profit. The system evolved in a stable era of economic…

Three Myths in Data Science

Myth 1:  It’s All About Prediction “Who cares whether we understand the model - as long as it predicts well!” This was one of the seeming benefits of the era of big data and predictive modeling, and it set data science apart from traditional statistics.  …

July 21: Statistics in Practice

In this week’s brief, a continuation of our “Statistical Thinking” series, we reflect on three “myths” in data science and statistics, and spotlight our ten-course Social Science Statistics certificate program. You can get started with either of these courses: Aug 7- Sep 4:  Survey Design and…

July 7: Statistics in Practice

As Independence Day inaugurates the official summer political season in the U.S. (a season that, in reality, no longer ends), we discuss in this week’s brief uplift models; our course spotlight is on Aug 21 - Sep 18: Persuasion Analytics and Targeting See you in…

Random Chance or Not?

On July 4, 1826, U.S. Independence Day, John Adams and Thomas Jefferson, the second and third presidents of the U.S., died within hours of each other.  Adams and Jefferson personified opposing factions in U.S. politics, with Adams favoring a strong central government and…

Model Interpretability

Model interpretability refers to the ability of a human to understand and articulate the relationship between a model’s predictors and its outcome.  For linear models, including linear and logistic regression, these relationships are seen directly in the model coefficients.  For black-box models like neural nets,…

Instructor Spotlight: Ken Strasma

Ken Strasma is a pioneer in the field of predictive analytics in high-stakes Presidential campaigns, serving as the National Targeting Director for President Obama’s historic 2008 campaign and for John Kerry’s 2004 presidential campaign. He produced the predictive analytics models used by the campaigns, and helped popularize…

Predicting “Do Not Disturbs”

In his book Predictive Analytics, Eric Siegel tells the story of marketing efforts at Telenor, a Norwegian telecom, to reduce churn (customers leaving for another carrier). Sophisticated analytics were used to guide the campaigns, but the managers gradually discovered that some campaigns were backfiring:  they…

June 30: Statistics in Practice

In this week’s Brief, the second in our series on statistical thinking, we discuss WWII convoys; our course spotlight is  July 10 - Aug 7: Spatial Statistics for GIS Using R  See you in class! Peter Bruce Founder, Author, and Senior Scientist Statistical Thinking 2  Safety…

Polytomous

Polytomous, applied to variables (usually outcome variables), means multi-category (i.e. more than two categories).  Synonym:  multinomial. 

June 23: Statistics in Practice

In this week’s Brief, the first in a Statistical Thinking series, we look at how people think about rare events. Our spotlight is on: July 3 - 31: Introductory Statistics (another session starts July 31) See you in class! Peter Bruce Founder, Author, and Senior…

Student Spotlight: Angelina Salinas

Meet Angelina Salinas, Data Analyst at Almacenes SIMAN Angelina Salinas started working for the retail store Almacenes Siman as a purchasing planner and, a couple of years later, got interested in data science and started to learn R. Shortly afterwards, the business intelligence group at…

Historical Spotlight: Iris Dataset

Can you identify this wildflower, photographed in a Massachusetts field?  And also identify its significance in the history of statistics?  This is the Blue Flag Iris, also called the Versicolor Iris, and it is one of three Iris species that make up the famous (in statistics) Iris…

Rare Event Syndrome

Statistical Thinking 1   Several years ago, an NPR reporter wanted a comment from me for his story about an unusual event: a woman had won a state lottery jackpot for a second time. Winning once was low enough odds, but winning twice?   The reporter found…

June 16: Statistics in Practice

In this week’s brief we feature a guest blog on Ethical Data Science; our course spotlight is: July 17 – Aug 14: Logistic Regression See you in class! Peter Bruce Founder, Author, and Senior Scientist Ethical Data Science As data science has evolved into AI,…

Instructor Spotlight: Joseph Hilbe

Joseph Hilbe, a prolific author in the field of statistical modeling, taught a number of Statistics.com courses right up until his death, in March of 2017.  Hilbe was elected as a Fellow of the American Statistical Association; his expertise was in statistical modeling.  He did…

Ethical Data Science

Guest Blog - Grant Fleming, Data Scientist, Elder Research Progress in data science is largely driven by the ever-improving predictive performance of increasingly complex black-box models. However, these predictive gains have come at the expense of losing the ability to interpret the relationships derived between…

June 12: Statistics in Practice

In this Brief, we visit the issue of “statistical arbitrage” in financial markets, and spotlight two courses: June 12 - July 10:  Financial Risk Modeling (today) July 10 - Aug 7:  Spatial Statistics for GIS Using R See you in class! P.S.  Our newest course,…

Statistical Arbitrage

An economics professor and an engineering professor were walking across campus.  The engineering professor spots something lying in the grass - “Look - here’s a $20 bill!”  The economist doesn’t bother to look.  “It can’t be - somebody would have picked it up.” This old joke…

June 2: Statistics in Practice

Fear of catching Covid-19 dominates the world, so this week we briefly review how humans think about probabilities, in the context of Covid-19.  Prior beliefs figure heavily in probability calculations, so our course spotlight is on:  July 3 - 31:  Introduction to Bayesian Statistics  See you…

Bayesian Statistics

Bayesian statistics provides probability estimates of the true state of the world. An unremarkable statement, you might think - what else would statistics be for? But classical frequentist statistics, strictly speaking, only provide estimates of the state of a hothouse world, estimates that must be translated…

When Probabilities Sum to More than One

In 1998, Craig Fox and Amos Tversky reported on a survey in which U.S. basketball fans were asked to judge the probability that each of 8 teams might win the championship.  Students of statistics can probably guess the outcome - the probabilities for all the…

Student Spotlight: Paul Olszlyn

Meet Paul Olszlyn, Senior Data Scientist at NovoDynamics Paul Olsztyn designs and implements databases at NovoDynamics, a company that creates and deploys large scale data systems for corporations.  As his company responded to customer needs for more predictive analytics by building greater capacity in this…

May 26: Statistics in Practice

This week we return to Coronavirus data to look at new analyses that use mobile phone data to estimate the effects of social distancing restrictions, a vital question now as we see the world falling into “lockdown recession.”  Speaking of economic matters, our course spotlight…

Density

As Covid-19 continues to spread, so will research on its behavior.  Models that rely mainly on time-series data will expand to cover relevant other predictors (covariates), and one such predictor will be gregariousness.  How to measure it?  In psychology there is the standard personality trait…

Tracking Your Wanderings, for the Public Good

A recent development in the modeling of Covid-19 data has been the use of mobile phone location data, now available from Google, to estimate the degree to which social distancing restrictions have been implemented, and the effect they have had.   One interesting analysis comes from…

May 19: Statistics in Practice

This week we take a look at evolutionary algorithms (it was 150 years ago that Charles Darwin first used the term “evolution” in his writings).  Our course spotlight is: July 17 - Aug 14:  Optimization with Linear Programming See you in class! - Peter Bruce…

Instructor Spotlight: Wayne Folta

Wayne Folta is a Lead Data Scientist with Elder Research, a leading data science consulting company and the parent of Statistics.com.  Wayne’s current ongoing project involves the extraction, analysis and redaction of text.  For example, a healthcare organization might need to release records, stripped of…

Parameterized

Parameterized code in computer programs (or visualizations or spreadsheets) is code where the arguments being operated on are defined once as a parameter, at the beginning, so they do not have to be repeatedly explicitly defined in the body of the code.  This allows for…
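
A small sketch of the idea: the values the code operates on are declared once at the top, and the body refers to them by name. The parameter names and values below are hypothetical.

```python
# Parameters: defined once, at the beginning.
THRESHOLD = 0.5       # hypothetical cutoff
LABEL_HIGH = "high"   # hypothetical label for scores at or above the cutoff
LABEL_LOW = "low"     # hypothetical label for scores below the cutoff

def label(score):
    """The body uses the parameters by name instead of repeating literal values."""
    return LABEL_HIGH if score >= THRESHOLD else LABEL_LOW

labels = [label(s) for s in (0.2, 0.5, 0.9)]
print(labels)
```

Changing the cutoff then means editing one line rather than hunting for every place the literal value appears.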

Evolutionary Algorithms

It was 150 years ago when Darwin first used the term “evolution” in his writing (in his book The Descent of Man).  Two months ago, in The Normal Share of Paupers, I briefly discussed the unfortunate eugenics baggage that the discipline of statistics inherited from…

Student Spotlight: Timothy Young

Meet Timothy Young, a Contract Administrator for the County of Los Angeles Timothy recently started the Data Science Analytics Bachelor’s Degree program that Statistics.com offers in conjunction with Thomas Edison State University (TESU) and has already been able to put his learning to work.  At…

May 12: Statistics in Practice

In this Brief, we dive into the terms “sensitivity” and “specificity” and their relatives.  In our course spotlight, clinical trials is the topic.  Now there’s a site just for the 800+ clinical trials associated with Covid-19 (treatments and vaccines).  Is it time for you to…

Sensitivity and Specificity

We defined these terms already (see this blog), but how can you remember which is which, so you don’t have to look them up?  If you can remember the order in which to recite them - sensitivity then specificity - it’s easy.  Think “positive and negative”…

May 5: Statistics in Practice

In this week’s Brief, we look deeper into the question of whether Covid-19 is a senior citizen disease.  Our course spotlight is twofold: Start in May or June:  Mastery in Statistical Modeling (3 courses) June 12 to July 10  Analyzing and Modeling Covid-19 Data See…

COVID-19: Sensitivity, Specificity, and More

Covid-19 has brought statistical concepts and terms into the news as never before. One confusing tangle is the array of terms surrounding diagnostic test results.  The most basic is accuracy - what percent of test results are correct.  This is not necessarily the most important…
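
As a sketch of how these terms relate, the code below computes accuracy, sensitivity and specificity from the four cells of a confusion table, using their standard definitions. The counts are hypothetical and do not come from any real Covid-19 test.

```python
def diagnostic_metrics(true_pos, false_neg, true_neg, false_pos):
    """Return accuracy, sensitivity and specificity from confusion-table counts."""
    total = true_pos + false_neg + true_neg + false_pos
    return {
        # Share of all test results that are correct.
        "accuracy": (true_pos + true_neg) / total,
        # Share of people WITH the condition who test positive.
        "sensitivity": true_pos / (true_pos + false_neg),
        # Share of people WITHOUT the condition who test negative.
        "specificity": true_neg / (true_neg + false_pos),
    }

# Hypothetical results: 100 people with the condition, 1000 without.
metrics = diagnostic_metrics(true_pos=90, false_neg=10, true_neg=950, false_pos=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```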

Decision Stumps

A decision stump is a decision tree with just one decision, leading to two or more leaves. For example, in this decision stump a borrower score of 0.475 or greater leads to a classification of “loan will default” while a borrower score less than 0.475…
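
The example can be sketched as a single comparison. The 0.475 cutoff and the "loan will default" label come from the text; the label on the other branch is an assumption, since the excerpt is cut off before naming it.

```python
CUTOFF = 0.475  # borrower-score cutoff from the example

# Assumption: the excerpt does not show this branch's label; the complementary class is used here.
ASSUMED_OTHER_LABEL = "loan will not default"

def decision_stump(borrower_score):
    """One decision, two leaves."""
    if borrower_score >= CUTOFF:
        return "loan will default"
    return ASSUMED_OTHER_LABEL

print(decision_stump(0.60))
print(decision_stump(0.30))
```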

Miasma

As more information arrives about the Coronavirus, researchers point more and more to airborne particles and aerosols as the mechanism of spread. Photographic images of a sneeze, such as this one from Lydia Bourouiba at MIT (source here), have been seen by many. It turns…

R0 (R-nought)

For infectious diseases, R0 (R-nought) is the unimpeded replication rate of the disease pathogen in a naive (not immune) population.  An R0 of 2 means that each person with the disease infects two others.  Some things to keep in mind:    An R0 of one means…
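
To see what "each person infects two others" implies, here is a deliberately idealized sketch: it assumes every infected person passes the disease to exactly R0 others in the next generation and that the supply of susceptible people never runs out. The function name is illustrative.

```python
def new_cases_by_generation(r0, generations, initial_cases=1):
    """New infections in each generation under unimpeded spread (idealized)."""
    return [initial_cases * r0**g for g in range(generations + 1)]

# With R0 = 2, new cases double every generation.
cases = new_cases_by_generation(r0=2, generations=5)
print(cases)
```

Real epidemics depart from this quickly as immunity builds and behavior changes, which is why R0 refers specifically to a naive population.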

Apr 28: Statistics in Practice

Models of virus growth are in the news, and this week we take a closer look at the modeling of epidemics, and introduce our newest course: June 12 to July 10  Analyzing and Modeling Covid-19 Data We’ll cover analysis of covid data broadly, and focus…

Conversations with Data Scientists about R and Python

Dyed-in-the-wool software developers can get quite passionate about the relative virtues of one programming language or another, their debates sometimes threatening to transport you back to middle-school arguments about the greatest ballplayers of all time.  Though their computer passions find other outlets as well, data…

Apr 21: Statistics in Practice

In this week’s Brief we take a look at Python vs. R, and feature some conversations with data scientists.  Our spotlight is on our introductory statistical programming courses: May 15 - June 12:  Introduction to Python Programming May 15 - June 12:  R Programming Introduction…

Apr 14: Statistics in Practice

In this week’s Brief, we explore what data on the flu can tell us about Covid-19 counter-measures.  Our course spotlight is July 31 - Sept 25:  Biostatistics See you in class! - Peter Bruce Founder, Author, and Senior Scientist Social Distancing and the Flu The…

John Snow

John Snow is popularly regarded as the founder of the field of epidemiology, with his famous study of cholera in London.  Snow plotted cholera cases for a neighborhood served by two wells, and found that nearly all clustered around one of the wells, the Broad…

Apr 7: Statistics in Practice

In this week’s Brief, we look in greater detail at Elder Research, Inc., which recently acquired Statistics.com.  If your organization is like most organizations, your data science initiatives may lack the direction and support they need to succeed - having a data science team does…

Observation and Quote from John Elder, IV

"The hype around Artificial Intelligence, Machine Learning, and Data Science is enormous, so it’s tempting to be skeptical of the return on investment (ROI) claimed. Still, most of the results are real. Organizations may suspect there is value in their data assets but not be…

Elder Research Capabilities

In late December, Statistics.com was acquired by Elder Research, Inc. Many of you have asked for more detail, so here’s an introduction to the folks at Elder Research and some stories of what they do.  There are 100+ employees at Elder Research, led by John…

Apr 2: Statistics in Practice – Special Epi Course

In this special Brief we step back and look at various estimates of the projected death toll from the coronavirus.   Would you like to learn more about the statistical analysis of disease?  We’re offering a special self-paced course to those seeking to improve their knowledge…

Coronavirus Death Toll

There are tens of thousands of epidemiologists the world over, and we are beginning to see a bumper crop of forecasts for the ultimate 2020 death toll from Covid-19.  It’s a grim but important forecasting task. Most citizens would support draconian measures to prevent deaths…

Mar 31: Statistics in Practice

In this week’s Brief, we look at p-values.  Plus, we’ve scheduled a couple of extra course sessions for April:  Use the month of April to introduce yourself to Python, or, for those with some Python familiarity, learn how to apply it to predictive analytics. April…

P-Values – Are They Needed?

Five years ago last month, the psychology journal Basic and Applied Social Psychology instigated a major debate in statistical circles when it said it would remove p-value citations from papers it published.  A year later, the American Statistical Association (ASA) released a statement on p-values…

The Depression Gene

The risks of large-scale testing, and the potential for false discovery, can be seen in the “discovery” of the genetic basis for anxiety and depression.  Specifically, serotonin transporter gene 5-HTTLPR. Color Genomics sells a genetic testing product that supposedly can predict which anti-depressant drug works…

Hazard

In biostatistics, hazard, or the hazard rate, is the instantaneous rate of an event (death, failure…).  It is the probability of the event occurring in a (vanishingly) small period of time, divided by the amount of time (mathematically it is the limit of this quantity…
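
A numerical sketch of that limit, assuming a hypothetical exponential survival curve with a constant event rate of 0.1 per unit time. In the standard definition the probability is conditional on having survived up to time t, which is why the code divides by the survival probability.

```python
import math

RATE = 0.1  # hypothetical constant event rate per unit time

def survival(t):
    """Probability of no event up to time t (exponential model, an assumption)."""
    return math.exp(-RATE * t)

def approx_hazard(t, dt=1e-6):
    """P(event in [t, t + dt] given survival to t), divided by dt."""
    p_event = (survival(t) - survival(t + dt)) / survival(t)
    return p_event / dt

# For an exponential model the hazard is the same at every t and close to RATE.
print(approx_hazard(1.0), approx_hazard(5.0))
```

Shrinking `dt` moves the approximation toward the limit, until floating-point rounding starts to dominate.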

Mar 24: Statistics in Practice

In this week’s Brief, we look again at the statistics of Coronavirus.  We also spotlight our Health Analytics Mastery - a 3-course series in which you can choose from among Biostatistics 1 and 2 Designing Valid Statistical Studies Epidemiologic Statistics * Introduction to Statistical Issues…

Covid-19 Parameters

There are many moving parts in modeling the spread of an epidemic, a subject that has lately attracted the attention of great numbers of statistically-oriented non-epidemiologists (like me).  I’ve put together a “lay statistician’s guide” to some of the important parameters and factors (and I…

Preliminary Paper

Here is a preliminary paper that suggests that RNA extraction kits, one of the main bottlenecks to Covid-19 testing in the US, can be skipped altogether and the next part of the assay (RT-qPCR) still works.  If confirmed, this result would have a major impact…

Mar 18: Statistics in Practice

In this week’s Brief, we look at the coronavirus, and the problem of estimating prevalence and mortality.  Our course spotlight is Nov 8 - Dec 6:  Epidemiologic Statistics (we're adding a spring session - email us to be notified when registration opens at ourcourses@statistics.com) See…

Standardized Death Rate

Often the death rate for a disease is fully known only for a group where the disease has been well studied.  For example, the 3711 passengers on the Diamond Princess cruise ship are, to date, the most fully studied coronavirus population.  All passengers were tested…

Coronavirus – in Search of the Elusive Denominator

Anyone with internet access these days has their eyes on two constellations of data - the spread of the coronavirus, and the resulting collapse of the financial markets.  Following the 13% one-day drop of the stock market a week ago, The Wall Street Journal forecast…

Coronavirus: To Test or Not to Test

In recent years, under the influence of statisticians, the medical profession has dialed back on screening tests.  With relatively rare conditions, widespread testing yields many false positives and doctor visits, whose collective cost can outweigh benefits.  Coronavirus advice follows this line - testing is limited…
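
The false-positive problem for rare conditions follows from Bayes' rule. The sketch below uses hypothetical figures (1% prevalence, 95% sensitivity, 95% specificity) that are not from the article.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of positive test results that are true positives."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical rare condition and a seemingly good test.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"Share of positives that are real: {ppv:.1%}")
```

Under these assumed numbers most positive results are false positives, even though the test is right 95% of the time for any individual.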

Mar 16: Statistics in Practice

In this week’s Brief, we look at combining models.  Our course spotlight is April 17 - May 1:  Maximum Likelihood Estimation (MLE) You’ve probably seen lots of references to MLE in other contexts - this quick 2-week course (only $299) is your chance to study…

Regularized Model

In building statistical and machine learning models, regularization is the addition of penalty terms to predictor coefficients to discourage complex models that would otherwise overfit the data.  An example is ridge regression.

Ensemble Learning

In his book, The Wisdom of Crowds, James Surowiecki recounts how Francis Galton, a prominent statistician from the 19th century, attended an event at a country fair in England where the object was to guess the weight of an ox.   Individual contestants were relatively well…

Mar 9: Statistics in Practice

In this week’s Brief, we look at ways to determine optimal sample size.  Our course spotlight is April 10 - May 8:  Sample Size and Power Determination See you in class! - Peter Bruce Founder, Author, and Senior Scientist Big Sample, Unreliable Result The 1948…

Ridge Regression

Ridge regression is a method of penalizing coefficients in a regression model to force a simpler model (one with smaller coefficients, though not necessarily fewer predictors) than would be produced by an ordinary least squares model. The term “ridge” was applied by Arthur Hoerl in 1970, who saw similarities…
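
A minimal sketch of the penalty's effect, assuming the simplest possible case: one predictor and no intercept, where minimizing the squared error plus `penalty * slope**2` has a closed-form solution. The data and names are hypothetical.

```python
def ridge_slope(xs, ys, penalty):
    """Slope minimizing sum((y - b*x)^2) + penalty * b^2 for a no-intercept model."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

# Hypothetical data, roughly y = 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

ols_slope = ridge_slope(xs, ys, penalty=0.0)       # penalty 0 is ordinary least squares
penalized_slope = ridge_slope(xs, ys, penalty=5.0)  # the penalty pulls the slope toward 0
print(ols_slope, penalized_slope)
```

With several predictors the same idea applies to the whole coefficient vector, and predictors are usually standardized first so the penalty treats them evenly.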

Big Sample, Unreliable Result

Which would you rather have?  A large sample that is biased, or a representative sample that is small?  The American Statistical Association committee that reviewed the 1948 Kinsey report on male sexual behavior, based on interviews with over 5000 men, left no doubt of their…

Mar 2: Statistics in Practice

In this week’s Brief, we look at hierarchical and mixed models.  Our course spotlight is April 10 - May 8:  Generalized Linear Models April 24 - May 22:  Mixed and Hierarchical Linear Models See you in class! - Peter Bruce Founder, Author, and Senior Scientist…

Problem of the Week: Notify or Don’t Notify?

Our problem of the week is an ethical dilemma, posed by the New England Journal of Medicine to its readers 10 days ago.  Volunteers contributed DNA samples to investigators building a genetic database for study, on condition the data would be deidentified and kept confidential…

Factor

The term “factor” has different meanings in statistics that can be confusing because they conflict.   In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical - a factor variable is the same thing as a categorical variable.  These factor variables…

Mixed Models – When to Use

Companies now have a lot of data on their customers at an individual level.  Suppose you are tasked with forecasting customer spending at a grocery chain, and you want to understand how customer attributes, local economic factors, and store issues affect customer spending. You could…

Feb 24: Statistics in Practice

In this week’s Brief, we look at social categories, and the role that statistics and data science have played in social engineering - 100 years ago and today.  Our course spotlight is April 3 - May 1:  Categorical Data Analysis See you in class! -…

The Normal Share of Paupers

In 2009, China began regional pilot programs that extended credit scoring to a broader purpose - scoring a person’s “social credit.”  100 years earlier, at the height of the eugenics craze, the famous statistician Francis Galton undertook to repurpose statistical concepts in service of social…

Purity

In classification, purity measures the extent to which a group of records share the same class.  It is also termed class purity or homogeneity, and sometimes impurity is measured instead.  The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where…
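
The two-class formula as given in the text, p(1-p), can be tried directly. The sketch assumes p is the proportion of records in one of the two classes; note that some references scale the two-class measure by a factor of 2.

```python
def gini_impurity_two_class(p):
    """Two-class Gini impurity in the form used in the text: p * (1 - p)."""
    return p * (1 - p)

# A pure group (all one class) has impurity 0; an even split has the maximum.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, gini_impurity_two_class(p))
```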

Predictor P-Values in Predictive Modeling

Not So Useful Predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value - they measure the probability that a randomly shuffled model could have produced a coefficient as great as the fitted value.  They are of limited…

UpLift and Persuasion

The goal of any direct mail campaign, or other messaging effort, is to persuade somebody to do something.  In the business world, it is usually to buy something. In the political world, it is usually to vote for someone (or, if you think you know…

Feb 17: Statistics in Practice

Last week we looked at several metrics for assessing the performance of classification models - accuracy, receiver operating characteristics (ROC) curves, and lift (gains).  In this week’s Brief we move beyond lift and cover uplift. Our course spotlight again is: Feb 28 - Mar 27:…

ROC, Lift and Gains Curves

There are various metrics for assessing the performance of a classification model.  It matters which one you use. The simplest is accuracy - the proportion of cases correctly classified.  In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric…
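
A short sketch of why accuracy can mislead when the class of interest is rare. The data are hypothetical: 10 cases of interest among 1,000 records, scored by a "model" that always predicts 0.

```python
# Hypothetical data: 1% of records belong to the class of interest ("1").
actual = [1] * 10 + [0] * 990

# A useless model that never predicts the class of interest.
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
ones_found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"accuracy = {accuracy:.2%}, cases of interest found = {ones_found}")
```

The model scores very high on accuracy while finding none of the cases that matter.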

Feb 10: Statistics in Practice

Tomorrow is the New Hampshire political primary in the US, and this week’s Brief looks at the statistical concept of lift.  Our spotlight is on: Feb 28 - Mar 27:   Persuasion Analytics and Targeting See you in class! - Peter Bruce, Founder Lift and…
