Blog

Coronavirus – in Search of the Elusive Denominator

Anyone with internet access these days has their eyes on two constellations of data – the spread of the coronavirus, and the resulting collapse of the financial markets. Following the 13% one-day drop of the stock market a week ago, The Wall Street Journal forecast a quarterly GDP drop of as much as 10% – … Continue reading “Coronavirus – in Search of the Elusive Denominator”

Coronavirus: To Test or Not to Test

In recent years, under the influence of statisticians, the medical profession has dialed back on screening tests. With relatively rare conditions, widespread testing yields many false positives and doctor visits, whose collective cost can outweigh benefits. Coronavirus advice follows this line – testing is limited to the truly ill (this is also due to a… Continue reading “Coronavirus: To Test or Not to Test”
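The false-positive arithmetic behind this argument can be sketched with Bayes' rule. The prevalence, sensitivity, and specificity values below are hypothetical, chosen only for illustration; they are not figures for any coronavirus test.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a person who tests positive actually has the condition."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Hypothetical test characteristics, for illustration only.
rare = positive_predictive_value(prevalence=0.01, sensitivity=0.95, specificity=0.95)
common = positive_predictive_value(prevalence=0.30, sensitivity=0.95, specificity=0.95)

print(f"Share of positives that are real at 1% prevalence:  {rare:.3f}")
print(f"Share of positives that are real at 30% prevalence: {common:.3f}")
```

Under these assumed values, only about 16% of positive results are true cases when the condition affects 1% of those tested, compared with about 89% when it affects 30%.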

Regularized Model

In building statistical and machine learning models, regularization is the addition of penalty terms on the size of predictor coefficients to the model's loss function, to discourage complex models that would otherwise overfit the data. An example is ridge regression, which penalizes the sum of squared coefficients.
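A minimal sketch of the ridge penalty, assuming a single centered predictor and no intercept so that the estimate has a simple closed form. The data and the penalty values are hypothetical.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate for one centered predictor with no intercept.

    Minimizes sum((y - b*x)**2) + lam * b**2, which gives
    b = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Hypothetical centered data, for illustration only.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.7]

# lam = 0 is ordinary least squares; larger penalties shrink the slope toward 0.
for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>4}: slope={ridge_slope(x, y, lam):.3f}")
```

With a penalty of zero the result is the ordinary least squares slope; as the penalty grows, the coefficient is pulled toward zero, which is the sense in which regularization discourages complex fits.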

Big Sample, Unreliable Result

Which would you rather have? A large sample that is biased, or a representative sample that is small? The American Statistical Association committee that reviewed the 1948 Kinsey report on male sexual behavior, based on interviews with more than 5,000 men, left no doubt about its preference for the latter. The statisticians – William Cochran, Frederick… Continue reading “Big Sample, Unreliable Result”

Problem of the Week: Notify or Don’t Notify?

Our problem of the week is an ethical dilemma, posed by the New England Journal of Medicine to its readers 10 days ago. Volunteers contributed DNA samples to investigators building a genetic database for study, on condition that the data would be deidentified and kept confidential and that they themselves would not learn the results. Should participants… Continue reading “Problem of the Week: Notify or Don’t Notify?”

Factor

The term “factor” has different meanings in statistics that can be confusing because they conflict. In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical – a factor variable is the same thing as a categorical variable. These factor variables have levels, which are the same thing as categories (a… Continue reading “Factor”

Mixed Models – When to Use

Companies now have a lot of data on their customers at an individual level. Suppose you are tasked with forecasting customer spending at a grocery chain, and you want to understand how customer attributes, local economic factors, and store issues affect customer spending. You could design your study with hierarchical and mixed linear modeling methods… Continue reading “Mixed Models – When to Use”

The Normal Share of Paupers

In 2009, China began regional pilot programs that extended credit scoring to a broader purpose – scoring a person’s “social credit.” 100 years earlier, at the height of the eugenics craze, the famous statistician Francis Galton undertook to repurpose statistical concepts in service of social engineering. The starting point was a social survey of London… Continue reading “The Normal Share of Paupers”

Purity

In classification, purity measures the extent to which a group of records share the same class. It is also termed class purity or homogeneity, and sometimes impurity is measured instead. Gini impurity, for example, is calculated for the two-class case, up to a constant factor, as p(1-p), where p is the proportion of records belonging to class 1. Continue reading “Purity”
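A short sketch of the impurity calculation. The general multi-class definition of Gini impurity is 1 minus the sum of squared class proportions; for two classes this works out to 2p(1-p), which is twice the p(1-p) form quoted above, so the two rank groups identically. The function names and the example labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """General Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def two_class_measure(p):
    """The two-class form quoted in the text: p * (1 - p)."""
    return p * (1 - p)

# Hypothetical groups of records, for illustration only.
pure = ["a"] * 10                 # every record in the same class
mixed = ["a"] * 5 + ["b"] * 5     # evenly split between two classes

print(gini_impurity(pure))        # a perfectly pure group has impurity 0
print(gini_impurity(mixed))       # an even split is the most impure two-class case
print(two_class_measure(0.5))     # half of the general value at p = 0.5
```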

Predictor P-Values in Predictive Modeling

Not So Useful: Predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value – they measure the probability that a model fit to randomly shuffled data could have produced a coefficient as great as the fitted value. They are of limited utility in predictive modeling applications for various reasons: Software typically… Continue reading “Predictor P-Values in Predictive Modeling”
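The shuffling interpretation of a p-value given above can be sketched as a permutation test on a simple regression slope. This is a resampling reading of that definition, not the t-based p-value that regression software usually reports; the data, function names, seeds, and number of shuffles are all hypothetical.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def permutation_p_value(x, y, n_perm=1000, seed=0):
    """One-sided: share of shuffles whose slope is at least the fitted slope."""
    rng = random.Random(seed)
    observed = slope(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between x and y
        if slope(x, shuffled) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical data: one outcome that depends on x, one that is pure noise.
data_rng = random.Random(42)
x = [float(i) for i in range(20)]
y_related = [2.0 * xi + data_rng.gauss(0, 1) for xi in x]
y_noise = [data_rng.gauss(0, 1) for _ in x]

p_related = permutation_p_value(x, y_related)
p_noise = permutation_p_value(x, y_noise)
print(f"p-value, outcome related to x: {p_related:.3f}")
print(f"p-value, outcome unrelated to x: {p_noise:.3f}")
```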