Problem of the Week: The Value of Bedrooms

Question: You work for an internet real-estate company, building statistical models to predict home price on the basis of square footage, number of bedrooms, number of bathrooms, property type (single family home, townhouse, multiplex), and age. Surprisingly, you find the coefficient for bedrooms is negative,…

Comments Off on Problem of the Week: The Value of Bedrooms

Statistically Significant – But Not True

If you are looking for the Feature Engineering blog post, you can find it here: In 2015, at an Alzheimer's conference, Biogen researchers presented dramatic brain scans showing that the antibody aducanumab effectively cleared out plaque in the brain, plaque that was associated with…

Comments Off on Statistically Significant – But Not True

Book Review: Everyone Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We REALLY Are

This week's book review is of Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, Seth Stephens-Davidowitz's fascinating book about how social media data reveals all sorts of things about us that we barely know ourselves.…

Comments Off on Book Review: Everyone Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We REALLY Are

Industry Spotlight – Precision Agriculture

The application of analytics to agriculture has given rise to what is called "precision agriculture", a science that seeks to take advantage of and use detailed information that is local in time and place. Tractors and farm equipment are being equipped with sensors and software…

Comments Off on Industry Spotlight – Precision Agriculture

Historical Spotlight: Ronald A. Fisher

In 1919, Ronald A. Fisher was appointed as chief statistician at the agricultural research station in Rothamsted, a post created for him. His work there resulted, in 1925, in the publication of his classic Statistical Methods for Research Workers. An important message of his book…

Comments Off on Historical Spotlight: Ronald A. Fisher

Instructor Spotlight: Prof. David Unwin

Prof. David Unwin has guided, developed and taught the spatial analysis curriculum at since 2005. David lives in central England, about an hour north of the storied Rothamsted agricultural research center. Until his retirement in 2002, he was Professor of Geography at Birkbeck College,…

Comments Off on Instructor Spotlight: Prof. David Unwin

Statistics in Agriculture: Encycloweedia

Weeds are big business - the global herbicide market is over $35 billion annually. Weeds are also big government (think "invasive species"). California's listing of weeds is called Encycloweedia, and the state publishes a quarterly newsletter called Noxious Times. Colorado publishes a similar periodical, Invader.…

Comments Off on Statistics in Agriculture: Encycloweedia


A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor). 

Comments Off on Tensor

Problem of the Week: Missing Data

Question: You have a supervised learning task with 30 predictors, in which 5% of the observations are missing.  The missing data are randomly distributed across variables and records. If your strategy for coping with missing data is to drop records with missing data, what proportion…

Comments Off on Problem of the Week: Missing Data

Student Spotlight: Barry Eggleston

Barry Eggleston is a health research statistician who has worked on both clinical trials and observational studies, and is currently with RTI in North Carolina. In his early career, his work was solely designing and analyzing clinical trials using typical biostatistics methods ranging from t-test…

Comments Off on Student Spotlight: Barry Eggleston

A Deep Dive into Deep Learning

On Wednesday, March 27, the 2018 Turing Award in computing was given to Yoshua Bengio, Geoffrey Hinton and Yann LeCun for their work on deep learning. Deep learning by complex neural networks lies behind the applications that are finally bringing artificial intelligence out of the…

Comments Off on A Deep Dive into Deep Learning

Industry Spotlight: Credit Scoring

In the U.S., credit scoring is dominated by three companies - Experian, TransUnion and Equifax, employing roughly 30,000 people. An important player in the scoring methodology is FICO, previously Fair Isaac Corporation, and the scores are typically called "FICO scores." Credit scoring is the oldest…

Comments Off on Industry Spotlight: Credit Scoring

Book Review: Weapons of Math Destruction

Cathy O'Neil's Weapons of Math Destruction, when it was first published in 2016, sounded an early alarm about the big data algorithms and their potential for social evil. The cover is adorned with a robotic death's head and the subtitle reads "How Big Data Increases…

Comments Off on Book Review: Weapons of Math Destruction

Historical Spotlight: Alan Turing

80 years ago, in 1939, Alan Turing began work on the code-breaking system that would eventually prove key in helping Britain survive the German submarine threat in the Atlantic. Last month, the Turing Award in computer science prize (sometimes referred to as the "Nobel Prize…

Comments Off on Historical Spotlight: Alan Turing

Confusing Terms in Data Science – A Look at Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing  synonyms, like these:

Comments Off on Confusing Terms in Data Science – A Look at Synonyms

Confusing Terms in Data Science – A Look at Homonyms and more

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these: 



Comments Off on Confusing Terms in Data Science – A Look at Homonyms and more

Confusing Terms in Data Science – A Look at Synonyms, Homonyms and more

To a statistician, a sample is a collection of observations (cases). To a machine learner, it's a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms and synonyms, like these: Homonyms (words with multiple meanings): Bias: To…

Comments Off on Confusing Terms in Data Science – A Look at Synonyms, Homonyms and more

Industry Spotlight: Package Delivery

Nothing better illustrates the encroachment of data science and analytics on the older "economy of tangible things" than the business of delivering packages. The use of analytics in package delivery is not new. Companies like UPS and Fedex are longtime users of operations research methods…

Comments Off on Industry Spotlight: Package Delivery

Job Spotlight: Sports Statistician

The field of sports statistician is not exactly new; the American Statistical Association's section on Sports Statistics was formed in 1992. Three of's instructors have professional experience in sports statistics - Ben Baumer (SQL) served as statistician for the NY Mets, Stephanie Kovalchik (Meta…

Comments Off on Job Spotlight: Sports Statistician

Industry Spotlight: Baseball – Opening Day & Statistics in Sports

The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Sabermetric Society in 1971 (the same year that Satchel Paige entered the Hall of Fame). Analytics has come a long way in…

Comments Off on Industry Spotlight: Baseball – Opening Day & Statistics in Sports

Jaquard’s coefficient

When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records.  One of them is the "yacht owner" problem.

Comments Off on Jaquard’s coefficient

Darwin’s Legacy in Statistics

Charles Darwin, the most famous grandson of the Enlightenment thinker Erasmus Darwin, published his ground-breaking theory of evolution, “The Origin of Species,”160 years ago. Another grandson of Erasmus, Francis Galton, became one of the founding fathers of statistics (correlation, the “wisdom of the crowd,” regression…

Comments Off on Darwin’s Legacy in Statistics

Industry Spotlight: Customer Segmentation

Are you "young and rustic?" Or perhaps a "toolbelt traditionalist?" These are nicknames given to customer segments identified by market research firm Claritas, with its statistical clustering tool. Long before the advent of individualized product recommendations, business sought to segment customers into distinct groups on…

Comments Off on Industry Spotlight: Customer Segmentation

Industry Spotlight: CROs

CRO's, or contract research organizations, are a $40 billion industry, growing at close to 12% per year. They provide contract services to the pharmaceutical industry, including statistical design and analysis, laboratory services, administration of clinical trials, and monitoring of drugs once they are on the…

Comments Off on Industry Spotlight: CROs

Handling the Noise – Boost It or Ignore It?

In most statistical modeling or machine learning prediction tasks, there will be cases that can be easily predicted based on their predictor values (signal), as well as cases where predictions are unclear (noise). Two statistical learning methods, boosting and ProfWeight, use those difficult cases in…

Comments Off on Handling the Noise – Boost It or Ignore It?

Rectangular data

Rectangular data are the staple of statistical and machine learning models.  Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.

Comments Off on Rectangular data

Industry Spotlight: Consulting

When a new technology arrives, consulting companies can quickly add staff and expertise to build institutional capacity centered around the technology in ways companies focused on delivering their own products and services cannot. Large consulting companies like Booz Allen and McKinsey, as well as smaller…

Comments Off on Industry Spotlight: Consulting

Good to Great

In 1994, Jim Collins and Jerry Porras, former and current Stanford professors, published the best-seller Built to Last that described how "long-term sustained performance can be engineered into the DNA of an enterprise."  It sold over a million copies. Buoyed by that success, Collins and a…

Comments Off on Good to Great

Instructor Spotlight: Cliff Ragsdale

Cliff T. Ragsdale teaches several courses for the Institute in the area of operations research, based on his best selling text "Spreadsheet Modeling and Decision Analysis." One of Cliff's special talents is making his subject, which can be quite challenging technically, widely accessible. His courses…

Comments Off on Instructor Spotlight: Cliff Ragsdale

Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample.  It can occur in numerous situations, here are just a few:

Comments Off on Selection Bias

Space Shuttle Explosion

In 1986, the U.S. space shuttle Challenger exploded several minutes after launch. A later investigation found that the cause of the disaster was O-ring failure, due to cold temperatures. The temperature at launch was 39 degrees, colder than any prior launch. The cold caused the…

Comments Off on Space Shuttle Explosion

Alaskan Generosity

People in Alaska are extraordinarily generous - that's what a predictive model showed, when applied to a charitable organization's donor list. A closer examination revealed a flaw - while the original data was for all 50 states, the model's training data for Alaska included donors,…

Comments Off on Alaskan Generosity

Why Analytics Projects Fail – 5 Reasons

With the news full of so many successes in the fields of analytics, machine learning and artificial intelligence, it is easy to lose sight of the high failure rate of analytics projects. McKinsey just came out with a report that only 8% of big companies…

Comments Off on Why Analytics Projects Fail – 5 Reasons

Historical Spotlight – ISOQOL

25 years ago the International Society of Quality of Life Research was founded with a mission to advance the science of quality of life and related patient-centered outcomes in health research, care and policy. While focusing on quality of life (QOL) in healthcare may seem…

Comments Off on Historical Spotlight – ISOQOL

Likert Scale

A "likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale.  For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly."  Two key decisions the survey designer faces are

  • How many gradients to allow, and

  • Whether to include a neutral midpoint

Comments Off on Likert Scale

Football Analytics

Preparing for the Superbowl Your team is at midfield, you have the ball, it's 4th down with 2 yards to go. Should you go for it? (Apologies in advance to our many readers, especially those outside the U.S., who are not aficionados of American football,…

Comments Off on Football Analytics

Job Spotlight: Digital Marketer

A digital marketer handles a variety of tasks in online marketing - managing online advertising and search engine optimization (SEO), implementing tracking systems (e.g. to identify how a person came to a retailer), web development, preparing creatives, implementing tests, and, of course, analytics. There are…

Comments Off on Job Spotlight: Digital Marketer

Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:

Comments Off on Dummy Variable

Things are Getting Better

In the visualization below, which line do you think represents the UN's forecast for the number of children in the world in the year 2100? Hans Rosling, in his book Factfulness, presents this chart and notes that in a sample of Norwegian teachers, only 9%…

Comments Off on Things are Getting Better

Artificial Lawyers

Can statistical and machine learning methods replace lawyers? A host of entrepreneurs think so, and do the folks who run Text mining and predictive model products are available now to predict case staffing requirements and perform automated document discovery, and natural language algorithms conduct…

Comments Off on Artificial Lawyers

Entity Resolution and Identifying Bad Guys

Earlier, we described how Jen Golbeck (who teaches Network Analysis at analyzed Facebook connections to identify fake accounts (the account holders friends all had the same number of friends, which is highly improbable statistically). Network analysis and studying connections lie at the heart of…

Comments Off on Entity Resolution and Identifying Bad Guys

Work and Heat

If you are working on New Year's Eve or New Year's Day, odds are it is from home, where you can (usually) control the temperature in the home. Which, from the standpoint of productivity, is a good thing. According to a study from Cornell, raising…

Comments Off on Work and Heat


Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb.  With a pretense of being an individual selling a car on his or her own, and with no fixed…

Comments Off on Curbstoning

Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2.  However, …

Comments Off on Snowball Sampling

The False Alarm Conundrum

False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch's new feature that detects atrial…

Comments Off on The False Alarm Conundrum

Conditional Probability Word of the Week

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?

Comments Off on Conditional Probability Word of the Week

Instructor Spotlight – David Kleinbaum

David Kleinbaum developed several courses for, including Survival Analysis, Epidemiologic Statistics, and Designing Valid Statistical Studies. David retired a little over a year ago from Emory University, where he was a popular and effective teacher with the ability to distill and explain difficult statistical…

Comments Off on Instructor Spotlight – David Kleinbaum

Book Review: Active-Epi

ActivEpi Web, by David Kleinbaum, is the text used in two courses (Epidemiology Statistics and Designing Valid Studies), but it is really a rich multimedia web-based presentation of epidemiological statistics, serving the role of a unique textbook format for an introductory course in the…

Comments Off on Book Review: Active-Epi


Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."

Comments Off on Churn

Survival Analysis

Convinced that he, like his father, would die in his 40's, Winston Churchill lived his early life in a frenetic hurry. He had participated in four wars on three continents by his mid-20's, served in multiple ministerial positions by his 30's, and published 12 books…

Comments Off on Survival Analysis

How Google Determines Which Ads you See

A classic machine learning task is to predict something's class, usually binary - pictures as dogs or cats, insurance claims as fraud or not, etc. Often the goal is not a final classification, but an estimate of the probability of belonging to a class (propensity),…

Comments Off on How Google Determines Which Ads you See

Job Spotlight: Data Scientist

Data science is one of a host of similar terms. Artificial intelligence has been around since the 1960's and data mining for at least a couple of decades. Machine learning came out of the computer science community, and analytics, data analytics, and predictive analytics came…

Comments Off on Job Spotlight: Data Scientist

ROC Curve

The Receiver Operating Characteristics (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s.  For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:


Comments Off on ROC Curve

Deming’s Funnel Problem

W. Edwards Deming's funnel problem is one of statistics' greatest hits. Deming was a noted statistician who took the statistical process control methods of Shewhart and expanded them into a holistic approach to manufacturing quality. Initially, his ideas were cooly received in the US and…

Comments Off on Deming’s Funnel Problem

Industry Spotlight: the Auto Industry

The auto industry serves as a perfect exemplar of three key eras of statistics and data science in service of industry: Total Quality Management (TQM) First in Japan, and later in the U.S., the auto industry became an enthusiastic adherent to the Total Quality Management…

Comments Off on Industry Spotlight: the Auto Industry

Analytics Professionals – Must They Be Good Communicators?

Most job ads in the technical arena list communication among the sought-after skills; it consistently outranks many programming and analytical skills. Is it for real, or is it just thrown in there by the HR Department on general principle? The founder of a leading analytics…

Comments Off on Analytics Professionals – Must They Be Good Communicators?

Prospective vs. Retrospective

A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…

Comments Off on Prospective vs. Retrospective

The Evolution of Clinical Trials

Boiling oil versus egg yolks One early clinical trial was accidental. In the 16th century, a common treatment for wounded soldiers was to pour boiling oil on their wounds. In 1537, the surgeon Ambroise Pare, attending French soldiers, ran out of oil one evening. He…

Comments Off on The Evolution of Clinical Trials

GE Regresses to the Mean

Thirty years ago, GE became the brightest star in the firmament of statistical ideas in business when it adopted Six Sigma methods of quality improvement. Those methods had been introduced by Motorola, but Jack Welch's embrace of the same methods at GE, a diverse manufacturing…

Comments Off on GE Regresses to the Mean

Examples of Bad Forecasting

In a couple of days, theWall Street Journalwill come out with its November survey of economists' forecasts. It's a particularly sensitive time, with elections in a few days and President Trump attacking the Federal Reserve for for raising interest rates. It's a good time to…

Comments Off on Examples of Bad Forecasting

Historical Spotlight: Risk Simulation – Since 1946

Simulation - a Venerable History One of the most consequential and valuable analytical tools in business is simulation, which helps us make decisions in the face of uncertainty, such as these:   An airline knows on average, what proportion of ticketed passengers show up for a…

Comments Off on Historical Spotlight: Risk Simulation – Since 1946

“out-of-bag,” as in “out-of-bag error”

"Bag" refers to "bootstrap aggregating," repeatedly drawing of bootstrap samples from a dataset and aggregating the results of statistical models applied to the bootstrap samples. (A bootstrap sample is a resample drawn with replacement.)

Comments Off on “out-of-bag,” as in “out-of-bag error”


I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.

Comments Off on BOOTSTRAP

Same thing, different terms..

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.


Comments Off on Same thing, different terms..

100 years of variance

It is 100 years since R A Fischer introduced the concept of "variance" (in his 1918 paper "The Correlation Between Relatives on the Supposition of Mendelian Inheritance"). There is much that statistics has given us in the century that followed. Randomized clinical trials, and the means…

Comments Off on 100 years of variance

Early Data Scientists

Casting back long before the advent of Deep Learning for the "founding fathers" of data science, at first glance you would rule out antecedents who long predate the computer and data revolutions of the last quarter century. But some consider John Tukey (right), the Princeton statistician…

Comments Off on Early Data Scientists

Python for Analytics

Python started out as a general purpose language when it was created in 1991 by Guido van Rossum. It was embraced early on by Google founders Sergei Brin and Larry Page ("Python where we can, C++ where we must" was reputedly their mantra). In 2006,…

Comments Off on Python for Analytics

Course Spotlight: Deep Learning

Deep learning is essentially "neural networks on steroids" and it lies at the core of the most intriguing and powerful applications of artificial intelligence. Facial recognition (which you encounter daily in Facebook and other social media) harnesses many levels of data science tools, including algorithms…

Comments Off on Course Spotlight: Deep Learning

Course Spotlight: Structural Equation Modelling (SEM)

SEM stands for "structural equation modeling," and we are fortunate to have Prof. Randall Schumacker teaching this subject at Randy created the Structural Equation Modeling (SEM) journal in 1994 and the Structural Equation Modeling Special Interest Group (SIG) at the American Educational Research Association…

Comments Off on Course Spotlight: Structural Equation Modelling (SEM)

Benford’s Law Applies to Online Social Networks

Fake social media accounts and Russian meddling in US elections have been in the news lately, with Mark Zuckerberg (Facebook founder) testifying this week before the US Congress. Dr. Jen Golbeck, who teaches Network Analysis at, published an ingenious way to determine whether a…

Comments Off on Benford’s Law Applies to Online Social Networks

The Real Facebook Controversy

Cambridge Analytica's wholesale scraping of Facebook user data is big news now, and people are shocked that personal data is being shared and traded on a massive scale on the internet. But the real issue with social media is not harming to individual users whose…

Comments Off on The Real Facebook Controversy

Masters Programs versus an Online Certificate in Data Science from

We just attended the analytics conference of INFORMS' (The Institute for Operations Research and the Management Sciences) this week in Baltimore, and they held a special meeting for directors of academic analytics programs to better align what universities are producing with what industry is seeking.…

Comments Off on Masters Programs versus an Online Certificate in Data Science from

Course Spotlight: Likert scale assessment surveys

Do you work with multiple choice tests, or Likert scale assessment surveys? Rasch methods help you construct linear measures from these forms of scored observations and analyze the results from such surveys and tests. "Practical Rasch Measurement - Core Topics" In this course, you will…

Comments Off on Course Spotlight: Likert scale assessment surveys

Course Spotlight: Customer Analytics in R

"The customer is always right" was the motto Selfridge's department store coined in 1909. "We'll tell the customer what they want" was Madison Avenue's mantra starting in the 1950's. Now data scientists like Karolis Urbonas help companies like Amazon (where he works in Europe as…

Comments Off on Course Spotlight: Customer Analytics in R

Course Spotlight: Spatial Statistics Using R

Have you ever needed to analyze data with a spatial component? Geographic clusters of disease, crimes, animals, plants, events?Or describing the spatial variation of something, and perhaps correlating it with some other predictor? Assessing whether the geographic distribution of something departs from randomness? Location data…

Comments Off on Course Spotlight: Spatial Statistics Using R

“Money and Brains” and “Furs and Station Wagons”

"Money and Brains" and "Furs and Station Wagons" were evocative customer shorthands that the marketing company Claritas came up with over a half century ago. These names, which facilitated the work of marketers and sales people, were shorthand descriptions of segments of customers identified through…

Comments Off on “Money and Brains” and “Furs and Station Wagons”

Course Spotlight: Text Mining

The term text mining is sometimes used in two different meanings in computational statistics: Using predictive modeling to label many documents (e.g. legal docs might be "relevant" or "not relevant") - this is what we call text mining. Using grammar and syntax to parse the…

Comments Off on Course Spotlight: Text Mining


Benford's law describes an expected distribution of the first digit in many naturally-occurring datasets.

Comments Off on BENFORD’S LAW


Contingency tables are tables of counts of events or things, cross-tabulated by row and column.



Hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, to refer to parameters of the prior distribution.



Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

Comments Off on SAMPLE



The easiest way to think of a spline is to first think of linear regression - a single linear relationship between an outcome variable and various predictor variables. 

Comments Off on SPLINE