Problem of the Week: Simpson’s Paradox – baseball

Question: A baseball team is comparing two of its hitters, Hernandez and Dimock. Hernandez hit .250 in 2017 and .275 in 2018. Dimock did worse in both years – .245 in 2017 and .270 in 2018. Overall, though, Dimock hit better across the two years, .263 versus .258 for Hernandez. How can this be? Answer: …
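One way such numbers can arise is sketched below in Python. The hit and at-bat counts are hypothetical (the question gives only the averages); they are chosen so that each player's yearly and overall averages match the figures quoted above. The point is that an overall average is weighted by at-bats, so a player who had most of his at-bats in the higher-average year can come out ahead overall despite trailing in each year.

```python
# Hypothetical (hits, at-bats) per season; only the averages come from the question.
records = {
    "Hernandez": {2017: (85, 340), 2018: (44, 160)},
    "Dimock": {2017: (49, 200), 2018: (135, 500)},
}


def average(hits, at_bats):
    """Batting average rounded to three decimals."""
    return round(hits / at_bats, 3)


def summarize(seasons):
    """Return (per-year averages, overall average) for one player."""
    yearly = {year: average(h, ab) for year, (h, ab) in seasons.items()}
    total_hits = sum(h for h, _ in seasons.values())
    total_at_bats = sum(ab for _, ab in seasons.values())
    return yearly, average(total_hits, total_at_bats)


summary = {name: summarize(seasons) for name, seasons in records.items()}

for name, (yearly, overall) in summary.items():
    print(name, yearly, "overall:", overall)
```

With these assumed counts, Hernandez took 68% of his at-bats in the lower-average 2017 season, while Dimock took about 71% of his in the higher-average 2018 season.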

Industry Spotlight: Credit Scoring

In the U.S., credit scoring is dominated by three companies – Experian, TransUnion and Equifax, employing roughly 30,000 people. An important player in the scoring methodology is FICO, previously Fair Isaac Corporation, and the scores are typically called “FICO scores.” Credit scoring is the oldest application of predictive modeling, fulfilling a need that has been …

Feature Engineering and Data Prep – Still Needed?

It is a truism of machine learning and predictive analytics that 80% of an analyst’s time is consumed in cleaning and preparing the needed data. I saw an estimate by a Google engineer that 25% of the time was spent just looking for the right data. A big part of this process is human-driven feature …

Problem of the Week: The Value of Bedrooms

Question: You work for an internet real-estate company, building statistical models to predict home price on the basis of square footage, number of bedrooms, number of bathrooms, property type (single family home, townhouse, multiplex), and age. Surprisingly, you find the coefficient for bedrooms is negative, meaning that adding bedrooms decreases value. What might account for …

Industry Spotlight: The IRS is Watching You

The IRS (U.S. Internal Revenue Service) has been using computers to choose tax returns for audit since 1962. Early on, the selection was rule-based, but the IRS turned to statistical modeling in 1969, using the oldest predictive analytics model in the toolbox – discriminant analysis. Discriminant analysis, a linear classification technique, was first proposed by …

Ethical Practice in Data Mining

Prior to the advent of internet-connected devices, the largest source of big data was public interaction on the internet. Social media users, as well as shoppers and searchers on the internet, make an implicit deal with the big companies that provide these services: users can take advantage of powerful search, shopping and social interaction tools …

Handling the Noise – Boost It or Ignore It?

In most statistical modeling or machine learning prediction tasks, there will be cases that can be easily predicted based on their predictor values (signal), as well as cases where predictions are unclear (noise). Two statistical learning methods, boosting and ProfWeight, use those difficult cases in exactly opposite ways – boosting up-weights them, and ProfWeight down-weights …

“Defiant” Supervision

How did the phrase “defiantly recommend”, as in “I defiantly recommend this product,” come into common usage on the internet? The answer is a good look inside the workings of supervised learning. Supervision, generally from humans, is instrumental in much of statistical and machine learning. Google’s precise search algorithms are not public, but the general …

Alaskan Generosity

People in Alaska are extraordinarily generous – that’s what a predictive model showed, when applied to a charitable organization’s donor list. A closer examination revealed a flaw – while the original data was for all 50 states, the model’s training data for Alaska included donors, but excluded non-donors. The reason? The data was 99% non-donors, …

Political Analytics and Microtargeting

The statistics of targeting individual voters with specific messages, as opposed to messaging that went to whole groups, began in the U.S. over a decade ago with the Democrats. Political targeting is now an established business, or at least a discipline within the broader realm of political consulting. By 2016, the Republicans had surged well …

The Statistics of Persuasion

The Art of Persuasion is the title of more than one book in the self-help genre, books that have spawned blogs, podcasts, speaking gigs and more. But the science of persuasion is actually of more interest, because it produces useful rules that can be studied and deployed. Marketers and politicians have long been enthusiastic users …

Job Spotlight: Digital Marketer

A digital marketer handles a variety of tasks in online marketing – managing online advertising and search engine optimization (SEO), implementing tracking systems (e.g. to identify how a person came to a retailer), web development, preparing creatives, implementing tests, and, of course, analytics. There are typically three types of employers: Marketing agencies that contract out …

The False Alarm Conundrum

False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch’s new feature that detects atrial fibrillation (afib). Among people with irregular heartbeats, Apple claims a …
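The arithmetic behind the rare-condition problem can be sketched with Bayes’ rule. The Python below is a minimal illustration; the sensitivity, specificity and prevalence values are hypothetical placeholders, not figures from Apple or from the article.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive test result is a true positive (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: a test that is 98% sensitive and 98% specific,
# applied to a population in which 1% actually have the condition.
ppv = positive_predictive_value(sensitivity=0.98, specificity=0.98, prevalence=0.01)
print(f"Share of alarms that are real: {ppv:.3f}")
```

Under these assumed values, only about a third of the alarms are real, even though the test is right 98% of the time in each group, because the people without the condition vastly outnumber those with it.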

How Google Determines Which Ads you See

A classic machine learning task is to predict something’s class, usually binary – pictures as dogs or cats, insurance claims as fraud or not, etc. Often the goal is not a final classification, but an estimate of the probability of belonging to a class (propensity), so the cases can be ranked. A good example of …

Triage and Artificial Intelligence

Predictim is a service that scans potential babysitters’ social media and other online activity and issues them a score that parents can use to select babysitters. Jeff Chester, the executive director of the Center for Digital Democracy, commented: There’s a mad rush to seize the power of AI to make all kinds of decisions without …

Examples of Bad Forecasting

In a couple of days, the Wall Street Journal will come out with its November survey of economists’ forecasts. It’s a particularly sensitive time, with elections in a few days and President Trump attacking the Federal Reserve for raising interest rates. It’s a good time to recall major forecasting gaffes of the past. In 1987, best-selling …

Be Smarter Than Your Devices: Learn About Big Data

When Apple CEO Tim Cook finally unveiled his company’s new Apple Watch in a widely-publicized rollout earlier this month, most of the press coverage centered on its cost ($349 to start) and whether it would be as popular among consumers as the iPod or iMac. Nitin Indurkhya saw things differently. “I think the most significant …

Big Data and Clinical Trials in Medicine

There was an interesting article a couple of weeks ago in the New York Times magazine section on the role that Big Data can play in treating patients: discovering things that clinical trials are too slow, too expensive, and too blunt to find. The story was about a very particular set of lupus symptoms, …