Blog

Posted on Jun 13, 2019 By: Peter Bruce
You are doing a time series supervised learning prediction task.  You’ve normalized the data, and used several techniques to impute missing values.  After that, you split up the data into training and validation data, fit models to the training data, and assess how well they do on the validation data.  However, when you deploy the model, it is not as accurate with new data as it was with the validation data. What happened? Answer:  If you tried out lots of different models and then pick...
Posted on Jun 10, 2019 By: Peter Bruce
You’ve heard of the 80/20 rule - 80% of the (revenue, trouble, delays, etc.) comes from 20% of the (products, customers, routes, etc.). Forecasting may be the most widely used statistical method in business, and the 80/20 rule applies in this fashion - 80% of the practical analytical value in business forecasting comes from 20% of the methods. Well, perhaps the key methods constitute more than 20% of all forecasting methods, but they are the simplest and most straightforward ones. The...
Posted on Jun 08, 2019 By: Galit Shmueli (guest post)
Galit Shmueli is Professor at National Tsing Hua University, Taiwan, author of Practical Time Series Forecasting, co-author of Data Mining for Business Analytics, and Statistics.com Instructor.    With the recent launch of Amazon Forecast, I can no longer procrastinate writing about forecasting "at scale"!   Quantitative forecasting of time series has been used (and taught) for decades, with applications in many areas of business such as demand forecasting, sales forecasting, and financia...
Posted on May 30, 2019 By: Peter Bruce
Hospitals are a major employer of statisticians and analytics professionals, both in support of clinical research like the retinopathy study described earlier, and to improve hospital operations (outcomes, cost management, etc.). Here are a few quick facts about the hospital industry: US hospital revenue totals over $1 trillion - about 5% of GDP. This is larger than the auto industry and on a par with the banking sector. This revenue is split roughly 50/50 between inpatient and ou...
Posted on May 30, 2019 By: Peter Bruce
Perhaps the most active application of analytics and data mining is healthcare. This week we look at one success story (the use of machine learning to predict diabetic retinopathy), one story of disappointment (the use of genetic testing in a puzzling disease), and a basic dichotomy in statistical analysis. In his famous 1977 book that introduced the idea of exploratory data analysis, John Tukey described two different strands of statistical analysis: Exploration, and Confirmation...
Posted on May 30, 2019 By: Peter Bruce
A baseball team is comparing two of its hitters, Hernandez and Dimock.  Hernandez hit .250 in 2017 and .275 in 2018. Dimock did worse in both years - .245 in 2017 and .270 in 2018.  Overall, though, Dimock hit better across the two years, .263 versus .258 for Hernandez. How can this be? Note that 2018 was a better year than 2017, for both hitters.  This is an example of Simpson’s Paradox, in which Dimock had relatively more at-bats in the better year, and relatively fewer in the worse ye...
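The arithmetic behind this paradox can be sketched in a few lines of Python. The excerpt gives only the batting averages, so the hits and at-bats below are hypothetical numbers chosen to reproduce those averages; they are an illustration, not the players' actual records.

```python
# Hypothetical (hits, at_bats) per season. These counts are assumptions
# picked so that each season's average matches the figures quoted above.
records = {
    "Hernandez": {2017: (85, 340), 2018: (44, 160)},
    "Dimock": {2017: (49, 200), 2018: (135, 500)},
}


def season_average(player, year):
    """Batting average for one player in one season."""
    hits, at_bats = records[player][year]
    return hits / at_bats


def overall_average(player):
    """Batting average pooled over all seasons (total hits / total at-bats)."""
    seasons = list(records[player].values())
    total_hits = sum(hits for hits, _ in seasons)
    total_at_bats = sum(at_bats for _, at_bats in seasons)
    return total_hits / total_at_bats


if __name__ == "__main__":
    for player in records:
        by_year = ", ".join(
            f"{year}: {season_average(player, year):.3f}"
            for year in sorted(records[player])
        )
        print(f"{player}: {by_year}, overall: {overall_average(player):.3f}")
```

With these assumed counts, Hernandez leads in each season, yet Dimock leads overall, because most of Dimock's at-bats fall in 2018 (the better year for both hitters) while most of Hernandez's fall in 2017.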
Posted on May 24, 2019 By: Peter Bruce
Some applications of machine learning and artificial intelligence are recognizably impressive - predicting future hospital readmission of discharged patients, for example, or diagnosing retinopathy.  Others - self-driving cars, for example - seem almost magical. The matching problem, though, is one where your first reaction might be “What’s so hard about that?” For example, to take the application of finding duplicates, if a customer by the name of Elliot Sanderson places an order at a we...
Posted on May 22, 2019 By: Peter Bruce
World War II was a crucible of technological innovation, including advances in statistics. Jacob Wolfowitz, born in 1910, looked at the problem of noisy radio transmissions. Coded radio transmissions were critical elements of military command and control, and they were plagued by the problem of atmospheric or other interference - “noise.” The weaker the transmission and the longer the distance, the more likely it is that the signal will be lost in the noise. When the human...
Posted on May 20, 2019 By: Peter Bruce
It is a truism of machine learning and predictive analytics that 80% of an analyst’s time is consumed in cleaning and preparing the needed data.  I saw an estimate by a Google engineer that 25% of the time was spent just looking for the right data.   A big part of this process is human-driven feature engineering - distilling, transforming and curating the data to identify and extract variables that have predictive power.  A recent paper in Nature by a team of researchers from Google and se...
Posted on May 17, 2019 By: Peter Bruce
Many jobs are centered around risk management.  If you’re looking through job postings, of course, you’ll see lots of jobs whose purpose is to make sure that nothing bad happens - the equivalent of locking the doors and closing the windows.  More interesting from a statistical perspective are the jobs that assume that bad things will happen, and try to optimize and manage exposure. As far as leveraging the power of statistics and analytics, there are two strands, one leveraging skills from...