Matching Algorithms

Some applications of machine learning and artificial intelligence are recognizably impressive - predicting future hospital readmission of discharged patients, for example, or diagnosing retinopathy. Others - self-driving cars, for example - seem almost magical. The matching problem, though, is one where your first reaction might…

Instructor Spotlight: Cliff Ragsdale

Cliff T. Ragsdale teaches several courses for the Institute in the area of operations research, based on his best-selling text “Spreadsheet Modeling and Decision Analysis.” One of Cliff’s special talents is making his subject, which can be quite challenging technically, widely accessible. His courses do…

Industry Spotlight: Consulting

When a new technology arrives, consulting companies can quickly add staff and expertise to build institutional capacity centered around the technology in ways companies focused on delivering their own products and services cannot.  Large consulting companies like Booz Allen and McKinsey, as well as smaller…

Industry Spotlight: Baseball (Sports) Statistics

The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Sabermetric Society in 1971 (the same year that Satchel Paige entered the Hall of Fame).  Analytics has come a long way in…

Industry Spotlight: Agriculture

Weeds are big business - the global herbicide market is over $35 billion annually.  Weeds are also big government (think “invasive species”). California’s listing of weeds is called Encycloweedia, and the state publishes a quarterly newsletter called Noxious Times. Colorado publishes a similar periodical, Invader.…

Industry Spotlight: Precision Agriculture

The application of analytics to agriculture has given rise to what is called “precision agriculture,” a science that seeks to take advantage of and use detailed information that is local in time and place.  Tractors and farm equipment are being equipped with sensors and software…

Job Spotlight: Risk Analyst

Many jobs are centered around risk management.  If you’re looking through job postings, of course, you’ll see lots of jobs whose purpose is to make sure that nothing bad happens - the equivalent of locking the doors and closing the windows.  More interesting from a…

Job Spotlight: Data Scientist

Data science is one of a host of similar terms.  “Artificial intelligence” has been around since the 1960’s and “data mining” for at least a couple of decades.  “Machine learning” came out of the computer science community, and “analytics,” “data analytics,” and “predictive analytics” came…

Course Spotlight: Survival Analysis

Convinced that he, like his father, would die in his 40’s, Winston Churchill lived his early life in a frenetic hurry.  He had participated in four wars on three continents by his mid-20’s, served in multiple ministerial positions by his 30’s, and published 12 books…

"When I started teaching mandatory biostatistics classes in 1970 at UNC, I realized early on that a lot of kids didn't want to take a course they perceived as boring, so I kept things relaxed and fun."
Instructor Spotlight: David Kleinbaum

Industry Spotlight: Military Operations

Abraham Wald, a persecuted Jewish mathematician who fled Austria just before World War II, led an analysis of allied bombers returning from missions.  Hitherto, the Air Force had focused on reinforcing areas that showed the most damage on return. Wald convinced them instead to focus…

Likert scale assessment surveys

Do you work with multiple-choice tests, or Likert scale assessment surveys? Rasch methods help you construct linear measures from these forms of scored observations and analyze the results from such surveys and tests. "Practical Rasch Measurement - Core Topics": In this course, you will learn practical…

Historical Spotlight: Jacob Wolfowitz

World War II was a crucible of technological innovation, including advances in statistics. Jacob Wolfowitz, born in 1910, looked at the problem of noisy radio transmissions. Coded radio transmissions were critical elements of military command and control, and they were plagued by the…

Certificate Graduate: Cristobal Bazan, United Nations Agency

Certificate Student Profile of Cristobal Bazan: "My courses help me look at more complex problems using different approaches to show more interesting aspects of conditions, beyond just tables and charts, more than just sampling or descriptive statistics." (Cristobal Bazan, United Nations Agency) How do you…

Problem of the Week: The Value of Bedrooms

Question: You work for an internet real-estate company, building statistical models to predict home price on the basis of square footage, number of bedrooms, number of bathrooms, property type (single family home, townhouse, multiplex), and age. Surprisingly, you find the coefficient for bedrooms is negative,…

Statistically Significant – But Not True

If you are looking for the Feature Engineering blog post, you can find it here: https://www.statistics.com/blog/1/1558369154-feature-engineering-data-prep-still-needed/

In 2015, at an Alzheimer's conference, Biogen researchers presented dramatic brain scans showing that the antibody aducanumab effectively cleared out plaque in the brain, plaque that was associated with…

Book Review: Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We REALLY Are

This week's book review is of Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, Seth Stephens-Davidowitz's fascinating book about how social media data reveals all sorts of things about us that we barely know ourselves. …

Historical Spotlight: Ronald A. Fisher

In 1919, Ronald A. Fisher was appointed as chief statistician at the agricultural research station in Rothamsted, a post created for him. His work there resulted, in 1925, in the publication of his classic Statistical Methods for Research Workers. An important message of his book…

Instructor Spotlight: Prof. David Unwin

Prof. David Unwin has guided, developed and taught the spatial analysis curriculum at Statistics.com since 2005. David lives in central England, about an hour north of the storied Rothamsted agricultural research center. Until his retirement in 2002, he was Professor of Geography at Birkbeck College,…

Statistics in Agriculture: Encycloweedia

Weeds are big business - the global herbicide market is over $35 billion annually. Weeds are also big government (think "invasive species"). California's listing of weeds is called Encycloweedia, and the state publishes a quarterly newsletter called Noxious Times. Colorado publishes a similar periodical, Invader.…

Tensor

A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor). 
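
As a rough illustration of that progression, the sketch below uses nested Python lists as stand-ins for arrays. The helper names `rank` and `shape` are hypothetical, chosen only for this example, and the code assumes non-empty, regularly nested lists.

```python
def rank(x):
    """Nesting depth: 0 for a scalar, 1 for a vector, 2 for a matrix, 3+ for a tensor."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]
    return depth


def shape(x):
    """Length along each dimension, outermost first."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


scalar = 5.0
vector = [1.0, 2.0, 3.0]
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
tensor = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # 2 x 3 x 4

for name, obj in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, "rank:", rank(obj), "shape:", shape(obj))
```

A color image is a familiar case of a three-dimensional tensor: height by width by color channel.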

Problem of the Week: Missing Data

Question: You have a supervised learning task with 30 predictors, in which 5% of the observations are missing.  The missing data are randomly distributed across variables and records. If your strategy for coping with missing data is to drop records with missing data, what proportion…

Student Spotlight: Barry Eggleston

Barry Eggleston is a health research statistician who has worked on both clinical trials and observational studies, and is currently with RTI in North Carolina. In his early career, his work was solely designing and analyzing clinical trials using typical biostatistics methods ranging from t-test…

A Deep Dive into Deep Learning

On Wednesday, March 27, the 2018 Turing Award in computing was given to Yoshua Bengio, Geoffrey Hinton and Yann LeCun for their work on deep learning. Deep learning by complex neural networks lies behind the applications that are finally bringing artificial intelligence out of the…

Industry Spotlight: Credit Scoring

In the U.S., credit scoring is dominated by three companies - Experian, TransUnion and Equifax, employing roughly 30,000 people. An important player in the scoring methodology is FICO, previously Fair Isaac Corporation, and the scores are typically called "FICO scores." Credit scoring is the oldest…

Book Review: Weapons of Math Destruction

Cathy O'Neil's Weapons of Math Destruction, when it was first published in 2016, sounded an early alarm about the big data algorithms and their potential for social evil. The cover is adorned with a robotic death's head and the subtitle reads "How Big Data Increases…

Historical Spotlight: Alan Turing

80 years ago, in 1939, Alan Turing began work on the code-breaking system that would eventually prove key in helping Britain survive the German submarine threat in the Atlantic. Last month, the Turing Award in computer science (sometimes referred to as the "Nobel Prize…

Confusing Terms in Data Science – A Look at Synonyms

To a statistician, a sample is a collection of observations (cases).  To a machine learner, it’s a single observation.  Modern data science has its origin in several different fields, which leads to potentially confusing  synonyms, like these:

Confusing Terms in Data Science – A Look at Homonyms and more

To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these:

Confusing Terms in Data Science – A Look at Synonyms, Homonyms and more

To a statistician, a sample is a collection of observations (cases). To a machine learner, it's a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms and synonyms, like these: Homonyms (words with multiple meanings): Bias: To…

Industry Spotlight: Package Delivery

Nothing better illustrates the encroachment of data science and analytics on the older "economy of tangible things" than the business of delivering packages. The use of analytics in package delivery is not new. Companies like UPS and Fedex are longtime users of operations research methods…

Job Spotlight: Sports Statistician

The field of sports statistics is not exactly new; the American Statistical Association's section on Sports Statistics was formed in 1992. Three of Statistics.com's instructors have professional experience in sports statistics - Ben Baumer (SQL) served as statistician for the NY Mets, Stephanie Kovalchik (Meta…

Industry Spotlight: Baseball – Opening Day & Statistics in Sports

The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Sabermetric Society in 1971 (the same year that Satchel Paige entered the Hall of Fame). Analytics has come a long way in…

Jaccard’s coefficient

When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records.  One of them is the "yacht owner" problem.
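
One common reading of the "yacht owner" problem (an assumption here, since the excerpt does not spell it out) is that shared 0's carry little information: two people who both do not own a yacht are not thereby similar. Jaccard's coefficient addresses this by ignoring 0-0 matches and dividing the number of 1-1 matches by the number of positions where at least one record has a 1. The sketch below contrasts it with a simple matching coefficient; the two records and the function names are made up for illustration.

```python
def simple_matching(a, b):
    """Share of positions where the two binary records agree (0-0 and 1-1 both count)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def jaccard(a, b):
    """1-1 matches divided by positions where at least one record has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # With no 1's at all the ratio is undefined; returning 0.0 is a choice made here.
    return both / either if either else 0.0


# Hypothetical records: four rare attributes (all 0) followed by two common ones.
person_a = [0, 0, 0, 0, 1, 1]
person_b = [0, 0, 0, 0, 0, 1]

print("simple matching:", simple_matching(person_a, person_b))
print("Jaccard:", jaccard(person_a, person_b))
```

The four shared 0's push simple matching to 5/6, while Jaccard's coefficient, which looks only at the two positions where a 1 appears, gives 1/2.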

Darwin’s Legacy in Statistics

Charles Darwin, the most famous grandson of the Enlightenment thinker Erasmus Darwin, published his ground-breaking theory of evolution, “The Origin of Species,” 160 years ago. Another grandson of Erasmus, Francis Galton, became one of the founding fathers of statistics (correlation, the “wisdom of the crowd,” regression…

Industry Spotlight: Customer Segmentation

Are you "young and rustic?" Or perhaps a "toolbelt traditionalist?" These are nicknames given to customer segments identified by market research firm Claritas, with its statistical clustering tool. Long before the advent of individualized product recommendations, businesses sought to segment customers into distinct groups on…

Industry Spotlight: CROs

CROs, or contract research organizations, are a $40 billion industry, growing at close to 12% per year. They provide contract services to the pharmaceutical industry, including statistical design and analysis, laboratory services, administration of clinical trials, and monitoring of drugs once they are on the…

Handling the Noise – Boost It or Ignore It?

In most statistical modeling or machine learning prediction tasks, there will be cases that can be easily predicted based on their predictor values (signal), as well as cases where predictions are unclear (noise). Two statistical learning methods, boosting and ProfWeight, use those difficult cases in…

Rectangular data

Rectangular data are the staple of statistical and machine learning models. Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measures) in which each column is a variable (feature), and each row is a case or record.

Good to Great

In 1994, Jim Collins and Jerry Porras, former and current Stanford professors, published the best-seller Built to Last that described how "long-term sustained performance can be engineered into the DNA of an enterprise."  It sold over a million copies. Buoyed by that success, Collins and a…

Selection Bias

Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample. It can occur in numerous situations; here are just a few:

Space Shuttle Explosion

In 1986, the U.S. space shuttle Challenger exploded 73 seconds after launch. A later investigation found that the cause of the disaster was O-ring failure, due to cold temperatures. The temperature at launch was 39 degrees, colder than any prior launch. The cold caused the…

Alaskan Generosity

People in Alaska are extraordinarily generous - that's what a predictive model showed, when applied to a charitable organization's donor list. A closer examination revealed a flaw - while the original data was for all 50 states, the model's training data for Alaska included donors,…

Why Analytics Projects Fail – 5 Reasons

With the news full of so many successes in the fields of analytics, machine learning and artificial intelligence, it is easy to lose sight of the high failure rate of analytics projects. McKinsey just came out with a report that only 8% of big companies…

Historical Spotlight – ISOQOL

25 years ago the International Society for Quality of Life Research was founded with a mission to advance the science of quality of life and related patient-centered outcomes in health research, care and policy. While focusing on quality of life (QOL) in healthcare may seem…

Likert Scale

A "Likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale. For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly." Two key decisions the survey designer faces are

  • How many gradients to allow, and

  • Whether to include a neutral midpoint

Football Analytics

Preparing for the Super Bowl: Your team is at midfield, you have the ball, it's 4th down with 2 yards to go. Should you go for it? (Apologies in advance to our many readers, especially those outside the U.S., who are not aficionados of American football,…

Job Spotlight: Digital Marketer

A digital marketer handles a variety of tasks in online marketing - managing online advertising and search engine optimization (SEO), implementing tracking systems (e.g. to identify how a person came to a retailer), web development, preparing creatives, implementing tests, and, of course, analytics. There are…

Dummy Variable

A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category.  Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
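
For the insurance example, the three dummy variables could be built along these lines. This is a minimal sketch: the column names (`is_residential` and so on), the helper `to_dummies`, and the sample policies are all hypothetical.

```python
CATEGORIES = ["residential", "commercial", "automotive"]


def to_dummies(value, categories=CATEGORIES):
    """Return one 0/1 indicator per category for a single policy type."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"is_{c}": int(value == c) for c in categories}


policies = ["commercial", "residential", "automotive", "commercial"]
rows = [to_dummies(p) for p in policies]

for policy, row in zip(policies, rows):
    print(policy, row)
```

Exactly one dummy is 1 in each row. For that reason, regression models typically use only two of the three, since the third is fully determined by the other two.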

Things are Getting Better

In the visualization below, which line do you think represents the UN's forecast for the number of children in the world in the year 2100? Hans Rosling, in his book Factfulness, presents this chart and notes that in a sample of Norwegian teachers, only 9%…

Artificial Lawyers

Can statistical and machine learning methods replace lawyers? A host of entrepreneurs think so, and so do the folks who run www.artificiallawyer.com. Text mining and predictive model products are available now to predict case staffing requirements and perform automated document discovery, and natural language algorithms conduct…

Entity Resolution and Identifying Bad Guys

Earlier, we described how Jen Golbeck (who teaches Network Analysis at Statistics.com) analyzed Facebook connections to identify fake accounts (the account holders' friends all had the same number of friends, which is highly improbable statistically). Network analysis and studying connections lie at the heart of…

Work and Heat

If you are working on New Year's Eve or New Year's Day, odds are it is from home, where you can (usually) control the temperature in the home. Which, from the standpoint of productivity, is a good thing. According to a study from Cornell, raising…

Curbstoning

Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb.  With a pretense of being an individual selling a car on his or her own, and with no fixed…

Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects. From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2. However, …

The False Alarm Conundrum

False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch's new feature that detects atrial…

Conditional Probability

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud").  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
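
One way to work the question is with Bayes' rule, using only the three figures given above. The variable names in this sketch are illustrative.

```python
prior_fraud = 0.10   # 1 in 10 claims is fraudulent
sensitivity = 0.90   # P(flagged | fraud)
specificity = 0.80   # P(not flagged | no fraud)

# Two ways a claim can end up flagged as fraud.
true_alarms = prior_fraud * sensitivity                # fraud, correctly flagged
false_alarms = (1 - prior_fraud) * (1 - specificity)   # non-fraud, wrongly flagged

p_flagged = true_alarms + false_alarms
p_fraud_given_flagged = true_alarms / p_flagged

print(f"P(flagged) = {p_flagged:.2f}")
print(f"P(fraud | flagged) = {p_fraud_given_flagged:.3f}")
```

With these numbers, 0.09 of all claims are true alarms and 0.18 are false alarms, so only 0.09 / 0.27, or one third, of the flagged claims are actually fraudulent.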

Instructor Spotlight – David Kleinbaum

David Kleinbaum developed several courses for Statistics.com, including Survival Analysis, Epidemiologic Statistics, and Designing Valid Statistical Studies. David retired a little over a year ago from Emory University, where he was a popular and effective teacher with the ability to distill and explain difficult statistical…

Book Review: ActivEpi

ActivEpi Web, by David Kleinbaum, is the text used in two Statistics.com courses (Epidemiology Statistics and Designing Valid Studies), but it is really a rich multimedia web-based presentation of epidemiological statistics, serving the role of a unique textbook format for an introductory course in the…

Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.).  A customer who leaves, for whatever reason, "churns."

Survival Analysis

Convinced that he, like his father, would die in his 40's, Winston Churchill lived his early life in a frenetic hurry. He had participated in four wars on three continents by his mid-20's, served in multiple ministerial positions by his 30's, and published 12 books…

How Google Determines Which Ads you See

A classic machine learning task is to predict something's class, usually binary - pictures as dogs or cats, insurance claims as fraud or not, etc. Often the goal is not a final classification, but an estimate of the probability of belonging to a class (propensity),…

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s: for example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:
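
In the usual convention, the two quantities are the true positive rate (the share of 1's correctly flagged) on the vertical axis and the false positive rate (the share of 0's wrongly flagged) on the horizontal axis, each computed across a range of score cutoffs. The sketch below traces those points for a small made-up set of model scores; the function names and data are illustrative, and it assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical fraud scores and true classes (1 = fraud, 0 = not fraud).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
print("AUC =", round(area_under_curve(points), 3))
```

A model that separates the classes perfectly hugs the top-left corner (area 1.0), while one that guesses at random follows the diagonal (area 0.5).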

 

Deming’s Funnel Problem

W. Edwards Deming's funnel problem is one of statistics' greatest hits. Deming was a noted statistician who took the statistical process control methods of Shewhart and expanded them into a holistic approach to manufacturing quality. Initially, his ideas were coolly received in the US and…

Industry Spotlight: the Auto Industry

The auto industry serves as a perfect exemplar of three key eras of statistics and data science in service of industry. Total Quality Management (TQM): First in Japan, and later in the U.S., the auto industry became an enthusiastic adherent to the Total Quality Management…

Analytics Professionals – Must They Be Good Communicators?

Most job ads in the technical arena list communication among the sought-after skills; it consistently outranks many programming and analytical skills. Is it for real, or is it just thrown in there by the HR Department on general principle? The founder of a leading analytics…

Prospective vs. Retrospective

A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you're measuring, who you're measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…

The Evolution of Clinical Trials

Boiling oil versus egg yolks: One early clinical trial was accidental. In the 16th century, a common treatment for wounded soldiers was to pour boiling oil on their wounds. In 1537, the surgeon Ambroise Paré, attending French soldiers, ran out of oil one evening. He…

GE Regresses to the Mean

Thirty years ago, GE became the brightest star in the firmament of statistical ideas in business when it adopted Six Sigma methods of quality improvement. Those methods had been introduced by Motorola, but Jack Welch's embrace of the same methods at GE, a diverse manufacturing…

Examples of Bad Forecasting

In a couple of days, the Wall Street Journal will come out with its November survey of economists' forecasts. It's a particularly sensitive time, with elections in a few days and President Trump attacking the Federal Reserve for raising interest rates. It's a good time to…

Historical Spotlight: Risk Simulation – Since 1946

Simulation - a Venerable History: One of the most consequential and valuable analytical tools in business is simulation, which helps us make decisions in the face of uncertainty, such as these: An airline knows, on average, what proportion of ticketed passengers show up for a…

“out-of-bag,” as in “out-of-bag error”

"Bag" refers to "bootstrap aggregating," repeatedly drawing of bootstrap samples from a dataset and aggregating the results of statistical models applied to the bootstrap samples. (A bootstrap sample is a resample drawn with replacement.)

BOOTSTRAP

I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
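
A minimal sketch of that idea, using only Python's standard library: draw resamples with replacement, recompute a statistic on each, and use the spread of the results as an estimate of its standard error. The data values and function names are made up for illustration.

```python
import random
import statistics


def bootstrap_resample(data, rng):
    """Draw len(data) values from data, randomly and with replacement."""
    return rng.choices(data, k=len(data))


def bootstrap_se(data, stat=statistics.mean, n_boot=1000, seed=0):
    """Estimate the standard error of stat from n_boot bootstrap resamples."""
    rng = random.Random(seed)
    estimates = [stat(bootstrap_resample(data, rng)) for _ in range(n_boot)]
    return statistics.stdev(estimates)


data = [12, 15, 9, 20, 14, 11, 17, 13]

print("one resample:", bootstrap_resample(data, random.Random(1)))
print("bootstrap SE of the mean:", bootstrap_se(data))
```

Because the draw is with replacement, a resample is the same size as the original data but will usually repeat some values and omit others.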

Same thing, different terms

The field of data science is rife with terminology anomalies, arising from the fact that the field comes from multiple disciplines.

 
