Probability
You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down
You might be wondering why such a basic word as probability appears here. It turns out that the term has deep tendrils in formal mathematics and philosophy, but is somewhat hard to pin down
In hospitals, “sentinel events” are events that carry with them a significant risk of unexpected death or harm. It is estimated that ⅔ of such sentinel events result from communications failures during the handoff of a patient from one provider to another (e.g. during a…
Density is a metric that describes how well-connected a network is
We have an extensive statistical glossary and have been sending out a "word of the week" newsfeed for a number of years. Take a look at the results
Consider the multi-arm bandit problem where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x% of the first, where x < 100%). The Gittens index is [...]
There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history -
Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
Hospitals are a major employer of statisticians and analytics professionals, both in support of clinical research like the retinopathy study described earlier, and to improve hospital operations (outcomes, cost management, etc.). Here are a few quick facts about the hospital industry: US hospital revenue totals…
Perhaps the most active application of analytics and data mining is healthcare. This week we look at one success story, the use of machine learning to predict diabetic retinopathy, one story of disappointment, the use of genetic testing in a puzzling disease, and a basic…
Question: A baseball team is comparing two of its hitters, Hernandez and Dimock. Hernandez hit .250 in 2017 and .275 in 2018. Dimock did worse in both years - .245 in 2017 and .270 in 2018. Overall, though, Dimock hit better across the two years,…
Some applications of machine learning and artificial intelligence are recognizably impressive - predicting future hospital readmission of discharged patients, for example, or diagnosing retinopathy. Others - self-driving cars, for example - seem almost magical. The matching problem, though, is one where your first reaction might…
Cliff T. Ragsdale teaches several courses for the Institute in the area of operations research, based on his best selling text “Spreadsheet Modeling and Decision Analysis.” One of Cliff’s special talents is making his subject, which can be quite challenging technically, widely accessible. His courses do…
When a new technology arrives, consulting companies can quickly add staff and expertise to build institutional capacity centered around the technology in ways companies focused on delivering their own products and services cannot. Large consulting companies like Booz Allen and McKinsey, as well as smaller…
The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Sabermetric Society in 1971 (the same year that Satchel Paige entered the Hall of Fame). Analytics has come a long way in…
Nothing better illustrates the encroachment of data science and analytics on the older “economy of tangible things” than the business of delivering packages. The use of analytics in package delivery is not new. Companies like UPS and Fedex are longtime users of operations research methods…
In the U.S., credit scoring is dominated by three companies - Experian, TransUnion and Equifax, employing roughly 30,000 people. An important player in the scoring methodology is FICO, previously Fair Isaac Corporation, and the scores are typically called “FICO scores.” Credit scoring is the oldest…
Weeds are big business - the global herbicide market is over $35 billion annually. Weeds are also big government (think “invasive species”). California’s listing of weeds is called Encycloweedia, and the state publishes a quarterly newsletter called Noxious Times. Colorado publishes a similar periodical, Invader.…
The application of analytics to agriculture has given rise to what is called “precision agriculture,” a science that seeks to take advantage of and use detailed information that is local in time and place. Tractors and farm equipment are being equipped with sensors and software…
Many jobs are centered around risk management. If you’re looking through job postings, of course, you’ll see lots of jobs whose purpose is to make sure that nothing bad happens - the equivalent of locking the doors and closing the windows. More interesting from a…
The auto industry serves as a perfect exemplar of three key eras of statistics and data science in service of industry: Total Quality Management (TQM) First in Japan, and later in the U.S., the auto industry became an enthusiastic adherent to the Total Quality Management…
Data science is one of a host of similar terms. “Artificial intelligence” has been around since the 1960’s and “data mining” for at least a couple of decades. “Machine learning” came out of the computer science community, and “analytics,” “data analytics,” and “predictive analytics” came…
Convinced that he, like his father, would die in his 40’s, Winston Churchill lived his early life in a frenetic hurry. He had participated in four wars on three continents by his mid-20’s, served in multiple ministerial positions by his 30’s, and published 12 books…
Do you work with multiple choice tests, or Likert scale assessment surveys? Rasch methods help you construct linear measures from these forms of scored observations and analyze the results from such surveys and tests. "Practical Rasch Measurement - Core Topics" In this course, you will learn practical…
World War II was a crucible of technological innovation, including advances in statistics. Jacob Wolfowitz, born a century ago (1920), looked at the problem of noisy radio transmissions. Coded radio transmissions were critical elements of military command and control, and they were plagued by the…
The Statistics.com courses have helped me a lot, pushing me to the limit and making me learn much more than I expected I could. The knowledge I gained I could immediately leverage in my job ... then eventually led to landing a job in my…
Certificate Student Profile of Cristobal Bazan My courses help me look at more complex problems using different approaches to show more interesting aspects of conditions, beyond just tables and charts, more than just sampling or descriptive statistics. Cristobal Bazan United Nations Agency How do you…
It is a truism of machine learning and predictive analytics that 80% of an analyst's time is consumed in cleaning and preparing the needed data. I saw an estimate by a Google engineer that 25% of the time was spent just looking for the right…
Many jobs are centered around risk management. If you're looking through job postings, of course, you'll see lots of jobs whose purpose is to make sure that nothing bad happens - the equivalent of locking the doors and closing the windows. More interesting from a…
Question: You work for an internet real-estate company, building statistical models to predict home price on the basis of square footage, number of bedrooms, number of bathrooms, property type (single family home, townhouse, multiplex), and age. Surprisingly, you find the coefficient for bedrooms is negative,…
The cost of bringing a new drug to market is over $2 billion, by some estimates. This covers the R&D, clinical trial testing and regulatory approval costs of the drug that makes it through the whole process, and also the same costs of the 9…
If you are looking for the Feature Engineering blog post, you can find it here: https://www.statistics.com/feature-engineering-data-prep-still-needed/ In 2015, at an Alzheimer's conference, Biogen researchers presented dramatic brain scans showing that the antibody aducanumab effectively cleared out plaque in the brain, plaque that was associated with…
This week's book review is of Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, Seth Stephens-Davidowitz's fascinating book about how social media data reveals all sorts of things about us that we barely know ourselves.…
The application of analytics to agriculture has given rise to what is called "precision agriculture", a science that seeks to take advantage of and use detailed information that is local in time and place. Tractors and farm equipment are being equipped with sensors and software…
In 1919, Ronald A. Fisher was appointed as chief statistician at the agricultural research station in Rothamsted, a post created for him. His work there resulted, in 1925, in the publication of his classic Statistical Methods for Research Workers. An important message of his book…
Prof. David Unwin has guided, developed and taught the spatial analysis curriculum at Statistics.com since 2005. David lives in central England, about an hour north of the storied Rothamsted agricultural research center. Until his retirement in 2002, he was Professor of Geography at Birkbeck College,…
Weeds are big business - the global herbicide market is over $35 billion annually. Weeds are also big government (think "invasive species"). California's listing of weeds is called Encycloweedia, and the state publishes a quarterly newsletter called Noxious Times. Colorado publishes a similar periodical, Invader.…
A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor).
Question: You have a supervised learning task with 30 predictors, in which 5% of the observations are missing. The missing data are randomly distributed across variables and records. If your strategy for coping with missing data is to drop records with missing data, what proportion…
Barry Eggleston is a health research statistician who has worked on both clinical trials and observational studies, and is currently with RTI in North Carolina. In his early career, his work was solely designing and analyzing clinical trials using typical biostatistics methods ranging from t-test…
On Wednesday, March 27, the 2018 Turing Award in computing was given to Yoshua Bengio, Geoffrey Hinton and Yann LeCun for their work on deep learning. Deep learning by complex neural networks lies behind the applications that are finally bringing artificial intelligence out of the…
In the U.S., credit scoring is dominated by three companies - Experian, TransUnion and Equifax, employing roughly 30,000 people. An important player in the scoring methodology is FICO, previously Fair Isaac Corporation, and the scores are typically called "FICO scores." Credit scoring is the oldest…
The IRS (U.S. Internal Revenue Service) has been using computers to choose tax returns for audit since 1962. Early on, the selection was rule-based, but the IRS turned to statistical modeling in 1969, using the oldest predictive analytics model in the toolbox - discriminant analysis.…
Cathy O'Neil's Weapons of Math Destruction, when it was first published in 2016, sounded an early alarm about the big data algorithms and their potential for social evil. The cover is adorned with a robotic death's head and the subtitle reads "How Big Data Increases…
80 years ago, in 1939, Alan Turing began work on the code-breaking system that would eventually prove key in helping Britain survive the German submarine threat in the Atlantic. Last month, the Turing Award in computer science prize (sometimes referred to as the "Nobel Prize…
To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing synonyms, like these:
To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these:
To a statistician, a sample is a collection of observations (cases). To a machine learner, it's a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms and synonyms, like these: Homonyms (words with multiple meanings): Bias: To…
Nothing better illustrates the encroachment of data science and analytics on the older "economy of tangible things" than the business of delivering packages. The use of analytics in package delivery is not new. Companies like UPS and Fedex are longtime users of operations research methods…
Prior to the advent of internet-connected devices, the largest source of big data was public interaction on the internet. Social media users, as well as shoppers and searchers on the internet, make an implicit deal with the big companies that provide these services: users can…
The field of sports statistician is not exactly new; the American Statistical Association's section on Sports Statistics was formed in 1992. Three of Statistics.com's instructors have professional experience in sports statistics - Ben Baumer (SQL) served as statistician for the NY Mets, Stephanie Kovalchik (Meta…
The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Sabermetric Society in 1971 (the same year that Satchel Paige entered the Hall of Fame). Analytics has come a long way in…
When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records. One of them is the "yacht owner" problem.
Charles Darwin, the most famous grandson of the Enlightenment thinker Erasmus Darwin, published his ground-breaking theory of evolution, “The Origin of Species,”160 years ago. Another grandson of Erasmus, Francis Galton, became one of the founding fathers of statistics (correlation, the “wisdom of the crowd,” regression…
Are you "young and rustic?" Or perhaps a "toolbelt traditionalist?" These are nicknames given to customer segments identified by market research firm Claritas, with its statistical clustering tool. Long before the advent of individualized product recommendations, business sought to segment customers into distinct groups on…
CRO's, or contract research organizations, are a $40 billion industry, growing at close to 12% per year. They provide contract services to the pharmaceutical industry, including statistical design and analysis, laboratory services, administration of clinical trials, and monitoring of drugs once they are on the…
In most statistical modeling or machine learning prediction tasks, there will be cases that can be easily predicted based on their predictor values (signal), as well as cases where predictions are unclear (noise). Two statistical learning methods, boosting and ProfWeight, use those difficult cases in…
Your country is at war, and an enemy plane has crashed on your territory. It bears the number 60, and a spy has told you that the aircraft are numbered serially. Can you make a guess about the total number of aircraft the enemy has…
Rectangular data are the staple of statistical and machine learning models. Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.
How did the phrase "defiantly recommend", as in "I defiantly recommend this product," come into common usage on the internet? The answer is a good look inside the workings of supervised learning. Supervision, generally from humans, is instrumental in much of statistical and machine learning.…
When a new technology arrives, consulting companies can quickly add staff and expertise to build institutional capacity centered around the technology in ways companies focused on delivering their own products and services cannot. Large consulting companies like Booz Allen and McKinsey, as well as smaller…
In 1994, Jim Collins and Jerry Porras, former and current Stanford professors, published the best-seller Built to Last that described how "long-term sustained performance can be engineered into the DNA of an enterprise." It sold over a million copies. Buoyed by that success, Collins and a…
Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample. It can occur in numerous situations, here are just a few:
In 1986, the U.S. space shuttle Challenger exploded several minutes after launch. A later investigation found that the cause of the disaster was O-ring failure, due to cold temperatures. The temperature at launch was 39 degrees, colder than any prior launch. The cold caused the…
People in Alaska are extraordinarily generous - that's what a predictive model showed, when applied to a charitable organization's donor list. A closer examination revealed a flaw - while the original data was for all 50 states, the model's training data for Alaska included donors,…
Abraham Wald, a persecuted Jewish mathematician who fled Austria just before World War II, led an analysis of allied bombers returning from missions. Hitherto, the Air Force had focused on reinforcing areas that showed the most damage on return. Wald convinced them instead to focus…
With the news full of so many successes in the fields of analytics, machine learning and artificial intelligence, it is easy to lose sight of the high failure rate of analytics projects. McKinsey just came out with a report that only 8% of big companies…
The statistics of targeting individual voters with specific messages, as opposed to messaging that went to whole groups, began in the U.S over a decade ago with the Democrats. Political targeting is now an established business, or at least a discipline within the broader realm…
The Art of Persuasion is the title of more than one book in the self-help genre, books that have spawned blogs, podcasts, speaking gigs and more. But the science of persuasion is actually of more interest, because it produces useful rules that can be studied…
25 years ago the International Society of Quality of Life Research was founded with a mission to advance the science of quality of life and related patient-centered outcomes in health research, care and policy. While focusing on quality of life (QOL) in healthcare may seem…
Daniel Kahneman won a Nobel Prize in Economics for his work in behavioral economics, much of it with his colleague Amos Tversky, who died in 2006. Kahneman's 2011 classic, Thinking Fast and Slow, is a superbly-written non-technical summary of their fascinating research and its often…
A "likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale. For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly." Two key decisions the survey designer faces are
How many gradients to allow, and
Whether to include a neutral midpoint
Preparing for the Superbowl Your team is at midfield, you have the ball, it's 4th down with 2 yards to go. Should you go for it? (Apologies in advance to our many readers, especially those outside the U.S., who are not aficionados of American football,…
A digital marketer handles a variety of tasks in online marketing - managing online advertising and search engine optimization (SEO), implementing tracking systems (e.g. to identify how a person came to a retailer), web development, preparing creatives, implementing tests, and, of course, analytics. There are…
A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category. Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
In the visualization below, which line do you think represents the UN's forecast for the number of children in the world in the year 2100? Hans Rosling, in his book Factfulness, presents this chart and notes that in a sample of Norwegian teachers, only 9%…
Can statistical and machine learning methods replace lawyers? A host of entrepreneurs think so, and do the folks who run www.artificiallawyer.com. Text mining and predictive model products are available now to predict case staffing requirements and perform automated document discovery, and natural language algorithms conduct…
Earlier, we described how Jen Golbeck (who teaches Network Analysis at Statistics.com) analyzed Facebook connections to identify fake accounts (the account holders friends all had the same number of friends, which is highly improbable statistically). Network analysis and studying connections lie at the heart of…
If you are working on New Year's Eve or New Year's Day, odds are it is from home, where you can (usually) control the temperature in the home. Which, from the standpoint of productivity, is a good thing. According to a study from Cornell, raising…
Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb. With a pretense of being an individual selling a car on his or her own, and with no fixed…
Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects. From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2. However, …
A researcher shakes a sprig from a Christmas tree, and counts the number of needles that fall. He then repeats the process for countless other sprigs. The sprigs are from a variety of species, and the goal is to determine which species do the best…
False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch's new feature that detects atrial…
QUESTION: The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent). A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud. The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud"). If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
David Kleinbaum developed several courses for Statistics.com, including Survival Analysis, Epidemiologic Statistics, and Designing Valid Statistical Studies. David retired a little over a year ago from Emory University, where he was a popular and effective teacher with the ability to distill and explain difficult statistical…
ActivEpi Web, by David Kleinbaum, is the text used in two Statistics.com courses (Epidemiology Statistics and Designing Valid Studies), but it is really a rich multimedia web-based presentation of epidemiological statistics, serving the role of a unique textbook format for an introductory course in the…
Churn is a term used in marketing to refer to the departure, over time, of customers. Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.). A customer who leaves, for whatever reason, "churns."
This weekend (12/8/2018) marked the 253rd anniversary of the birth of Eli Whitney, inventor of the cotton gin. And 20 years ago, Google received its first big infusions of capital from, among others, Jeff Bezos, the founder of Amazon. Both Eli Whitney and the Google…
Convinced that he, like his father, would die in his 40's, Winston Churchill lived his early life in a frenetic hurry. He had participated in four wars on three continents by his mid-20's, served in multiple ministerial positions by his 30's, and published 12 books…
A classic machine learning task is to predict something's class, usually binary - pictures as dogs or cats, insurance claims as fraud or not, etc. Often the goal is not a final classification, but an estimate of the probability of belonging to a class (propensity),…
Data science is one of a host of similar terms. Artificial intelligence has been around since the 1960's and data mining for at least a couple of decades. Machine learning came out of the computer science community, and analytics, data analytics, and predictive analytics came…
The Receiver Operating Characteristics (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s. For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:
Predictim is a service that scans potential babysitters' social media and other online activity and issues them a score that parents can use to select babysitters. Jeff Chester, the executive director of the Center for Digital Democracy, commented: There's a mad rush to seize the…
W. Edwards Deming's funnel problem is one of statistics' greatest hits. Deming was a noted statistician who took the statistical process control methods of Shewhart and expanded them into a holistic approach to manufacturing quality. Initially, his ideas were cooly received in the US and…
The auto industry serves as a perfect exemplar of three key eras of statistics and data science in service of industry: Total Quality Management (TQM) First in Japan, and later in the U.S., the auto industry became an enthusiastic adherent to the Total Quality Management…
Most job ads in the technical arena list communication among the sought-after skills; it consistently outranks many programming and analytical skills. Is it for real, or is it just thrown in there by the HR Department on general principle? The founder of a leading analytics…
A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you’re measuring, who you’re measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…
Boiling oil versus egg yolks One early clinical trial was accidental. In the 16th century, a common treatment for wounded soldiers was to pour boiling oil on their wounds. In 1537, the surgeon Ambroise Pare, attending French soldiers, ran out of oil one evening. He…
An ethical algorithm... Ethics in algorithms is a popular topic now. Usually the conversation centers around the possible unintentional bias or harm that a statistical or machine learning algorithm could do when it is used to select, score, rate, or rank people. For example -…
Thirty years ago, GE became the brightest star in the firmament of statistical ideas in business when it adopted Six Sigma methods of quality improvement. Those methods had been introduced by Motorola, but Jack Welch's embrace of the same methods at GE, a diverse manufacturing…