Density
Density is a metric that describes how well-connected a network is: the proportion of possible connections between nodes that are actually present.
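As a rough illustration, the density of a simple undirected network can be computed as the number of ties present divided by the number of ties possible, n(n-1)/2 for n nodes. The function name, the edge-list representation, and the 4-node example below are assumptions made for this sketch, not part of the glossary entry.

```python
def network_density(num_nodes, edges):
    """Density of a simple undirected network: edges present / edges possible."""
    if num_nodes < 2:
        return 0.0
    # Treat (a, b) and (b, a) as the same tie, and ignore self-loops.
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    possible = num_nodes * (num_nodes - 1) / 2
    return len(unique_edges) / possible


# Hypothetical 4-node network in which 3 of the 6 possible ties are present.
example_density = network_density(4, [(1, 2), (2, 3), (3, 4)])
print(example_density)
```

A density of 1 would mean every node is tied to every other node; values near 0 indicate a sparse network.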
We have an extensive statistical glossary and have been sending out a "word of the week" newsfeed for a number of years. Take a look at the results.
Consider the multi-armed bandit problem where each arm has an unknown probability of paying either 0 or 1, and a specified payoff discount factor of x (i.e. for two successive payoffs, the second is valued at x% of the first, where x < 100%). The Gittins index is [...]
There are various ways to recommend additional products to an online purchaser, and the most effective ones rely on prior purchase or rating history -
Autoregressive refers to time series forecasting models (AR models) in which the independent variables (predictors) are prior values of the time series itself.
Hospitals are a major employer of statisticians and analytics professionals, both in support of clinical research like the retinopathy study described earlier, and to improve hospital operations (outcomes, cost management, etc.). Here are a few quick facts about the hospital industry: US hospital revenue totals…
Question: A baseball team is comparing two of its hitters, Hernandez and Dimock. Hernandez hit .250 in 2017 and .275 in 2018. Dimock did worse in both years - .245 in 2017 and .270 in 2018. Overall, though, Dimock hit better across the two years,…
Some applications of machine learning and artificial intelligence are recognizably impressive - predicting future hospital readmission of discharged patients, for example, or diagnosing retinopathy. Others - self-driving cars, for example - seem almost magical. The matching problem, though, is one where your first reaction might…
Cliff T. Ragsdale teaches several courses for the Institute in the area of operations research, based on his best selling text “Spreadsheet Modeling and Decision Analysis.” One of Cliff’s special talents is making his subject, which can be quite challenging technically, widely accessible. His courses do…
When a new technology arrives, consulting companies can quickly add staff and expertise to build institutional capacity centered around the technology in ways companies focused on delivering their own products and services cannot. Large consulting companies like Booz Allen and McKinsey, as well as smaller…
The U.S. baseball season opens Thursday, March 28, and celebrates the 48th season of analytics in baseball, beginning with the founding of the Sabermetric Society in 1971 (the same year that Satchel Paige entered the Hall of Fame). Analytics has come a long way in…
Nothing better illustrates the encroachment of data science and analytics on the older “economy of tangible things” than the business of delivering packages. The use of analytics in package delivery is not new. Companies like UPS and Fedex are longtime users of operations research methods…
In the U.S., credit scoring is dominated by three companies - Experian, TransUnion and Equifax, employing roughly 30,000 people. An important player in the scoring methodology is FICO, previously Fair Isaac Corporation, and the scores are typically called “FICO scores.” Credit scoring is the oldest…
Weeds are big business - the global herbicide market is over $35 billion annually. Weeds are also big government (think “invasive species”). California’s listing of weeds is called Encycloweedia, and the state publishes a quarterly newsletter called Noxious Times. Colorado publishes a similar periodical, Invader.…
The application of analytics to agriculture has given rise to what is called “precision agriculture,” a science that seeks to take advantage of and use detailed information that is local in time and place. Tractors and farm equipment are being equipped with sensors and software…
Many jobs are centered around risk management. If you’re looking through job postings, of course, you’ll see lots of jobs whose purpose is to make sure that nothing bad happens - the equivalent of locking the doors and closing the windows. More interesting from a…
The auto industry serves as a perfect exemplar of three key eras of statistics and data science in service of industry: Total Quality Management (TQM) First in Japan, and later in the U.S., the auto industry became an enthusiastic adherent to the Total Quality Management…
Data science is one of a host of similar terms. “Artificial intelligence” has been around since the 1960’s and “data mining” for at least a couple of decades. “Machine learning” came out of the computer science community, and “analytics,” “data analytics,” and “predictive analytics” came…
Convinced that he, like his father, would die in his 40’s, Winston Churchill lived his early life in a frenetic hurry. He had participated in four wars on three continents by his mid-20’s, served in multiple ministerial positions by his 30’s, and published 12 books…
World War II was a crucible of technological innovation, including advances in statistics. Jacob Wolfowitz, born a century ago (1920), looked at the problem of noisy radio transmissions. Coded radio transmissions were critical elements of military command and control, and they were plagued by the…
The Statistics.com courses have helped me a lot, pushing me to the limit and making me learn much more than I expected I could. The knowledge I gained I could immediately leverage in my job ... then eventually led to landing a job in my…
Certificate Student Profile of Cristobal Bazan My courses help me look at more complex problems using different approaches to show more interesting aspects of conditions, beyond just tables and charts, more than just sampling or descriptive statistics. Cristobal Bazan United Nations Agency How do you…
It is a truism of machine learning and predictive analytics that 80% of an analyst's time is consumed in cleaning and preparing the needed data. I saw an estimate by a Google engineer that 25% of the time was spent just looking for the right…
Question: You work for an internet real-estate company, building statistical models to predict home price on the basis of square footage, number of bedrooms, number of bathrooms, property type (single family home, townhouse, multiplex), and age. Surprisingly, you find the coefficient for bedrooms is negative,…
The cost of bringing a new drug to market is over $2 billion, by some estimates. This covers the R&D, clinical trial testing and regulatory approval costs of the drug that makes it through the whole process, and also the same costs of the 9…
This week's book review is of Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, Seth Stephens-Davidowitz's fascinating book about how social media data reveals all sorts of things about us that we barely know ourselves.…
In 1919, Ronald A. Fisher was appointed as chief statistician at the agricultural research station in Rothamsted, a post created for him. His work there resulted, in 1925, in the publication of his classic Statistical Methods for Research Workers. An important message of his book…
Prof. David Unwin has guided, developed and taught the spatial analysis curriculum at Statistics.com since 2005. David lives in central England, about an hour north of the storied Rothamsted agricultural research center. Until his retirement in 2002, he was Professor of Geography at Birkbeck College,…
A tensor is the multidimensional extension of a matrix (i.e. scalar > vector > matrix > tensor).
Question: You have a supervised learning task with 30 predictors, in which 5% of the observations are missing. The missing data are randomly distributed across variables and records. If your strategy for coping with missing data is to drop records with missing data, what proportion…
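The arithmetic behind this kind of question can be sketched briefly. The sketch assumes that "5% missing" means each individual value is missing with probability 0.05, independently of the others, and that the truncated question is asking about the share of records lost; both readings are assumptions, and the function name is hypothetical.

```python
def complete_record_fraction(n_predictors, missing_rate):
    """Probability that a record has no missing values in any predictor,
    assuming values are missing independently at the same rate."""
    return (1 - missing_rate) ** n_predictors


kept = complete_record_fraction(30, 0.05)   # 0.95 ** 30
dropped = 1 - kept
print(f"kept: {kept:.3f}, dropped: {dropped:.3f}")
```

Under these assumptions only about one record in five survives, which is why dropping incomplete records is costly when there are many predictors.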
Barry Eggleston is a health research statistician who has worked on both clinical trials and observational studies, and is currently with RTI in North Carolina. In his early career, his work was solely designing and analyzing clinical trials using typical biostatistics methods ranging from t-test…
The IRS (U.S. Internal Revenue Service) has been using computers to choose tax returns for audit since 1962. Early on, the selection was rule-based, but the IRS turned to statistical modeling in 1969, using the oldest predictive analytics model in the toolbox - discriminant analysis.…
Cathy O'Neil's Weapons of Math Destruction, when it was first published in 2016, sounded an early alarm about the big data algorithms and their potential for social evil. The cover is adorned with a robotic death's head and the subtitle reads "How Big Data Increases…
80 years ago, in 1939, Alan Turing began work on the code-breaking system that would eventually prove key in helping Britain survive the German submarine threat in the Atlantic. Last month, the Turing Award in computer science (sometimes referred to as the "Nobel Prize…
To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing synonyms, like these:
To a statistician, a sample is a collection of observations (cases). To a machine learner, it’s a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms like these:
To a statistician, a sample is a collection of observations (cases). To a machine learner, it's a single observation. Modern data science has its origin in several different fields, which leads to potentially confusing homonyms and synonyms, like these: Homonyms (words with multiple meanings): Bias: To…
Prior to the advent of internet-connected devices, the largest source of big data was public interaction on the internet. Social media users, as well as shoppers and searchers on the internet, make an implicit deal with the big companies that provide these services: users can…
The field of sports statistics is not exactly new; the American Statistical Association's section on Sports Statistics was formed in 1992. Three of Statistics.com's instructors have professional experience in sports statistics - Ben Baumer (SQL) served as statistician for the NY Mets, Stephanie Kovalchik (Meta…
When variables have binary (yes/no) values, a couple of issues come up when measuring distance or similarity between records. One of them is the "yacht owner" problem.
Charles Darwin, the most famous grandson of the Enlightenment thinker Erasmus Darwin, published his ground-breaking theory of evolution, “The Origin of Species,” 160 years ago. Another grandson of Erasmus, Francis Galton, became one of the founding fathers of statistics (correlation, the “wisdom of the crowd,” regression…
Are you "young and rustic?" Or perhaps a "toolbelt traditionalist?" These are nicknames given to customer segments identified by market research firm Claritas, with its statistical clustering tool. Long before the advent of individualized product recommendations, business sought to segment customers into distinct groups on…
CROs, or contract research organizations, are a $40 billion industry, growing at close to 12% per year. They provide contract services to the pharmaceutical industry, including statistical design and analysis, laboratory services, administration of clinical trials, and monitoring of drugs once they are on the…
In most statistical modeling or machine learning prediction tasks, there will be cases that can be easily predicted based on their predictor values (signal), as well as cases where predictions are unclear (noise). Two statistical learning methods, boosting and ProfWeight, use those difficult cases in…
Your country is at war, and an enemy plane has crashed on your territory. It bears the number 60, and a spy has told you that the aircraft are numbered serially. Can you make a guess about the total number of aircraft the enemy has…
Rectangular data are the staple of statistical and machine learning models. Rectangular data are multivariate cross-sectional data (i.e. not time-series or repeated measure) in which each column is a variable (feature), and each row is a case or record.
In 1994, Jim Collins and Jerry Porras, former and current Stanford professors, published the best-seller Built to Last that described how "long-term sustained performance can be engineered into the DNA of an enterprise." It sold over a million copies. Buoyed by that success, Collins and a…
Selection bias is a sampling or data collection process that yields a biased, or unrepresentative, sample. It can occur in numerous situations; here are just a few:
In 1986, the U.S. space shuttle Challenger exploded 73 seconds after launch. A later investigation found that the cause of the disaster was O-ring failure, due to cold temperatures. The temperature at launch was 39 degrees, colder than any prior launch. The cold caused the…
The statistics of targeting individual voters with specific messages, as opposed to messaging that went to whole groups, began in the U.S. over a decade ago with the Democrats. Political targeting is now an established business, or at least a discipline within the broader realm…
25 years ago the International Society of Quality of Life Research was founded with a mission to advance the science of quality of life and related patient-centered outcomes in health research, care and policy. While focusing on quality of life (QOL) in healthcare may seem…
A "Likert scale" is used in self-report rating surveys to allow users to express an opinion or assessment of something on a gradient scale. For example, a response could range from "agree strongly" through "agree somewhat" and "disagree somewhat" on to "disagree strongly." Two key decisions the survey designer faces are:
How many gradients to allow, and
Whether to include a neutral midpoint
Preparing for the Super Bowl: Your team is at midfield, you have the ball, it's 4th down with 2 yards to go. Should you go for it? (Apologies in advance to our many readers, especially those outside the U.S., who are not aficionados of American football,…
A digital marketer handles a variety of tasks in online marketing - managing online advertising and search engine optimization (SEO), implementing tracking systems (e.g. to identify how a person came to a retailer), web development, preparing creatives, implementing tests, and, of course, analytics. There are…
A dummy variable is a binary (0/1) variable created to indicate whether a case belongs to a particular category. Typically a dummy variable will be derived from a multi-category variable. For example, an insurance policy might be residential, commercial or automotive, and there would be three dummy variables created:
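The insurance example can be sketched in a few lines. The helper name, the `is_` column prefix, and the four sample policies below are assumptions chosen for illustration.

```python
def make_dummies(values, categories):
    """Return one dict of 0/1 indicators per case, one indicator per category."""
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical policy types for four cases.
policy_types = ["residential", "commercial", "automotive", "commercial"]
categories = ["residential", "commercial", "automotive"]

dummies = make_dummies(policy_types, categories)
for policy, row in zip(policy_types, dummies):
    print(policy, row)
```

Each case gets a 1 in exactly one of the three indicators. In a regression with an intercept, one of the three is usually left out, because its value is fully determined by the other two.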
In the visualization below, which line do you think represents the UN's forecast for the number of children in the world in the year 2100? Hans Rosling, in his book Factfulness, presents this chart and notes that in a sample of Norwegian teachers, only 9%…
Can statistical and machine learning methods replace lawyers? A host of entrepreneurs think so, as do the folks who run www.artificiallawyer.com. Text mining and predictive model products are available now to predict case staffing requirements and perform automated document discovery, and natural language algorithms conduct…
If you are working on New Year's Eve or New Year's Day, odds are it is from home, where you can (usually) control the temperature in the home. Which, from the standpoint of productivity, is a good thing. According to a study from Cornell, raising…
Curbstoning, to an established auto dealer, is the practice of unlicensed car dealers selling cars from streetside, where the cars may be parked along the curb. With a pretense of being an individual selling a car on his or her own, and with no fixed…
Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects. From a statistical perspective, the method is prone to high variance and bias, compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from that produced by starting with subject 2, even if the resulting sample in both cases contains both subject 1 and subject 2. However, …
A researcher shakes a sprig from a Christmas tree, and counts the number of needles that fall. He then repeats the process for countless other sprigs. The sprigs are from a variety of species, and the goal is to determine which species do the best…
False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch's new feature that detects atrial…
QUESTION: The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent). A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud. The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as "fraud"). If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
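One way to work through the question is Bayes' rule: divide the share of claims that are fraudulent and flagged by the share of all claims that are flagged. The sketch below uses the figures given in the question; the function and parameter names are assumptions for illustration.

```python
def prob_fraud_given_flag(fraud_rate, detect_rate, correct_nonfraud_rate):
    """P(fraud | flagged as fraud) by Bayes' rule."""
    # Fraudulent claims that the system flags (true positives).
    true_pos = fraud_rate * detect_rate
    # Legitimate claims that the system mistakenly flags (false positives).
    false_pos = (1 - fraud_rate) * (1 - correct_nonfraud_rate)
    return true_pos / (true_pos + false_pos)


p = prob_fraud_given_flag(0.10, 0.90, 0.80)
print(round(p, 3))
```

Out of every 100 claims, 9 are fraudulent and flagged while 18 are legitimate but flagged, so 9 of the 27 flagged claims, or one in three, are really fraudulent.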
David Kleinbaum developed several courses for Statistics.com, including Survival Analysis, Epidemiologic Statistics, and Designing Valid Statistical Studies. David retired a little over a year ago from Emory University, where he was a popular and effective teacher with the ability to distill and explain difficult statistical…
ActivEpi Web, by David Kleinbaum, is the text used in two Statistics.com courses (Epidemiology Statistics and Designing Valid Studies), but it is really a rich multimedia web-based presentation of epidemiological statistics, serving the role of a unique textbook format for an introductory course in the…
Churn is a term used in marketing to refer to the departure, over time, of customers. Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons (switching to a competitor, dissatisfaction, credit card expires, customer moves, etc.). A customer who leaves, for whatever reason, "churns."
This weekend (12/8/2018) marked the 253rd anniversary of the birth of Eli Whitney, inventor of the cotton gin. And 20 years ago, Google received its first big infusions of capital from, among others, Jeff Bezos, the founder of Amazon. Both Eli Whitney and the Google…
A classic machine learning task is to predict something's class, usually binary - pictures as dogs or cats, insurance claims as fraud or not, etc. Often the goal is not a final classification, but an estimate of the probability of belonging to a class (propensity),…
The Receiver Operating Characteristics (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s. For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities:
W. Edwards Deming's funnel problem is one of statistics' greatest hits. Deming was a noted statistician who took the statistical process control methods of Shewhart and expanded them into a holistic approach to manufacturing quality. Initially, his ideas were coolly received in the US and…
Most job ads in the technical arena list communication among the sought-after skills; it consistently outranks many programming and analytical skills. Is it for real, or is it just thrown in there by the HR Department on general principle? The founder of a leading analytics…
A prospective study is one that identifies a scientific (usually medical) problem to be studied, specifies a study design protocol (e.g. what you’re measuring, who you’re measuring, how many subjects, etc.), and then gathers data in the future in accordance with the design. The definition…
Boiling oil versus egg yolks One early clinical trial was accidental. In the 16th century, a common treatment for wounded soldiers was to pour boiling oil on their wounds. In 1537, the surgeon Ambroise Paré, attending French soldiers, ran out of oil one evening. He…
An ethical algorithm... Ethics in algorithms is a popular topic now. Usually the conversation centers around the possible unintentional bias or harm that a statistical or machine learning algorithm could do when it is used to select, score, rate, or rank people. For example -…
Thirty years ago, GE became the brightest star in the firmament of statistical ideas in business when it adopted Six Sigma methods of quality improvement. Those methods had been introduced by Motorola, but Jack Welch's embrace of the same methods at GE, a diverse manufacturing…
In a couple of days, the Wall Street Journal will come out with its November survey of economists' forecasts. It's a particularly sensitive time, with elections in a few days and President Trump attacking the Federal Reserve for raising interest rates. It's a good time to…
Simulation - a Venerable History One of the most consequential and valuable analytical tools in business is simulation, which helps us make decisions in the face of uncertainty, such as these: An airline knows on average, what proportion of ticketed passengers show up for a…
It is 100 years since R. A. Fisher introduced the concept of "variance" (in his 1918 paper The Correlation Between Relatives on the Supposition of Mendelian Inheritance).
"Bag" refers to "bootstrap aggregating," repeatedly drawing of bootstrap samples from a dataset and aggregating the results of statistical models applied to the bootstrap samples. (A bootstrap sample is a resample drawn with replacement.)
I used the term in my message about bagging and several people asked for a review of the bootstrap. Put simply, to bootstrap a dataset is to draw a resample from the data, randomly and with replacement.
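A minimal sketch of the idea, using only the standard library: draw resamples of the same size as the data, with replacement, and recompute a statistic on each (the mean here). The dataset, seed, number of resamples, and function names are all assumptions made for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """One resample: same size as the data, drawn randomly with replacement."""
    return [rng.choice(data) for _ in data]


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each of n_resamples bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        resample = bootstrap_sample(data, rng)
        means.append(sum(resample) / len(resample))
    return means


# Hypothetical small dataset.
data = [3, 5, 7, 8, 12, 13, 14, 18, 21]
means = bootstrap_means(data)
print(min(means), max(means))
```

The spread of the resampled means gives a sense of how much the sample mean would vary from sample to sample. Bagging applies the same resampling step, but fits a model to each resample and aggregates the models' predictions.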
It is 100 years since R. A. Fisher introduced the concept of "variance" (in his 1918 paper "The Correlation Between Relatives on the Supposition of Mendelian Inheritance"). There is much that statistics has given us in the century that followed. Randomized clinical trials, and the means to…
Casting back long before the advent of Deep Learning for the "founding fathers" of data science, at first glance you would rule out antecedents who long predate the computer and data revolutions of the last quarter century. But some consider John Tukey (below), the Princeton statistician…
Python started out as a general-purpose language when it was created in 1991 by Guido van Rossum. It was embraced early on by Google founders Sergey Brin and Larry Page ("Python where we can, C++ where we must" was reputedly their mantra). In 2006,…
Deep learning is essentially "neural networks on steroids" and it lies at the core of the most intriguing and powerful applications of artificial intelligence. Facial recognition (which you encounter daily in Facebook and other social media) harnesses many levels of data science tools, including algorithms…
SEM stands for "structural equation modeling," and we are fortunate to have Prof. Randall Schumacker teaching this subject at Statistics.com. Randy created the Structural Equation Modeling (SEM) journal in 1994 and the Structural Equation Modeling Special Interest Group (SIG) at the American Educational Research Association…
Fake social media accounts and Russian meddling in US elections have been in the news lately, with Mark Zuckerberg (Facebook founder) testifying this week before the US Congress. Dr. Jen Golbeck, who teaches Network Analysis at Statistics.com, published an ingenious way to determine whether a…
Cambridge Analytica's wholesale scraping of Facebook user data is big news now, and people are shocked that personal data is being shared and traded on a massive scale on the internet. But the real issue with social media is not harm to individual users whose…
Two important statistical modeling courses are coming up in May: Principal Components and Factor Analysis (May 18 - Jun 15) and Modeling Count Data (May 18 - Jun 15). Factor analysis is used frequently in social science research where you want to examine that which…
We just attended the analytics conference of INFORMS (the Institute for Operations Research and the Management Sciences) this week in Baltimore, where a special meeting was held for directors of academic analytics programs to better align what universities are producing with what industry is seeking.…
Do you work with multiple choice tests, or Likert scale assessment surveys? Rasch methods help you construct linear measures from these forms of scored observations and analyze the results from such surveys and tests. "Practical Rasch Measurement - Core Topics" In this course, you will…