In casual statistical analysis, you sometimes hear references to outliers, along with the suggestion that they should be ignored or dropped from the analysis. Quite the contrary: often it is the outliers that convey useful information. They may represent errors in data collection, e.g., a misplaced decimal point. Or they may be perfectly normal: consider these annual enrollments in a set of ten courses at Statistics.com: (8, 12, 21, 17, 6, 13, 29, 180, 11, 13). The 180 is an outlier, but it is not incorrect: it is the enrollment in our introductory statistics course, while the others represent more advanced courses. Outliers may also represent real data of great significance: the currently popular book Furious Hours describes the activities of one outlier in insurance history, Willie Maxwell, who took out dozens of life insurance policies on five relatives, then collected the proceeds after their mysterious deaths.
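The enrollment figures above show how a single legitimate outlier can dominate a summary statistic. A minimal sketch (using the enrollment numbers from the text) comparing the mean, which the 180 pulls sharply upward, with the median, which is robust to it:

```python
import statistics

enrollments = [8, 12, 21, 17, 6, 13, 29, 180, 11, 13]

mean = statistics.mean(enrollments)      # pulled far upward by the 180
median = statistics.median(enrollments)  # unaffected by the single outlier

print(f"mean = {mean}, median = {median}")
```

Neither number is "wrong"; the point is that the choice of summary statistic, not the deletion of the outlier, should reflect what you want to describe.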
Often it is the outliers, and not the general data, that are of interest. The airline passenger who pays for his ticket in cash. A sudden series of large invoices from a new supplier. The credit card charge from a sporting goods store in Florida at a time when the customer is at home in New York. The search is for the unusual, not the usual.
One noted example of outlier discovery came in the Social Security report of U.S. “super-earners” in 2009. A super-earner is a taxpayer who reports more than $50 million in wages. While there are thousands of people in the U.S. with incomes above $50 million, those with wages that high are relatively few: there were only 74 in 2009. The Social Security Administration reported that the average wage of these select few more than quintupled in 2009, to $519 million. Coming during a severe recession caused by the financial collapse of 2008, this quintupling of the uber-rich’s income became a political lightning rod.
Shortly after the report was issued, analysts found that just two individuals were responsible for this entire increase. Between them, these two taxpayers reported more than $32 billion in income on multiple W-2 (tax) filings. No information was available on who the individuals were or why they reported such astronomical sums. However, the Social Security Administration did determine that the filings from the two individuals were in error and issued a revised report. The results?
- 2009 super-earner average wages actually declined 7.7 percent from 2008 instead of quintupling.
- 2009 average wages for all workers declined $598 from 2008; the original report had shown a decline of only $384.
These two outliers had a huge and misleading impact on key government statistics. They added a false $214 to the average wage of all workers, and when they were removed, the recession’s measured hit to wages grew by more than 50 percent. At the same time, the fuel they added to the income-distribution debate was illusory.
Outlier detection is used in many applications:
- Network security (detecting anomalous data packets that might pose a threat)
- Fraud detection (uncovering fraudulent insurance claims or tax returns)
- Credit card misuse (raising alerts on potentially fraudulent transactions)
- Predicting electrical system failure (finding operating anomalies that presage a failure)
- Prescription abuse (detecting excessive or inappropriate prescriptions)
Two examples fall into the last category:
One research group (Weiss et al.) studied insurance records to identify physicians who over-prescribed opioid pain medications. Their algorithm was simple:
- Identify the top decile of providers
- Within that decile, use k-nearest-neighbors to predict a physician’s level of prescribing
- Identify those whose actual prescriptions exceeded predicted prescriptions by a certain amount
This approach had the advantage of being explainable to physicians: under the nearest-neighbor algorithm, each physician was being compared to his or her peers.
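The three steps above can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the features, the 2-standard-deviation flagging threshold, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic provider data: practice-profile features (hypothetical, e.g.
# specialty mix, patient load) and opioid prescription counts
n = 1000
features = rng.normal(size=(n, 4))
prescriptions = 50 + 10 * features[:, 0] + rng.normal(scale=5, size=n)

# Step 1: restrict attention to the top decile of prescribers
cutoff = np.quantile(prescriptions, 0.9)
top = prescriptions >= cutoff
X, y = features[top], prescriptions[top]

# Step 2: predict each physician's expected prescribing level from
# similar peers (k nearest neighbors). In practice one would exclude
# each physician from his or her own neighborhood (leave-one-out).
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)
expected = knn.predict(X)

# Step 3: flag those whose actual prescribing exceeds the peer-based
# prediction by a chosen amount (here, two standard deviations)
residual = y - expected
flagged = residual > 2 * residual.std()
print(f"{flagged.sum()} of {len(y)} top-decile providers flagged")
```

The explainability the text mentions comes from step 2: a flagged physician can be shown the specific peer group against which the expected level was computed.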
Another group of researchers in Germany (Hirsch et al) used multiple techniques to identify physician practices that were prescribing expensive drugs at a high rate. The German national health insurance authorities were interested in using these results to monitor prescribing tendencies at physician practices. Unlike the opioid example, where the goal was to address the addiction crisis, in this case the motivation was to limit the cost exposure of the insuring agency.
Elder Research has done considerable work on detecting opioid over-use and over-prescription using a variety of techniques, including data preparation for manual outlier detection, use of its own fraud-detection visualization tool (RADR), predictive modeling to identify potential user addiction, and network analysis to identify fraudulent health providers.
Outlier detection, when not a goal in itself, is often an intermediate step in the model building process:
- In one project the training data included expenditure amounts, some of which had been corrected after the data were collected. This meant that the model-fitting process was using data that would not be available at deployment (i.e. the nonzero “corrected-later” records). The analyst would not have noticed this were it not for a discrepancy between the expenditure amount and a second binary variable that reported whether there had been expenditure. The latter variable was not updated in the correction process.
- In another text mining project, the analysts thought they were working with an English-only website, only to discover as part of the topic modeling that there were other languages.
- In many commercial datasets, numeric outliers like clusters of $9999.99 represent “missing” or “unknown.”
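Sentinel codes like $9999.99 are easy to mistake for real amounts. A minimal pandas sketch (the column name and dollar values are hypothetical) that recodes the sentinel as a proper missing value before summarizing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"claim_amount": [120.50, 9999.99, 88.00, 9999.99, 42.25]})

# A naive summary treats the sentinel as a real dollar amount
naive_mean = df["claim_amount"].mean()

# Recode the "missing/unknown" sentinel, then summarize
df["claim_amount"] = df["claim_amount"].replace(9999.99, np.nan)
clean_mean = df["claim_amount"].mean()  # pandas skips NaN by default

print(f"naive mean: {naive_mean:.2f}, after recoding: {clean_mean:.2f}")
```

Here the sentinel inflates the naive mean by a factor of nearly fifty, which is exactly why a cluster of identical extreme values deserves a second look.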
Outlier analysis can be crucial when an erroneous value has a significant impact on the model. In one project, a data error for just one baby turned weight, counter-intuitively, into a positive correlate of risk – everyone knows that higher preemie weights translate to better health. Some data detective work turned up one 800-pound baby (a clerical error) that, all by itself, flipped the sign of the regression slope for weight.
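The effect is easy to reproduce with synthetic data (the numbers below are illustrative, not the project's actual records): one wildly erroneous high-leverage point reverses the sign of a fitted slope.

```python
import numpy as np

# Synthetic preemie data: birth weight (pounds) vs. a health score,
# where heavier genuinely means healthier
weights = np.array([2.1, 2.8, 3.5, 4.0, 4.6, 5.2])
health = np.array([40.0, 48.0, 55.0, 61.0, 66.0, 72.0])

slope_clean = np.polyfit(weights, health, 1)[0]  # positive slope

# Add one clerical-error record: an "800-pound" baby with a poor outcome
weights_err = np.append(weights, 800.0)
health_err = np.append(health, 30.0)

slope_err = np.polyfit(weights_err, health_err, 1)[0]  # sign flips

print(f"clean slope: {slope_clean:.2f}, with error: {slope_err:.4f}")
```

A single point this far from the rest of the data has enormous leverage in least-squares fitting, which is why one bad record can dominate an otherwise clear relationship.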
Note that the concept of “outlier” or “anomaly” does not necessarily require that a record’s variable values lie outside the boundaries of their potential values. In one aerospace project, for example, the relevant variables for projecting trajectories included velocity, angle of attack, altitude, yaw, pitch, and roll. It is entirely possible for there to be numerous infeasible or anomalous combinations with individual values that lie within the range for each variable. For example, a trajectory could have within-range yaw (feasible when taken by itself) at one point, and within-range roll (feasible when taken by itself) at another, yet the combination of the two might be aerodynamically impossible.
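One standard way to catch such jointly infeasible combinations is a multivariate distance: each variable can be within its own range while the combination is far from anything observed. A sketch using Mahalanobis distance (the yaw/roll framing follows the aerospace example above; the data are synthetic and the correlation is assumed):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic trajectory readings: yaw and roll each stay in range,
# but (by assumption here) move together in feasible flight
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
normal_points = rng.multivariate_normal([0.0, 0.0], cov, size=500)

center = normal_points.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(normal_points, rowvar=False))

def mahalanobis(x):
    """Distance from the center, accounting for correlation."""
    d = x - center
    return float(np.sqrt(d @ inv_cov @ d))

typical = np.array([1.0, 1.0])     # yaw and roll high together: consistent
anomalous = np.array([1.0, -1.0])  # each in range, combination unusual

print(f"typical: {mahalanobis(typical):.2f}, "
      f"anomalous: {mahalanobis(anomalous):.2f}")
```

Both points have identical univariate magnitudes, yet the second sits several times farther from the data cloud once the correlation between the variables is taken into account.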
While the depiction of outliers in statistical plots and distributions might tempt some to think of them as peripheral, in fact they are often the small but valuable wheat germ that must be separated from the voluminous chaff. Our course Anomaly Detection walks you through the statistical and machine learning methods that are used to identify outliers: see the course spotlight section.
Acknowledgments: Thanks to the data science practitioners at Elder Research for these examples, in particular John Elder, Wayne Folta, Carl Hoover and Tom Shafer.