False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch’s new feature that detects atrial fibrillation (afib).
Among people with irregular heartbeats, Apple claims a 97% success rate in identifying the condition. This sounds good, but consider all the people who do not have atrial fibrillation. In a test, 20% of the Watch’s afib alarms were not confirmed by an EKG patch. Apple claims that most of those cases did, in fact, have some heartbeat irregularity that required attention, so let’s assume a false positive rate of only 1%.
But very few Apple Watch wearers, a relatively young crowd, have atrial fibrillation. The vast majority do not, and they are the ones at risk for false alarms. Specifically, less than 0.2% of the population under 55 has atrial fibrillation. Consider 1000 people. Two (0.2%) are likely to have afib, and the Apple Watch is almost certain to catch them. Unfortunately, even using the very low false alarm rate estimate of 1%, there will be about 10 false alarms, healthy people whom the Watch signals as having afib. Of the 12 alarms, 10 (83.3%) were false. Put another way, if your watch tells you you are at risk for afib, the probability is roughly 0.83 that it’s a false alarm.
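The arithmetic above is just Bayes' rule applied to counts. A minimal sketch, using the figures assumed in the text (0.2% base rate, 97% sensitivity, 1% false positive rate):

```python
# Positive predictive value (PPV) from base rate, sensitivity, and
# false positive rate -- the hypothetical figures assumed in the text.
base_rate = 0.002        # share of under-55 wearers with afib
sensitivity = 0.97       # P(alarm | afib)
false_pos_rate = 0.01    # P(alarm | no afib)

n = 1000
true_pos = n * base_rate * sensitivity            # about 2 correct alarms
false_pos = n * (1 - base_rate) * false_pos_rate  # about 10 false alarms

ppv = true_pos / (true_pos + false_pos)
print(f"P(false alarm | alarm) = {1 - ppv:.2f}")
```

With the unrounded counts (1.94 true alarms, 9.98 false ones) the false alarm probability comes out near 0.84, consistent with the rough 0.83 obtained by rounding to 2 and 10 alarms.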
The false alarm problem is thus mainly a function of:
- the base rate of the phenomenon of interest, and
- the model's specificity, that is, its accuracy in correctly ruling out the negatives.
If the phenomenon is very rare, then even a very good discriminator will produce many false alarms for each true positive, since true cases are so rare and normal cases so plentiful. And if the model is poor at ruling out negatives, many normal cases will be mislabeled as positives. It is noteworthy that the model’s overall accuracy, which is usually the first performance metric people look at, is not very informative about the problem of false positives.
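To see how strongly the base rate dominates, here is a small sketch that holds sensitivity and the false positive rate fixed at the hypothetical values used above and varies only how rare the condition is:

```python
# Share of alarms that are false, as the base rate of the condition
# varies. Sensitivity and false positive rate are held fixed at the
# hypothetical values used in the text.
sensitivity = 0.97
false_pos_rate = 0.01

results = {}
for base_rate in (0.002, 0.02, 0.2):
    alarms_true = base_rate * sensitivity
    alarms_false = (1 - base_rate) * false_pos_rate
    frac_false = alarms_false / (alarms_true + alarms_false)
    results[base_rate] = frac_false
    print(f"base rate {base_rate:>5.1%}: {frac_false:.0%} of alarms are false")
```

At a 0.2% base rate roughly 84% of the alarms are false; raise the base rate to 20% and only about 4% are, with the same detector.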
What are the consequences of excessive false alarms? In this case, increased anxiety, certainly. Increased costs of additional unnecessary testing. And in a few cases, if a false positive somehow survives the additional tests, increased risk from more invasive testing or treatment: more people undergoing blood-thinning treatment with warfarin (a drug therapy for afib), or heart catheterization for diagnosis.
The problem also crops up in predictive models for identifying financial fraud, malicious activity in networks, employees who are likely to leave, loan defaults, and a host of similar applications. Because these events are relatively rare (out of all the cases under consideration), the predictive model typically uses a discriminator with a low bar: the estimated probability of being a defaulter, violator, fraudster, etc. does not have to be very high for the model to attach a positive label to the person or entity. The model may still be very useful for sorting cases, but a naive user may overestimate the probability that a flagged person really is a fraudster, violator, etc. This can result in poor decisions and can harm individuals who are mistakenly labeled.
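The point about sorting can be illustrated with a tiny simulation. All numbers here are hypothetical: a 1% fraud rate, and a noisy risk score that is merely higher on average for fraudsters than for legitimate cases:

```python
import random

random.seed(0)  # make the simulation reproducible

# Hypothetical population: 1% fraudsters, each case carrying a noisy
# risk score that is higher on average for fraudsters (mean 2.0)
# than for legitimate cases (mean 0.0).
n = 100_000
cases = []
for _ in range(n):
    is_fraud = random.random() < 0.01
    score = random.gauss(2.0 if is_fraud else 0.0, 1.0)
    cases.append((score, is_fraud))

# Sort by score, descending, and inspect the top 1% -- the "alarms".
cases.sort(reverse=True)
top = cases[: n // 100]
precision = sum(is_fraud for _, is_fraud in top) / len(top)
print(f"fraud rate in top 1% of scores: {precision:.0%}")  # far above the 1% base rate
```

Even though most flagged cases are still legitimate, the flagged group is far richer in fraud than the population as a whole, which is what makes the model useful for triage; the mistake is reading the flag as a near-certain identification of fraud.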
Medical societies, insurance companies and public health agencies have been revising guidance about routine screening exams as a result of this false positive problem, resulting in conflicting guidance from different organizations. Take mammograms, for example. The National Comprehensive Cancer Network recommends a mammogram annually starting at age 40 (the traditional guidance), while the U.S. Preventive Services Task Force (a recently appointed panel of experts reporting to the U.S. Dept. of Health and Human Services) recommends a mammogram every two years starting at age 50.
Under-reaction to alarms is also a problem. If there are many alarms and most turn out to be false, as with the boy who cried wolf, they may be ignored. This was a major problem in the early days of using statistical and machine learning algorithms to detect malicious network activity.
The problem of false alarms produced by statistical and machine learning algorithms diminishes over time for two reasons:
- As research proceeds, algorithms get better (often by tweaking, tuning and testing existing algorithms).
- As data accumulates, the algorithms have better information to work with.
Of course, the data must remain accessible – when my airline-affiliated credit card changed banks, the rate of false credit card fraud alarms skyrocketed. Either the new bank’s algorithms needed time to train on the data, or some historical data was unavailable to the new bank, or both.