Recidivism, and the Failure of AUC - Statistics.com: Data Science, Analytics & Statistics Courses

On average, 40% to 50% of convicted criminals in the U.S. go on to commit another crime (“recidivate”) after they are released. For nearly 20 years, court systems have used statistical and machine learning algorithms to predict the probability of recidivism and to guide sentencing decisions, assignment to substance abuse treatment programs, and other aspects of prisoner case management. One of the most popular systems is the COMPAS software from Equivant (formerly Northpointe), which uses 137 variables to predict whether a defendant or prisoner will commit a further crime within two years.

Racial Bias Alleged

In 2016, ProPublica published a critique of COMPAS, which could be summed up in its title:

“Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.”

Specifically, according to ProPublica’s review of experience with COMPAS in Florida (a study of 7,000 defendants, as cited in Science Advances):

Black defendants who did not recidivate were incorrectly predicted to reoffend (False Alarm) at a rate of 44.9%, whereas the same error rate for White defendants was only 23.5%.
White defendants who did recidivate were incorrectly predicted to not reoffend (False Dismissal) at a rate of 47.7%, nearly twice the rate of their Black counterparts at 28.0%.

Bottom line: COMPAS seems to be biased against Black defendants.

COMPAS Defended

Subsequently, a three-person team of researchers (Flores et al.) published a rejoinder that defended the COMPAS system. Their 36-page report delved deep into different theories of test bias, but two key points ran through their analysis:

The “Area Under the Curve” (AUC) was a healthy 0.71 overall, indicating the COMPAS model has good predictive power.
The AUC was about the same for White defendants and Black defendants, indicating no unfairness or bias.

What is AUC?

The curve in “Area Under the Curve” is the Receiver Operating Characteristic (ROC) curve. The more steeply it rises toward the upper-left corner, the better, so the area under the curve has become a common measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s: for example, defendants who re-offend (1’s) and those who don’t (0’s). The ROC curve plots two quantities:

Sensitivity (also called recall in machine learning): The proportion of 1’s (re-offenders) the model correctly identifies (plotted on the y-axis)
Specificity: The proportion of 0’s (non-re-offenders) the model correctly identifies (plotted on the x-axis, in reverse: 1 on the left and 0 on the right)
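
As a minimal sketch, the two quantities can be computed directly from actual outcomes and yes/no predictions. The function name and the toy labels below are invented for illustration; they are not COMPAS data.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for 0/1 outcomes and 0/1 predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # 1's correctly flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # 1's missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # 0's correctly cleared
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # 0's wrongly flagged
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Toy data (hypothetical): 4 re-offenders, 6 non-re-offenders.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Here the model catches 3 of the 4 re-offenders (sensitivity 0.75) and correctly clears 5 of the 6 non-re-offenders (specificity about 0.833).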

Specifically, the model ranks all the records by probability of being a 1, with the most probable 1’s ranked highest. To plot the curve, proceed through the ranked records and, at each record, calculate cumulative sensitivity and specificity to that point. A very well-performing model will catch lots of 1’s before it starts misidentifying 0’s as 1’s; it will climb steeply and hug the upper-left corner of the plot. Misidentifying 0’s as 1’s will shrink the curvature and bring the ROC closer to the straight diagonal line; so will misidentifying 1’s as 0’s.
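
The ranking procedure just described can be sketched in a few lines. This is an illustrative version only: it assumes no tied scores, the function name and toy data are invented, and it reports the x-coordinate as 1 - specificity, which is the same thing as specificity read on a reversed axis.

```python
def roc_points(actual, scores):
    """Walk the records from highest score to lowest and return the
    ROC curve as a list of (1 - specificity, sensitivity) points."""
    n_pos = sum(actual)            # number of 1's
    n_neg = len(actual) - n_pos    # number of 0's
    ranked = sorted(zip(scores, actual), key=lambda r: r[0], reverse=True)

    points = [(0.0, 0.0)]          # nothing flagged yet
    tp = fp = 0
    for _score, label in ranked:
        if label == 1:
            tp += 1                # caught another 1: curve moves up
        else:
            fp += 1                # misidentified a 0: curve moves right
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data (hypothetical): predicted probabilities for six records.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(actual, scores)
for x, y in points:
    print(f"1 - specificity = {x:.2f}, sensitivity = {y:.2f}")
```

Each 1 encountered moves the curve up and each 0 moves it to the right, so a model that ranks most 1’s ahead of the 0’s climbs before it drifts rightward.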

Figure 1: Receiver Operating Characteristics (ROC) curve.

The closer the ROC curve lies to the upper left corner, the closer the AUC is to 1, and the greater the discriminatory power. The diagonal line represents a completely ineffective model, no better than random guessing. It has an AUC of 0.5.
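
One way to make the number concrete: AUC equals the proportion of (1, 0) pairs in which the 1 receives the higher score, with ties counted as half. The sketch below uses that pairwise definition rather than integrating the curve; the function name and toy data are invented for illustration.

```python
def auc(actual, scores):
    """AUC as the share of (1, 0) pairs where the 1 outranks the 0.
    Tied scores count as half a win."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical).
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
model_auc = auc(actual, scores)

# A "model" that gives every record the same score cannot rank at all.
useless_auc = auc(actual, [0.5] * len(actual))
print(f"model AUC = {model_auc:.3f}, uninformative AUC = {useless_auc:.3f}")
```

In this toy case the 1 outranks the 0 in 8 of the 9 pairs, giving an AUC of about 0.889, while the uninformative scores give exactly 0.5, matching the diagonal line.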

AUC is perhaps the most commonly used metric of a model’s discriminatory power.

Resolving the Puzzle – All Errors are Not Equal

How could the bias uncovered by ProPublica arise when the model performs equally well for Black and White defendants, at least according to AUC? The answer is that there are two types of error: (1) predicting a defendant will re-offend when they don’t (a False Alarm), and (2) predicting they won’t re-offend when they do (a False Dismissal). AUC does not distinguish between them; it treats both as generic “errors.”

For Black defendants, COMPAS made more of the first error and fewer of the second. For White defendants, COMPAS made more of the second error and fewer of the first. The two roughly balanced each other out in terms of total errors, resulting in AUCs that were roughly the same for White and Black defendants.
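
A small synthetic illustration shows how two groups can have identical AUCs while experiencing opposite error patterns. All scores below are invented and are not COMPAS data; the assumption is simply that one group’s scores sit two points higher than the other’s, with the same ranking inside each group, and that a single cutoff is applied to both.

```python
def auc(actual, scores):
    """Share of (1, 0) pairs where the 1 outranks the 0 (ties count half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def error_rates(actual, scores, cutoff):
    """Return (false alarm rate, false dismissal rate) when every record
    scoring at or above the cutoff is predicted to be a 1."""
    false_alarms = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= cutoff)
    false_dismissals = sum(1 for a, s in zip(actual, scores) if a == 1 and s < cutoff)
    return false_alarms / actual.count(0), false_dismissals / actual.count(1)


# Hypothetical outcomes and risk scores (same outcomes in both groups).
actual = [1, 1, 1, 1, 0, 0, 0, 0]
group_a_scores = [9, 8, 6, 4, 7, 5, 3, 2]
group_b_scores = [s - 2 for s in group_a_scores]  # same ranking, shifted lower
cutoff = 5  # hypothetical "high risk" cutoff applied to both groups

auc_a = auc(actual, group_a_scores)
auc_b = auc(actual, group_b_scores)
rates_a = error_rates(actual, group_a_scores, cutoff)
rates_b = error_rates(actual, group_b_scores, cutoff)
print(f"Group A: AUC={auc_a:.4f}, false alarm={rates_a[0]:.2f}, false dismissal={rates_a[1]:.2f}")
print(f"Group B: AUC={auc_b:.4f}, false alarm={rates_b[0]:.2f}, false dismissal={rates_b[1]:.2f}")
```

Both groups have an AUC of 0.8125, yet group A has a false alarm rate of 0.50 and a false dismissal rate of 0.25, while group B has the reverse. Because AUC depends only on how records are ranked within each group, it cannot reveal this difference.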

Summary

Assessing model performance with a single numerical metric, AUC, concealed the fact that the model erred in different ways for Black and White defendants, to the great disadvantage of Black defendants. It could be argued that the model is so bad that defendants, or at least Black defendants, might have been better off with no model at all.