Predictim is a service that scans potential babysitters’ social media and other online activity and issues them a score that parents can use to select babysitters. Jeff Chester, the executive director of the Center for Digital Democracy, commented:
There’s a mad rush to seize the power of AI to make all kinds of decisions without ensuring it’s accountable to human beings. People have drunk the digital Kool-Aid and think this is an appropriate way to govern our lives. (Washington Post, 11/24/18)
Does/should AI make decisions? In transformative technologies like self-driving, the answer is unavoidably yes. If a human must remain behind the wheel to make or ratify driving decisions, the goal of self-driving technology is largely unattained. But the attention that Predictim attracted resulted last week in the loss of its automated access (scraping privileges) to the Facebook platform as a source of data.
The Triage Nurse
In many bread-and-butter applications of statistical and machine learning, the proper role of predictive AI is not that of the robot doctor rendering diagnoses and administering treatments, but rather that of the triage nurse.
In the 1790s, a French military surgeon established a systematic categorization of military casualties termed triage (from the French trier, to separate). Those for whom immediate treatment was both critical and likely to be beneficial received priority. Those whose condition was not so urgent, and those whose condition was so grave that they were unlikely to benefit from treatment, had lower priority.
President Obama once described the unremitting intensity of Presidential decision-making this way:
The only things that land on my desk are tough decisions. Because, if they were easy decisions, somebody down the food chain’s already made them.
This is where machine learning and AI should be leading us – not taking all our decision-making jobs away from us, or even the important ones, just the easy and routine ones.
The Ranking of Records
Just like nurses, predictive models perform triage, ranking records according to their probability of being of interest, and allowing humans to make determinations for a very limited set of records. The sorting could happen in two ways. Consider review of tax returns, where the tax authority has the capacity to audit a certain number of returns per year. A statistical or machine learning predictive algorithm sorts them according to probability of requiring an audit, and then either
1. Humans review all the returns that score high enough and decide whether to refer them for audit, or
2. The very top-scoring returns are auto-referred to audit, then humans review a lower-scoring tier and decide whether to refer those to audit.
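The second scheme above can be sketched in a few lines. This is an illustrative sketch only; the return IDs, scores, and cutoff values are made-up assumptions, not any tax authority's actual rules.

```python
# Two-tier triage: the model ranks returns by probability of needing an
# audit; the top tier is auto-referred, the middle tier goes to humans.
# Cutoffs (0.9 and 0.6) are arbitrary assumptions for illustration.

def triage(scored_returns, auto_cutoff=0.9, review_cutoff=0.6):
    """Split (return_id, probability) pairs into three tiers."""
    ranked = sorted(scored_returns, key=lambda r: r[1], reverse=True)
    auto   = [rid for rid, p in ranked if p >= auto_cutoff]
    review = [rid for rid, p in ranked if review_cutoff <= p < auto_cutoff]
    ignore = [rid for rid, p in ranked if p < review_cutoff]
    return auto, review, ignore

returns = [("A", 0.95), ("B", 0.72), ("C", 0.40), ("D", 0.88), ("E", 0.10)]
auto, review, ignore = triage(returns)
print(auto)    # ['A']        auto-referred to audit
print(review)  # ['D', 'B']   humans decide on these
print(ignore)  # ['C', 'E']   no action
```

The point is that the human workload is confined to the middle tier; the model's job is only to decide who lands there.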
The fact that the model’s goal is ranking, rather than binary prediction, has important implications when it comes to assessing predictive models on their performance. Accuracy (the percent of records correctly classified) may not be appropriate – particularly when the percentage of records that are of interest is low. In this rare-class situation, models can attain high accuracy scores simply by classifying everyone as belonging to the dominant class.
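A tiny numerical example (the 2% rate is a made-up illustration) shows how hollow that accuracy score is:

```python
# Why accuracy misleads with rare classes: a "model" that flags nothing
# is 98% accurate when only 2% of records are of interest.

actual = [1] * 2 + [0] * 98   # 2 records of interest out of 100
predicted = [0] * 100         # classify everyone as the dominant class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.98 -- yet it found none of the records of interest
```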
A common metric is area under the curve, or AUC. The curve in question is the Receiver Operating Characteristic (ROC) curve. The area under this curve is a measure of how well a model discriminates between two classes of records – a 1 indicates perfect discrimination, and a 0.5 indicates no better than random guessing. See the Word of the Week item on ROC curves for more detail.
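AUC also has a direct probabilistic reading: it is the chance that a randomly chosen record of interest is ranked above a randomly chosen record that is not (ties count half). A sketch computing it that way, with illustrative made-up scores:

```python
# AUC from its rank interpretation: the fraction of positive-negative
# pairs in which the positive record gets the higher score.

def auc(labels, scores):
    pairs, wins = 0, 0.0
    for li, si in zip(labels, scores):
        if li != 1:
            continue
        for lj, sj in zip(labels, scores):
            if lj != 0:
                continue
            pairs += 1
            if si > sj:
                wins += 1.0
            elif si == sj:
                wins += 0.5   # ties count half
    return wins / pairs

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(auc(labels, scores))  # 5 of 6 pairs ranked correctly -> ~0.833
```

This pairwise view makes explicit that AUC rewards ranking, not the raw score values themselves.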
The AUC measures the performance of the model over the entire dataset that was modeled. Often, one is more interested in how well the model does with a smaller subset of records, specifically the top-ranked records. For example, how well did a model do with the top 10% of tax returns judged most likely to be fraudulent?
For this, modelers use the concept of lift: the cumulative or segment-wise improvement one gets from using the model instead of choosing randomly when searching for the records of interest. For example, a lift of 100% in the top decile means that you are twice as likely to find a record of interest in the model’s top-ranked decile as you would be choosing randomly. Lift comes from the early days of predictive modeling for direct mail: direct mailers usually face low response rates and need a tool that lets them select only the most likely responders.
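The top-decile calculation works out like this. The data below is fabricated for illustration: 100 records, 10 of interest, and a model whose top decile contains 2 of them.

```python
# Top-decile lift: rate of records of interest in the model's top 10%,
# divided by the overall rate. Lift of 2.0 = "twice as likely" = 100% lift.

def top_decile_lift(labels, scores):
    ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]
    n_top = max(1, len(ranked) // 10)
    top_rate = sum(ranked[:n_top]) / n_top          # hit rate in top decile
    overall_rate = sum(labels) / len(labels)        # hit rate overall
    return top_rate / overall_rate

labels = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 82     # 10 of interest in 100
scores = [i / 100 for i in range(100, 0, -1)]       # scores follow list order
print(top_decile_lift(labels, scores))  # 2.0 -- top decile is twice as rich
```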
Triage and the Sunny Future of AI
AI’s role in taking over the routine and repetitive information-based tasks has the potential to enrich working lives by operating via triage, rather than full auto-decision-making. Jobs will shift towards the more challenging and interesting ones, the supply of which will increase as the economy shifts in response to the unlocking of human creativity. This is the case that Jeff Bezos made in explaining why he was not worried about AI taking away jobs.
The one potential landmine in this scenario is the one planted by the natural human instinct for making money.
Ethics in Data Science
Predictim knows that its risk-scoring of babysitters is imperfect. But it also knows that parents aren’t able to weigh the nuances of statistical estimates; all they have is a single score. Predictim also knows that the mystery surrounding AI helps sell the product; it doesn’t even need to over-hype it.
The ethical data scientist would cloak such a product in sufficient warnings that it could not be misused, or perhaps would not sell a product like this at all. The commercial data scientist offers up the babysitter score, cloaked in the mystique of artificial intelligence. If the consumer invests it with more meaning than it really has, well, caveat emptor.