Anomaly Detection via Conversation: "How was your vacation?"

A friendly query about your holiday might be a question you get from a roaming agent in the check-in area at the Tel Aviv airport. Israel, considered to have the most effective airport security in the world, does not rely solely on routine mechanical screening of passengers and baggage by low-paid workers. It also uses focused professional engagement and conversation with passengers by agents who are not desk-bound. Some interview passengers who approach the check-in counter, asking how your trip was, where you are from, who is traveling with you, all with a demeanor intended to engage the traveler – perhaps that of a guest you’ve just met at a party, perhaps more challenging. Others roam the terminal, looking for people whose behavior seems out of place.

This is mostly an “unsupervised anomaly detection” process with the human brain at the center. Tone of voice, degree of eye contact, and content of the conversation are all features. So are things like religion, national origin and other factors that might not be permissible in the U.S. Judgements about who might pose a potential threat involve elements of intuition, and the rationale might not be easily articulated. Could this process be translated to a machine learning process?

Early stages of the process, based on some recognizable and easily quantifiable data (country of origin, date of passport issuance, family group or not, prior travel, originating point of travel), probably could be readily translated into a first stage review. A complex deep learning algorithm that integrates image and voice recognition could take it further. To go beyond that requires translating gut feeling and instinct into machine learning. More on this below; first let’s review how statistics and machine learning contribute to anomaly detection.

Anomaly (outlier) detection methods date back over a century and a quarter; F.Y. Edgeworth’s 1887 article “On Discordant Observations” (in Philosophical Magazine) is a common early citation. Some other examples of its use:

- Detecting cyber attacks, network hacks or intrusions
- Fraud detection (credit card, insurance, health care billing)
- Surveillance
- Monitoring of sensor data

Implicit in the concept of anomaly detection is that the thing we want to detect is rare. If it is common, then, although it might be bad, it is not an anomaly (outlier) and this makes some statistical methods inapplicable. There are several approaches to anomaly detection:

Supervised learning. If we have confirmed cases of the anomaly we are looking for, we can use supervised learning methods. Data in which we have both anomalies and normal cases constitute the training data for a classification model using any of a number of methods: logistic regression, decision trees, nearest-neighbors, naive Bayes, etc. Some examples of this scenario:

- Confirmed fraudulent credit card transactions, coupled with a much larger set of normal transactions
- MRI images of tumors that were subsequently determined to be malignant, coupled with images of tumors that were subsequently determined to be benign
- Insurance claims found to be fraudulent, coupled with claims that are legitimate

More often, we may not have many, or any, confirmed cases of the thing we are looking for. A new insurance fraud may present a different profile from the prior known cases. A new organization will lack the history needed for the training data: when American Airlines switched its credit cards to a new bank (probably for better terms), the jilted bank did not provide the full purchase history to the new bank, and customers had to endure fraud alerts at a much greater rate, until the new bank accumulated more data. In this case, semi-supervised or unsupervised methods must be used until sufficient experience with true frauds is accumulated.

Semi-supervised learning. Even if we lack confirmed cases of the anomaly we are looking for, we may have a body of confirmed non-anomalous cases, and can ask whether unknown cases look like the non-anomalous ones. This might be true where the effort required to confirm anomaly/non-anomaly is considerable, and an organization may have many unknown cases.

For example, a new insurance company, or an insurance company launching a new product, or a bank purchasing a portfolio of loans can accumulate, through investigation, knowledge of “good” cases. Due to the low fraud rate, and the greater investigational effort required, however, information on “bad” cases lags that on “good” cases. In such cases, a predictive model, as in the supervised approach described earlier, can be trained on data in which the labels are “good” and “unknown.” The predicted probability of being “good” can be used as a score to determine which cases are referred for deeper investigation (lowest probabilities get investigated first).
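The good/unknown scoring described above can be sketched in a few lines. This is a minimal illustration, not a production method: the single feature (claim amount as a multiple of a typical claim), the data values, and the function names are all hypothetical, and a small hand-written logistic regression stands in for whatever classifier an organization would actually use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by gradient ascent on the mean log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

# Hypothetical feature: claim amount as a multiple of the typical claim.
good = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]   # investigated and confirmed legitimate
unknown = [1.0, 1.1, 0.9, 5.0]          # not yet investigated

xs = good + unknown
ys = [1] * len(good) + [0] * len(unknown)   # 1 = "good", 0 = "unknown"
a, b = fit_logistic(xs, ys)

# Score each unknown case by its predicted probability of being "good";
# the lowest probabilities are referred for investigation first.
p_good = [sigmoid(a + b * x) for x in unknown]
referral_order = sorted(range(len(unknown)), key=lambda i: p_good[i])
print(referral_order)
```

The unknown case that least resembles the confirmed “good” cases lands at the front of the referral queue. With real data there would be many features, and most “unknown” cases would in fact be good, so the scores are best read as a ranking rather than as calibrated probabilities.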

As time goes by and the organization (unfortunately) gains some knowledge of “bad” cases, modeling accuracy will be improved as labeling shifts from good/unknown to good/bad. The semi-supervised approach can transition to a fully supervised approach.

Unsupervised learning. In other situations, we lack confirmed knowledge of good/bad almost completely. Airline passengers that constitute a known terrorist threat are almost non-existent. Or we may have known labeled “bads” but also be aware that there are unknown bads very different from the labeled ones. For example, network intrusion detection systems can avail themselves of information about known threats and use very specific attributes of traffic behavior to identify them and alert operators. But what about unknown or emerging threats?

To be able to catch threats that we don’t already know about, a predictive model trained on existing threats may be counter-productive – missing the new threats that don’t resemble the old ones, and lending a false sense of security. A second model to catch the new threats cannot rely on learning from cases labeled as threats or cases labeled as non-threats since those labels are unavailable for new unknown threats. The model must rely on identifying cases that are different from the majority of cases.

The most popular unsupervised learning methods rely on identifying records that are distant from other records, i.e. nearest neighbor algorithms or clustering methods, both of which have at their heart some calculation of nearness between records. There are a number of ways to measure distance between records for both continuous data and categorical data. For continuous data, Euclidean distance is popular, as is a correlation measure. For categorical data, metrics that reflect the degree to which category values match are used. For mixed data, Gower’s similarity measure treats each variable separately with an appropriate metric (for continuous data or categorical data), then averages them. Having calculated inter-record distances, anomaly detection might filter for records that are far from all other records, for records that are far from the nearest cluster, for tiny clusters that are distinct from other clusters, among other criteria.

Interestingly, anomaly detection bears some resemblance to the use of control charts in statistical process control (SPC), a method that was introduced 95 years ago! A control chart tracks a metric from some process (originally, a manufacturing process) and triggers an alert when that metric is too far from the mean. “Too far” is a statistically determined (yet still arbitrary) limit, set to keep workers from tinkering unnecessarily with a variable yet stable process. Thomas P. Ryan’s Statistical Methods for Quality Improvement (Wiley, 1989, p. 160) gives an example illustrating these upper and lower control limits (UCL, LCL).
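Independently of Ryan’s example, the control-limit idea can be sketched as follows. The measurements are made up, the limits are set at the conventional three standard deviations from the mean (an assumption, since the text does not specify a multiplier), and sigma is estimated with the plain sample standard deviation as a simplification.

```python
import statistics

def control_limits(baseline, n_sigma=3.0):
    """Return (LCL, UCL) as mean -/+ n_sigma sample standard deviations."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return center - n_sigma * sigma, center + n_sigma * sigma

def out_of_control(values, lcl, ucl):
    """Indices of observations that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Hypothetical measurements from a period when the process was stable
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
lcl, ucl = control_limits(baseline)

# New observations to monitor against those limits
new_values = [10.05, 9.95, 10.6, 10.0, 9.5]
flagged = out_of_control(new_values, lcl, ucl)
print(f"LCL={lcl:.3f}, UCL={ucl:.3f}, flagged={flagged}")
```

Observations inside the limits are treated as ordinary variation and left alone; only those outside trigger an alert. This is the same logic as flagging records that are “far” from the rest.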

Back to the problem of replacing the final stage of the human interviews, and the question of whether this can be reduced to machine learning. This is the beyond-routine-chat stage where skilled interrogation techniques are used and the passenger’s eye movements and body language come into play. At what point does the agent decide the passenger is worthy of a much more exhaustive investigation? The agent may not be able to articulate the reasons in such a way that the relevant factors can be quantified. But deep learning, you might retort, can uncover and use relevant features even if we don’t explicitly see them. Perhaps deep learning could replicate even a thought process that we can’t explain.

We are very far from seeing a machine learning takeover of airport security. Even if an algorithm could capture our unconscious thoughts, a traveler’s interaction with and reaction to a “robot” interrogation would be so different from their behavior during a personal one as to render the latter ineffective as training for the former. Even a personal interview that is overtly rules-based may yield different behavior than a free-flowing conversation.

Actually, the inability of machine learning to go this “last mile” is a good thing. It is what helps preserve the benign view of machine learning and AI as our servant and not our master. Relegating machine learning to the routine, repetitive and boring tasks is to be cheered.

Israeli airport security https://www.huffpost.com/entry/what-israeli-airport-secu_b_4978149

https://www.intechopen.com/online-first/anomaly-based-intrusion-detection-system