Entity Resolution and Identifying Bad Guys

Earlier, we described how Jen Golbeck (who teaches Network Analysis at Statistics.com) analyzed Facebook connections to identify fake accounts (the account holder’s friends all had the same number of friends, which is highly improbable statistically). Network analysis and the study of connections lie at the heart of entity resolution.

To a sales and marketing person, entity resolution is the analytic process used to match and merge multiple customer records that refer to the same person. For example, if Joseph Smith orders something using his Gmail address, and is already in the company CRM as Joe Smith with a Yahoo address, entity resolution is the process that automatically flags this likely duplicate for possible merging of the records. A credit rating agency uses entity resolution to search for individuals with profiles similar to a particular subject; these individuals might be the same person, in which case their credit information would be relevant.
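The Joe Smith / Joseph Smith case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nickname table, the record fields, the example email addresses, and the 0.9 threshold are all hypothetical, and string similarity from the standard library's `difflib` stands in for whatever matching logic a production CRM would use.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real system would need a far larger one.
NICKNAMES = {"joe": "joseph", "bill": "william", "liz": "elizabeth"}

def normalize_name(name: str) -> str:
    """Lowercase the name and expand known nicknames token by token."""
    tokens = name.lower().split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

def likely_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.9) -> bool:
    """Flag two records for possible merging when their names nearly match."""
    return name_similarity(rec_a["name"], rec_b["name"]) >= threshold

# Made-up records for illustration only.
crm_record = {"name": "Joe Smith", "email": "jsmith@yahoo.example"}
new_order = {"name": "Joseph Smith", "email": "joseph.smith@gmail.example"}
other = {"name": "Maria Lopez", "email": "mlopez@mail.example"}

print(likely_duplicate(crm_record, new_order))  # the two Smith records
print(likely_duplicate(crm_record, other))      # an unrelated customer
```

Note that the sketch only flags the pair; the decision to merge is left to a person or a later step, as the paragraph above describes.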

For Jeff Jonas, Founder of Senzing, entity resolution is a key part of his company’s mission to identify bad guys by connecting disparate information from various sources. Is Omar Ramsey, who is applying for a U.S. visa in Germany, the same person as Ramzi Omar, also known as Ramzi bin al-Shibh, who was denied a U.S. visa in Yemen?
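One small piece of this kind of question is that names may appear in a different order and with variant spellings. The sketch below is not Senzing's method; it is a simple illustration, assuming plain character similarity from Python's `difflib`, of how sorting the name tokens before comparing makes "Omar Ramsey" and "Ramzi Omar" look much closer than a naive comparison does. The function names are hypothetical.

```python
from difflib import SequenceMatcher

def raw_similarity(a: str, b: str) -> float:
    """Character similarity of the two names exactly as written."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def order_insensitive_similarity(a: str, b: str) -> float:
    """Character similarity after sorting each name's tokens alphabetically,
    so that 'given family' and 'family given' orderings compare equal."""
    key_a = " ".join(sorted(a.lower().split()))
    key_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, key_a, key_b).ratio()

visa_applicant = "Omar Ramsey"
prior_denial = "Ramzi Omar"

print(raw_similarity(visa_applicant, prior_denial))
print(order_insensitive_similarity(visa_applicant, prior_denial))
```

A higher score here is only a reason to look more closely; confirming that two names refer to one person would need other evidence such as dates, documents, or addresses.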

Jonas notes that the scale of the data that can be brought to bear on such questions is huge, and would surprise the average citizen. As he puts it, geospatial data is superfood for analytics. For example, the average cell phone user in the US generates over 2000 geolocation data points per day, and much of this data is publicly available (marketers use it). Supposedly it is depersonalized, with names replaced by unique phone identifiers, but Jonas points out that if you know something about a person’s habits (where they work and where they live, for example), it may be possible to identify which of the depersonalized phone records belongs to that person. That unlocks other potentially interesting information, such as where the person was last Thursday night, or who they spend time with. Jonas’s lucid and entertaining lectures on this topic are well worth a watch. He is a masterful storyteller, and avoids the common trap in data science talks of glossing over the foundational parts of the story in order to get to the technical stuff.
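The re-identification idea can be made concrete with a toy example. The sketch below assumes a hypothetical data layout (each device ID maps to a list of latitude, longitude, and hour-of-day readings), made-up coordinates, and arbitrary choices for the search radius and for what counts as night and working hours. It simply looks for devices seen near a known home at night and near a known workplace during the day.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def candidate_devices(traces, home, work, radius_km=0.2):
    """Return device IDs seen near `home` at night and near `work` in the daytime."""
    matches = []
    for device_id, points in traces.items():
        near_home = any(
            haversine_km(lat, lon, *home) <= radius_km and (hour >= 22 or hour < 6)
            for lat, lon, hour in points
        )
        near_work = any(
            haversine_km(lat, lon, *work) <= radius_km and 9 <= hour < 17
            for lat, lon, hour in points
        )
        if near_home and near_work:
            matches.append(device_id)
    return matches

# Made-up coordinates and readings: (latitude, longitude, hour of day).
home = (38.90, -77.03)
work = (38.95, -77.10)
traces = {
    "device-A": [(38.9001, -77.0301, 23), (38.9499, -77.1001, 10)],
    "device-B": [(38.9001, -77.0301, 23), (39.20, -76.80, 10)],
    "device-C": [(38.70, -77.50, 2), (38.9499, -77.1001, 14)],
}

print(candidate_devices(traces, home, work))
```

With only two known locations the candidate list may still contain several devices; each additional known habit narrows it further.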

But how do you extract interesting and useful information out of the huge stock of data? The answer is: you don’t. At least not in the common sense of extracting information by asking questions and connecting dots. Rather, you let the algorithm report interesting findings in real time as they happen. Intelligence operations will pre-populate the algorithm with quasi-rules, for example, one that flags a call made by a person tagged as being of interest to a phone number known to be used by terrorists. You might also use anomaly detection methods to alert you to unusual activity of a novel nature.
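A quasi-rule of the kind just described might look like the following minimal sketch. The event format, the names, and the phone numbers are all invented for illustration; the point is only that the rule is evaluated against each event as it arrives, rather than by querying the stored data afterward.

```python
from typing import Iterable, Iterator

def watch_calls(
    events: Iterable[dict],
    persons_of_interest: set,
    flagged_numbers: set,
) -> Iterator[str]:
    """Yield an alert as soon as a person of interest calls a flagged number."""
    for event in events:  # events are consumed one at a time, as a stream
        if event["caller"] in persons_of_interest and event["callee"] in flagged_numbers:
            yield f"ALERT: {event['caller']} called flagged number {event['callee']}"

# Hypothetical watch lists and call events.
persons_of_interest = {"person-17"}
flagged_numbers = {"555-0100"}
call_stream = [
    {"caller": "person-03", "callee": "555-0100"},  # caller not of interest
    {"caller": "person-17", "callee": "555-0199"},  # number not flagged
    {"caller": "person-17", "callee": "555-0100"},  # matches the rule
]

for alert in watch_calls(call_stream, persons_of_interest, flagged_numbers):
    print(alert)
```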

Of course, given the size of today’s datasets, apparently suspicious or anomalous activity will happen all the time. So another feature of the algorithm will be a sliding threshold of connectedness, unusualness or threat that must be reached before humans are alerted. This threshold can be adjusted for an appropriate balance of false positives, false negatives, and the human hours that are available to review the alerts.
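One way to picture the sliding threshold is to tie it to reviewer capacity. The sketch below assumes each event already carries a hypothetical suspicion score between 0 and 1, and picks the lowest candidate threshold whose alert volume fits the number of alerts the human team can review. The scores, candidate thresholds, and capacity figures are made up.

```python
def alerts_above(scores, threshold):
    """Scores that would trigger a human alert at this threshold."""
    return [s for s in scores if s >= threshold]

def lowest_affordable_threshold(scores, candidate_thresholds, max_alerts):
    """Pick the lowest threshold whose alert count fits the review budget.

    Returns None if even the highest candidate produces too many alerts.
    """
    for t in sorted(candidate_thresholds):
        if len(alerts_above(scores, t)) <= max_alerts:
            return t
    return None

# Made-up suspicion scores for a batch of events.
scores = [0.10, 0.35, 0.40, 0.62, 0.71, 0.88, 0.93, 0.97]
candidates = [0.3, 0.5, 0.7, 0.9]

print(lowest_affordable_threshold(scores, candidates, max_alerts=4))
print(lowest_affordable_threshold(scores, candidates, max_alerts=3))
```

Lowering the threshold catches more true threats at the price of more false positives and more review hours; raising it does the reverse. A fuller treatment would also estimate false negatives from labeled historical cases, which this sketch does not attempt.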

One interesting aspect of this is that the bigger the data, the faster and better the algorithm will work, provided it has been trained on similar data over time. Google search is a good illustration of how big data unlocks capabilities previously unavailable. If you had searched for “optimal aardvard diet” when Google was just starting out, you would not have gotten promising results. Now, with two decades of data under its belt, Google not only knows where to find advice on what aardvarks eat, but it also knows that you misspelled aardvark and corrects you. Why it’s not only better but faster is a more subtle point, one that Jonas illustrates with a jigsaw puzzle. When you start out with a single piece in your hand, finding a connecting piece is an extremely time-consuming process. Once you have a bunch of connected pieces, matching up the next one is faster.

This big data efficiency of scale does not hold true for another important component of entity resolution: fixing prior mistakes. As each new piece of information arrives, it might reveal something about prior connections that revises them. For example, suppose you had a connection between Albert Gonzalez and Omar Ramsey by virtue of a shared current address, and then a death certificate for Albert Gonzalez, at this address, turns up. Maybe there is no connection between Ramsey and Gonzalez, but rather Ramsey is using the address for a fake identity. Checking all prior connections for possible revision in the face of new information is computationally intensive, and the time required increases as the size of the data increases.
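The cost of revisiting old conclusions can be seen in a deliberately naive sketch. It assumes a hypothetical list of stored links, each with the evidence that created it, and re-examines every link when a death record arrives. The link fields, the status labels, and the third name are invented, and a real system would use indexes rather than a full scan; the full scan is used here to show why the work grows with the amount of stored data.

```python
# Hypothetical store of previously asserted connections.
links = [
    {"a": "Albert Gonzalez", "b": "Omar Ramsey",
     "evidence": "shared current address", "status": "connected"},
    {"a": "Omar Ramsey", "b": "Person X",
     "evidence": "shared phone number", "status": "connected"},
]

def apply_death_record(links, deceased):
    """Mark links for review if they rely on the deceased's *current* address.

    Returns the number of links whose status was revised.
    """
    revised = 0
    for link in links:  # full scan: time grows with the number of stored links
        involves_deceased = deceased in (link["a"], link["b"])
        if involves_deceased and link["evidence"] == "shared current address":
            link["status"] = "needs review"
            revised += 1
    return revised

print(apply_death_record(links, "Albert Gonzalez"))
print([link["status"] for link in links])
```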