Ethical Practice in Data Mining

Prior to the advent of internet-connected devices (the “Internet of Things”), the largest source of big data was people’s public interaction on the internet. Social media users, as well as shoppers and searchers on the internet, make an implicit deal with the big companies that provide these services: users can take advantage of powerful search, shopping, and social interaction tools for free, and, in return, the companies get access to user data.

More and more news stories appear concerning illegal, fraudulent, unsavory, or simply controversial uses of data science. Many people are unaware of just how much detailed data on their personal lives is collected, shared, and sold. A writer for The Guardian newspaper downloaded his own Facebook data and found that it came to over 600 megabytes – a vivid and detailed portrait of his life. Any one of dozens of apps on your smartphone will collect a continuous stream of your location data, keeping tabs on where you go.

Financial and medical data have long been covered by both industry standards and government regulation to protect privacy and security. While academic research using human subjects’ data has been strictly regulated in most developed countries, the collection, storage, and use of personal data in industry has historically faced much less regulatory scrutiny. But concern has escalated beyond the basic issues of protecting against theft and unauthorized disclosure of personal information to the data mining and analytics methods that underlie the harvesting and use of such data. In credit scoring, for example, industry regulations require that any deployed credit scoring model be highly repeatable, transparent, and auditable. In 2018, the European Union introduced its first EU-wide data regulation, the General Data Protection Regulation (GDPR). The GDPR limits and restricts the use and storage of personal data by companies and organizations operating in the EU and abroad, insofar as these organizations “monitor the behavior” of or “offer goods or services” to EU-residing data subjects. The GDPR thereby has the potential to affect any organization processing the personal data of EU-based data subjects, regardless of where the processing occurs. “Processing” includes any operation on the data, including pre-processing and the use of data mining algorithms.

The story of Cambridge University Professor Alexander Kogan is a cautionary tale. Kogan helped develop a Facebook app, “This is Your Digital Life,” which collected the answers to quiz questions, as well as user data from Facebook. The purported purpose was a research project on online personality. Although fewer than 270,000 people downloaded the app, it was able to access (via friend connections) data on 87 million users. This friend-connection reach was of great value to the political consulting company Cambridge Analytica, which used the app’s data in its political targeting efforts and ended up doing work for the Trump 2016 campaign, as well as the 2016 Brexit campaign in the UK.

Great controversy ensued: Facebook’s CEO, Mark Zuckerberg, was called to account before the U.S. Congress, and Cambridge Analytica was eventually forced into bankruptcy. In an interview with Lesley Stahl on the 60 Minutes television program, Kogan contended that he was an innocent researcher who had fallen into deep water, ethically:

“You know, I was kinda acting, honestly, quite naively. I thought we were doing everything okay…. I ran this lab that studied happiness and kindness.”

How did analytics play a role? Facebook’s sophisticated algorithms on network relationships allowed the Kogan app to reach beyond its user base and get data from millions of users who had never used the app.
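
To get a sense of the multiplier that friend connections create, a rough back-of-the-envelope calculation helps. The sketch below is purely illustrative: the average-friends and overlap figures are assumptions chosen to show the scale of the effect, not Facebook’s actual numbers.

```python
# Illustrative back-of-the-envelope sketch (assumed figures, not Facebook data):
# if an app can read the profiles of each installer's friends, a modest number
# of installs translates into enormous reach.

n_installers = 270_000      # approximate number of installs cited above
avg_friends = 340           # assumed average friend count per installer
unique_share = 0.95         # assumed fraction of friend links that are not duplicates

reach = n_installers * avg_friends * unique_share
print(f"{n_installers:,} installers -> roughly {reach:,.0f} exposed profiles")
# Prints roughly 87 million profiles, the order of magnitude reported in the
# Cambridge Analytica case, despite fewer than 300,000 direct installs.
```

Under these assumptions, each install exposes hundreds of additional profiles, which is why the app’s reach dwarfed its user base.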

In the legal sphere, organizations must be aware of government, contractual, and industry “best-practice” regulations and obligations. But the analytics professional and data scientist must think beyond the codified rules and consider how the predictive technologies they are working on might ultimately be used. Here is just a sampling of recent concerns:

  1. Big Brother Watching: Law enforcement and state security analysts can use personal demographic and location information for legitimate purposes, e.g., collecting information on associates of a known terrorist. But what is considered “legitimate” may be very different in Germany, the U.S., China, and North Korea. At the time of this writing, an active issue in the U.S. and Europe was the fear that the Chinese company Huawei’s control of new 5G network standards would compromise Western security. Predictive modeling and network analysis are widely used in surveillance, whether for good or ill. At the same time, companies are also using similar privacy-invading technologies for commercial gain. Google was recently fined 50 million euros by French regulators for “[depriving] the users of essential guarantees regarding processing operations that can reveal important parts of their private life” (Washington Post, Jan 21, 2019).
  2. Automated Weapon Targeting and Use: Armed drones play an increasing role in war, and image recognition is used to facilitate navigation of flight paths, as well as to help identify and suggest targets. Should this capability also be allowed to “pull the trigger,” say, when an object appears to match an intended target or class of targets? The leading actors in automated targeting may say now that they draw a red line at turning decision-making over to the machine, but the technology has a dynamic of its own. Adversaries and other actors may not draw such a line. Moreover, in an arms race, one or more parties are likely to conclude that others will not draw such a line, and will seek to develop this capability proactively. Automated weapon targeting is an extreme case of the automated decision-making made possible by data mining and analytics.
  3. Bias and Discrimination: Automated decision making based on predictive models is now becoming pervasive in many aspects of our daily lives, from credit card transaction alerts and airport security detentions, through college admissions and resume screening for employment, to predictive policing and recidivism scores used by judges. Cathy O’Neil describes multiple cases where such automated decisions become “weapons of math destruction,” defined by opacity, scale, and damage (O’Neil, 2016). Predictive models in general facilitate these automated decisions; deep-learning-style neural networks are the standard tool for image and voice recognition. When algorithms are trained on data that contain human bias, the algorithms learn the bias, thereby perpetuating, expanding, and magnifying it (a minimal sketch of a fairness audit that checks for such bias appears after this list). The ProPublica group has been publishing articles about various such discrimination cases. Governments have already started trying to address this issue: New York City passed the U.S.’s first algorithmic accountability law in 2017, creating a task force to monitor the fairness and validity of algorithms used by municipal agencies. In February 2019, legislators in Washington State held a hearing on an algorithmic accountability bill that would establish guidelines for the procurement and use of automated decision systems in government “in order to protect consumers, improve transparency, and create more market predictability.” As for company use of automated decision making, the GDPR allows EU data subjects to opt out of automated decisions.
  4. Internet conspiracy theories, rumors, and “lynch” mobs: Before the internet, the growth and dissemination of false rumors and fake conspiracy theories was naturally limited: “broadcast” media had gatekeepers, and one-to-one communications, though unrestricted, were too slow to support much more than local gossip. Social media not only removed the gatekeepers and expanded the reach of the individual, but also provided interfaces and algorithms that add fuel to the fire. The messaging application WhatsApp, via its message-forwarding facility, was instrumental in sparking instant lynch mobs in India in 2018, as false rumors about child abductors spread rapidly. Fake Facebook and Twitter accounts, most notably under Russian control, helped create and spread divisive and destabilizing messaging in Western democracies with the goal of affecting election outcomes. A key element in fostering the rapid viral spread of false and damaging messaging is the set of recommendation algorithms used in social media – these are trained to show you “news” that you are most likely to be interested in. And what interests people, and makes them click “like” and forward, is the car wreck, not the free flow of traffic.
  5. Psychological manipulation and distraction: A notable aspect of the Cambridge Analytica controversy was the company’s claim that it could use “psychographic profiling” based on Facebook data to manipulate behavior. There is some evidence that such psychological manipulation is possible: Matz et al. (in their 2017 PNAS paper “Psychological targeting as an effective approach to digital mass persuasion”) found that “In three field experiments that reached over 3.5 million individuals with psychologically tailored advertising, we find that matching the content of persuasive appeals to individuals’ psychological characteristics significantly altered their behavior as measured by clicks and purchases.” But prior psychological profiles are not needed to manipulate behavior. Facebook advisers embedded with the U.S. presidential campaign of Donald Trump guided an extremely complex and continuous system of microtargeted experimentation to determine which display and engagement variables were most effective with specific voters. Predictive algorithms, and the statistical principles of experimentation, are central to these automated algorithmic efforts to affect individual behavior. At a more basic level, there is concern that these continuous engagement algorithms can contribute to “digital addiction” and a reduced ability to concentrate. Paul Lewis, writing in The Guardian, says algorithms like these contribute to a state of “‘continuous partial attention’, severely limiting people’s ability to focus.” Some of the most powerful critiques have come from those with inside knowledge. An early Facebook investor, Roger McNamee, writes in his book “Zucked” that Facebook is “terrible for America.” Chamath Palihapitiya, formerly Facebook’s VP for user growth, now says Facebook is “destroying how society works.”
  6. Data brokers and unintended uses of personal data: Reacting to a growing U.S. crisis of opioid addiction, some actuarial firms and data brokers are providing doctors with “opioid addiction risk scores” for their patients, to guide them in offering patients appropriate medications at appropriate doses and durations, with appropriate instructions. It sounds constructive, but what happens when the data also gets to insurers (which it will), employers, and government agencies? For example, could an individual be denied a security clearance due to a high score? Individuals themselves contribute to the supply of health data when they use health apps that track personal behavior and medical information. The ability to sell such data is often essential for the app developer – a review of apps that track menstrual cycles “found that most rely on the production and analysis of data for financial sustainability.”
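
One concrete way to check whether an automated decision system is reproducing human bias (see item 3 above) is to compare its error rates across groups. The sketch below is a minimal, purely illustrative audit: the group labels, decisions, and outcomes are synthetic, invented only to show the mechanics of the comparison.

```python
# Minimal fairness-audit sketch on synthetic data (records invented for
# illustration): compare the false positive rate of an automated yes/no
# decision across two groups. Real audits, such as ProPublica's recidivism
# analysis, apply the same idea to actual decisions and outcomes.

import pandas as pd

df = pd.DataFrame({
    "group":   ["A"] * 6 + ["B"] * 6,                   # protected attribute (hypothetical)
    "flagged": [1, 1, 0, 1, 0, 0,  1, 0, 0, 0, 1, 0],   # the model's decision
    "actual":  [1, 0, 0, 1, 0, 1,  1, 0, 0, 0, 1, 1],   # the true outcome
})

# False positive rate per group: among cases whose true outcome was negative,
# what fraction did the model nonetheless flag?
fpr = df[df["actual"] == 0].groupby("group")["flagged"].mean()
print(fpr)
# A substantial gap between groups is one signal that the model has learned,
# and may be amplifying, bias present in its training data.
```

In this synthetic example the two groups end up with very different false positive rates, which is exactly the kind of disparity an audit of a deployed model would flag for further investigation.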

Where does this leave the analytics professional and data scientist? Data scientists and analytics professionals must consider not just the powerful benefits that their methods can yield in a specific context, but also the harm they can do in other contexts. A good question to ask is “how might an individual, or a country, with malicious designs make use of this technology?” It is not sufficient just to observe the rules that are currently in place; data professionals must also think and judge for themselves.