In this feature, we sometimes highlight terms that can have different meanings to different parts of the data science community, or in different contexts. Today’s term is “bias.”
1. To the layperson, and to those worried about the ethical problems sometimes posed by the deployment of AI models, bias means the unfair, differential application of rules or decisions to different groups of people.
2. To the pure statistician, bias refers to the tendency of an estimator (a statistic computed from a sample) to systematically underestimate or overestimate a quantity of interest. For example, the sample variance computed with a denominator of n is biased downwards as an estimate of the population variance; using n-1 as the denominator instead of n removes the bias.
3. In statistics and data science, biased data are data that are unrepresentative of the population they are supposed to represent.
4. In some machine learning contexts, I have heard a definition of biased data that conflicts with #3. There, biased data are data that are not balanced (i.e., the groups being represented are not equally sized).
5. In neural networks and some other models, the constant (intercept) is often called the bias term (a usage borrowed from electronics).
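The statistician's variance example above can be checked with a small simulation. The sketch below is illustrative only: the function names, the sample size of 5, the number of trials, and the choice of a standard normal population (true variance 1) are all assumptions made for the demonstration, not anything specified in the text.

```python
import random


def variance(sample, ddof=0):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)


def average_variance_estimates(n=5, trials=20000, seed=0):
    """Average both variance estimators over many small samples.

    Samples are drawn from a standard normal population, so the
    quantity being estimated (the population variance) is 1.
    """
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total_n += variance(sample, ddof=0)          # denominator n
        total_n_minus_1 += variance(sample, ddof=1)  # denominator n - 1
    return total_n / trials, total_n_minus_1 / trials


biased, unbiased = average_variance_estimates()
print(f"average estimate, denominator n:   {biased:.3f}")
print(f"average estimate, denominator n-1: {unbiased:.3f}")
```

With samples of size 5, the denominator-n average should land near 0.8 (that is, (n-1)/n times the true variance), while the denominator-(n-1) average should land near 1.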
Definitions 3 and 4 are especially confusing because data that are unbiased with respect to #3 will probably be biased with respect to #4: real populations rarely consist of equally sized groups, so a representative sample will rarely be balanced. In surveys, and in training machine learning models, smaller groups are often oversampled to achieve greater statistical power (in surveys) or better predictions (in machine learning). Such oversampling yields biased data according to definition 3 but, if carried to the point of equally sized groups, unbiased data according to definition 4.
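The tension between definitions 3 and 4 can be made concrete with a toy example. This is a minimal sketch under stated assumptions: the 90/10 split, the group labels "A" and "B", and the helper names `group_shares` and `oversample_to_balance` are hypothetical, and resampling with replacement is only one of several ways to oversample.

```python
import random
from collections import Counter


def group_shares(labels):
    """Return each group's share of the data as a fraction of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {group: counts[group] / total for group in sorted(counts)}


def oversample_to_balance(labels, seed=0):
    """Resample smaller groups with replacement until all groups are equal in size."""
    rng = random.Random(seed)
    by_group = {}
    for label in labels:
        by_group.setdefault(label, []).append(label)
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Draw extra records from the same group to reach the target size.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical sample that mirrors a population of 90% group A, 10% group B.
representative = ["A"] * 900 + ["B"] * 100
balanced = oversample_to_balance(representative)

print(group_shares(representative))  # mirrors the population: unbiased per definition 3
print(group_shares(balanced))        # equal-sized groups: unbiased per definition 4
```

The first data set matches the population's group shares but is unbalanced; the second is balanced but no longer matches the population, so each is "biased" under the other definition.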