# Mixed Models – When to Use

Companies now have a lot of data on their customers at an individual level.  Suppose you are tasked with forecasting customer spending at a grocery chain, and you want to understand how customer attributes, local economic factors, and store issues affect customer spending. You could design your study with hierarchical and mixed linear modeling methods in mind.

These methods had their antecedents as far back as 1861, when the British Astronomer Royal, George Airy, gathered a set of telescopic observations on multiple nights, with multiple observations each night. Describing the data, he noted the different variance components – within-nights and between-nights. It wasn’t until 1925, though, that R.A. Fisher presented a general method for dealing with different variance components (in his classic *Statistical Methods for Research Workers*).

## Clustering and Hierarchy

In your grocery project, there are also different components of variance.  Customer attributes operate at an individual customer level. This might include demographic data, prior spending and the like.  Customers are clustered at stores, so factors varying by store (e.g. store size and employee turnover, which might reflect store quality) should be modeled at the store level, by including store as an explanatory variable.  Further hierarchy is introduced by economic factors such as income levels and unemployment, which might operate at a more regional level, encompassing many stores.

Clustering is one way in which data departs from the simple model of independent observations (where one can think of observations as being picked randomly from a box). Other grouping occurs when you have repeated measurements (at the same time) for each subject, or when you have longitudinal data – variables recorded repeatedly over time for each subject.
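To make the idea of within-cluster and between-cluster variance concrete, here is a minimal sketch in plain Python. It simulates spending for customers clustered in stores and then recovers the two variance components with the classical one-way ANOVA (method-of-moments) estimator for balanced groups. The store counts, spending levels, and standard deviations are invented for illustration, and the function name `variance_components` is hypothetical.

```python
import random
import statistics


def variance_components(groups):
    """Method-of-moments estimates of between- and within-group variance.

    Assumes balanced data: every group has the same number of observations.
    """
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")

    group_means = [statistics.fmean(g) for g in groups]
    grand_mean = statistics.fmean(group_means)

    # Mean squares from a one-way analysis of variance.
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    ms_within = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    ) / (k * (n - 1))

    # The between-group estimate can come out negative by chance; truncate at zero.
    var_between = max((ms_between - ms_within) / n, 0.0)
    return var_between, ms_within


# Simulated (made-up) data: 200 stores with 50 customers each.
rng = random.Random(42)
TRUE_SD_STORE, TRUE_SD_CUSTOMER = 10.0, 20.0
stores = []
for _ in range(200):
    store_effect = rng.gauss(0, TRUE_SD_STORE)  # shared by all customers of a store
    stores.append(
        [100 + store_effect + rng.gauss(0, TRUE_SD_CUSTOMER) for _ in range(50)]
    )

var_store, var_customer = variance_components(stores)

# Intraclass correlation: share of total variance attributable to stores.
icc = var_store / (var_store + var_customer)
print(f"between-store variance: {var_store:.1f}")
print(f"within-store variance:  {var_customer:.1f}")
print(f"intraclass correlation: {icc:.2f}")
```

In this simulation the true intraclass correlation is 100 / (100 + 400) = 0.2, so two customers of the same store are noticeably more alike than two customers picked at random. That similarity is what the independence assumption ignores.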

## Fixed and Random Effects

You’ve probably run across the terms fixed effects and random effects. Google these terms and you will see a lot of information and some definitions that are not consistent with each other. Here are a few comments, paraphrased, from *Linear Mixed Models* by West, Welch and Galecki (Brady West and Andrzej Galecki developed our Mixed and Hierarchical Linear Models course):

Fixed factors are categorical variables, typically those being studied (e.g. gender, age group, treatment method). Data on all categories are included, and the categories are chosen to represent specific conditions that yield useful contrasts in a study. In the company-wide sales study, for example, if we were specifically interested in the effect that individual stores have on sales, all stores would need to be included in the study and store would be a fixed factor.

A random factor is a predictor (categorical or continuous) with levels that can be thought of as being randomly sampled from a population of levels being studied. Not all possible levels of the factor are present in the data set, but the researcher intends to make inference to the population of these levels. Individual subjects might be a random factor; so would the factors “store” and “region” in a study of company-wide sales data, where only some regions and stores are modeled. Variation in the outcome variable across different levels of the random factor is assessed as part of the model fitting.
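In notation, the simplest case combining both kinds of effect is a random-intercept model. The version below is a sketch for the grocery example, assuming a single customer-level covariate $x_{ij}$ and "store" as the only random factor; the symbols are introduced here for illustration and are not taken from West, Welch and Galecki.

$$
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

Here $y_{ij}$ is the spending of customer $i$ at store $j$, $\beta_0$ and $\beta_1$ are fixed effects, $u_j$ is the random effect of store $j$, and $\varepsilon_{ij}$ is the residual. The model estimates the variance $\sigma_u^2$ across stores rather than a separate coefficient for each store.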

Typically, when specifying a mixed model in software, both fixed and random effects are included as explanatory (predictor) variables, using an additional argument to specify that an effect is random, as opposed to fixed (the default).  Additional arguments are used to identify other elements of data structure, e.g. nested effects, where levels of one factor, e.g. individual stores, exist only within a level of another factor (e.g. region).
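As one possible illustration, here is how such a specification might look in Python's `statsmodels` package (the choice of package is an assumption; the same idea carries over to R, SAS, and others). The data are simulated, and the column names `spend`, `income`, `region` and `store` are hypothetical. The formula lists the fixed effects, `groups` marks region as a random factor, and `vc_formula` adds a variance component for stores nested within regions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated, made-up data: 8 regions, 5 stores per region, 30 customers per store.
rows = []
for r in range(8):
    region_effect = rng.normal(0, 5)
    for s in range(5):
        store_effect = rng.normal(0, 3)
        for _ in range(30):
            income = rng.normal(0, 1)
            spend = (
                100 + 2.0 * income + region_effect + store_effect + rng.normal(0, 4)
            )
            rows.append(
                {
                    "region": f"R{r}",
                    "store": f"R{r}-S{s}",  # store labels exist only within one region
                    "income": income,
                    "spend": spend,
                }
            )
data = pd.DataFrame(rows)

model = smf.mixedlm(
    "spend ~ income",                      # fixed effects (the default kind)
    data,
    groups="region",                       # random factor: region
    re_formula="1",                        # random intercept for each region
    vc_formula={"store": "0 + C(store)"},  # stores nested within regions
)
result = model.fit()
print(result.summary())
```

The summary reports the fixed-effect estimate for `income` alongside the estimated variances for region and for store within region. With only eight regions the region variance is estimated imprecisely, so a real study would want more groups at each level.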