Are you “young and rustic?” Or perhaps a “toolbelt traditionalist?” These are nicknames given to customer segments identified by market research firm Claritas, with its statistical clustering tool. Long before the advent of individualized product recommendations, businesses sought to segment customers into distinct groups on the basis of purchase behavior, demographic variables, and geography, to better target both products and sales efforts. Often companies will attach nicknames like those above to the segments that result from the statistical process, as a mnemonic device. Cluster analysis is the main statistical contribution to this effort, and two algorithms are particularly popular: k-means clustering and hierarchical clustering.
Both rely on metrics of distance – distance between records, and distance between clusters. Typically, the variables are all standardized so that scale does not affect the result (you would not want the result to change if you switched from using meters to kilometers, for example). With one popular metric, Euclidean distance, you take the square root of the sum of the squared differences between the variable values of records A and B (or of cluster centers A and B).
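To make the point about standardization concrete, here is a minimal sketch in plain Python. The three customer records (distance to the store and purchases per year) are made-up values for illustration, and the helper names standardize and euclidean are our own, not part of any library. The sketch assumes no variable is constant, since a constant column would have a standard deviation of zero.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Convert every column to z-scores (mean 0, standard deviation 1)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]  # assumes no constant column
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def euclidean(a, b):
    """Square root of the sum of squared differences between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical records: (distance to store, purchases per year)
in_meters = [[1200.0, 4.0], [3500.0, 12.0], [800.0, 9.0]]
in_kilometers = [[d / 1000, p] for d, p in in_meters]

# Raw distances between the first two records depend on the unit chosen.
raw_m = euclidean(in_meters[0], in_meters[1])
raw_km = euclidean(in_kilometers[0], in_kilometers[1])

# After standardizing, the unit no longer matters.
std_m = euclidean(*standardize(in_meters)[:2])
std_km = euclidean(*standardize(in_kilometers)[:2])

print(raw_m, raw_km)
print(std_m, std_km)
```

On the raw data, the distance variable dominates when it is measured in meters and nearly vanishes when it is measured in kilometers. The two standardized distances agree up to floating-point rounding.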
The k-means algorithm randomly divides the cases into k clusters, then iteratively assigns each case to the nearest of the k clusters, either by moving it or leaving it where it is. At each step, cluster centers are recomputed before the next case is assigned. The process stops when further reassignment no longer reduces cluster dispersion.
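The procedure just described can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production implementation. The function name k_means, the seed and max_passes parameters, and the six sample points are all hypothetical. As a simplification, the loop stops when a full pass over the cases moves nothing, and it never moves the last remaining member out of a cluster.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Mean of each variable across the points in one cluster."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, seed=0, max_passes=100):
    """Return one cluster label (0..k-1) per point. Requires len(points) >= k."""
    rng = random.Random(seed)
    # Random initial partition; cycling labels first keeps every cluster non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)

    for _ in range(max_passes):
        moved = False
        for i, point in enumerate(points):
            # Recompute the cluster centers before assigning the next case.
            centers = [
                centroid([p for p, lab in zip(points, labels) if lab == c])
                for c in range(k)
            ]
            nearest = min(range(k), key=lambda c: euclidean(point, centers[c]))
            current = labels[i]
            # Move the case only if another center is closer and the move
            # would not empty its current cluster.
            if nearest != current and labels.count(current) > 1:
                labels[i] = nearest
                moved = True
        if not moved:  # a full pass with no moves: stop
            break
    return labels


# Hypothetical standardized records forming two well-separated groups
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
labels = k_means(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which cases share a label. Different random starts can produce different final clusters on less tidy data, so in practice the algorithm is often run several times.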
Hierarchical clustering starts with each case being its own (singleton) cluster. The two closest cases are then joined together into a cluster, then the next two closest cases or clusters, and so on, until there is just one cluster left at the end, consisting of all the records. The result is a plot called a “dendrogram,” where each position on the x-axis is an individual case, and the y-axis represents distance. Here is a dendrogram of electric utilities (the variables used to cluster the data are sales and fuel cost).
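The merging process can be sketched as follows. The text does not say how the distance between two clusters is measured, so this sketch assumes single linkage (the distance between the closest pair of members), which is only one of several common choices. The function names and the five sample points are hypothetical and are not the utilities data.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(cluster_a, cluster_b, points):
    """Assumed cluster distance: distance between the two closest members."""
    return min(
        euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b
    )


def agglomerate(points):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = [(i,) for i in range(len(points))]  # every case starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and join them.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, single_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical standardized records, identified by their index 0..4
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.5], [10.0, 0.0]]
history = agglomerate(points)
for members_a, members_b, distance in history:
    print(members_a, "+", members_b, "at distance", distance)
```

Each line of the merge history corresponds to one horizontal bar in a dendrogram, drawn at the height given by the merge distance.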
Madison and Northern are quite close together and form their own small cluster, as do New England and United, at 1.4 units of distance. Moving up to 3.6 units of distance and following the horizontal line across, we see there are 6 clusters at that distance. Moving to fewer clusters would require that we allow greater inter-case distance.
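Reading the number of clusters off a dendrogram at a given height can also be done numerically. The sketch below assumes single linkage, for which cutting at a height is equivalent to linking every pair of cases no farther apart than that height and counting the connected groups. The function name clusters_at and the sample points are hypothetical and are not the utilities data.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def clusters_at(points, height):
    """Groups of case indices that remain joined when cutting at `height`.

    Valid for single linkage only (an assumption of this sketch).
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of cases whose distance does not exceed the cut height.
    for i, j in combinations(range(len(points)), 2):
        if euclidean(points[i], points[j]) <= height:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical standardized records, identified by their index 0..4
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.5], [10.0, 0.0]]
for height in (0.5, 1.2, 2.0, 5.0, 7.0):
    print(height, len(clusters_at(points, height)))
```

As the cut height rises, the cluster count falls, which is the same trade-off the dendrogram shows: fewer clusters means tolerating greater distance within each one.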
Hierarchical clustering has the advantage of allowing you to visualize the cluster structure and identify what might be a “natural” number of clusters. However, it does not scale well: the visualization would not be useful with thousands of cases on the x-axis.
K-means does scale, not being dependent on visualization, and is also computationally faster. However, it does require that you specify the number of clusters, k.
Another popular segmentation method that is simple from a statistical standpoint, but is powerful from a business standpoint, is called RFM for “recency, frequency and monetary” analysis. The theory behind RFM is that customers who have purchased recently and frequently, and who spend more money, constitute a definable and valuable segment that merits close attention and cultivation, regardless of how similar they are to one another. Likewise, customers who score low on all these attributes are less critical, and may not merit more than a periodic message to keep in touch. RFM was developed decades ago for the direct mail industry, but these three variables remain of key importance with broader clustering models that include more detailed purchase information, as well as demographic and geographic data.
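A minimal sketch of RFM scoring in plain Python follows. The transactions, the reference date, and the scoring scheme are all assumptions made for illustration: each customer is ranked on each of the three variables and the ranks are summed. Real RFM schemes often use quintiles or business-specific cutoffs instead.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount)
transactions = [
    ("c1", date(2024, 5, 28), 120.0),
    ("c1", date(2024, 4, 2), 80.0),
    ("c1", date(2024, 1, 15), 60.0),
    ("c2", date(2023, 9, 10), 35.0),
    ("c3", date(2024, 5, 1), 40.0),
    ("c3", date(2024, 3, 20), 25.0),
]
as_of = date(2024, 6, 1)  # hypothetical analysis date


def rfm_table(transactions, as_of):
    """Per customer: days since last purchase, purchase count, total spend."""
    table = {}
    for customer, when, amount in transactions:
        row = table.setdefault(
            customer, {"recency": None, "frequency": 0, "monetary": 0.0}
        )
        days = (as_of - when).days
        row["recency"] = days if row["recency"] is None else min(row["recency"], days)
        row["frequency"] += 1
        row["monetary"] += amount
    return table


def rank_scores(table, field, higher_is_better=True):
    """Score customers 1..n on one field, with n the best (ties: input order)."""
    ordered = sorted(
        table, key=lambda c: table[c][field], reverse=not higher_is_better
    )
    return {customer: rank + 1 for rank, customer in enumerate(ordered)}


table = rfm_table(transactions, as_of)
recency = rank_scores(table, "recency", higher_is_better=False)  # fewer days is better
frequency = rank_scores(table, "frequency")
monetary = rank_scores(table, "monetary")

# Assumed combination rule: simply add the three scores.
totals = {c: recency[c] + frequency[c] + monetary[c] for c in table}
print(totals)
```

Customers with the highest totals are the recent, frequent, high-spending group described above, and those with the lowest totals are candidates for the occasional keep-in-touch message.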