Statistical and Machine Learning Methods for Analyzing Clusters and Detecting Anomalies

This course will teach you how to use various cluster analysis methods to identify possible clusters in multivariate data. Methods discussed include hierarchical clustering, k-means clustering, two-step clustering, and normal mixture models for continuous variables.


Clusters are groups of data points that are internally cohesive and well separated from other groups. In marketing, cluster analysis is the basis for identifying clusters of customer records, a process called market segmentation. An anomaly is a pattern in the data that does not conform to expected normal behavior. In one sense an anomaly is the flip side of a cluster: a data point, or a set of points, that is distant from any cluster. Anomaly detection is useful in a variety of fields, such as fraud surveillance and the monitoring of complex industrial processes.

This is a hands-on course in which you will use statistical software to apply clustering algorithms to real data and interpret the results. The same cluster analysis methods can be used to identify anomalies. The course also covers the use of supervised learning algorithms to identify anomalies.
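
To make the idea of an anomaly as "a point distant from a cluster" concrete, here is a minimal sketch in Python. Python is an assumption made for illustration only; the course itself works in statistical packages. The helper names (`centroid`, `anomaly_scores`) and the toy data are hypothetical. Each candidate point is scored by its Euclidean distance to the nearest cluster centroid, so a larger score suggests a more unusual point.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def anomaly_scores(new_points, clusters):
    """Distance from each new point to the nearest cluster centroid."""
    centers = [centroid(c) for c in clusters]
    return [min(math.dist(p, c) for c in centers) for p in new_points]

# Toy data (hypothetical): two compact clusters of four points each.
clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(8, 8), (8, 9), (9, 8), (9, 9)],
]
# One candidate sits at a cluster center, the other lies between the clusters.
candidates = [(0.5, 0.5), (4, 4)]

scores = anomaly_scores(candidates, clusters)
for point, score in zip(candidates, scores):
    print(point, round(score, 2))
```

How large a score must be before a point counts as anomalous is a judgment call that depends on the data and the application.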

Learning Outcomes

After taking this course, you will be able to:

  • Conduct hierarchical cluster analysis and k-means clustering to identify clusters in multivariate data
  • Use normal mixture models for clustering of continuous variables
  • Interpret/diagnose the output of different clustering procedures
  • Apply normalization of data appropriately in cluster analysis
  • Identify the assignment of cases to clusters
  • Determine how to apply a supervised learning algorithm to a classification problem for anomaly detection
  • Apply and assess a clustering algorithm for identifying anomalies in the absence of labels

Who Should Take This Course

  • Marketing analysts who need to cluster customer data as part of a market segmentation strategy;
  • Computational biologists (e.g. for taxonomy);
  • Environmental scientists (e.g. for habitat studies);
  • IT specialists (e.g. in modeling web traffic patterns);
  • Military and national security analysts (e.g. in automated analysis of intercepted communications).

Course Syllabus

Week 1

Hierarchical Clustering

  • Hierarchical clustering – dendrograms
  • Divisive vs. agglomerative methods
  • Distance metrics
  • Different linkage methods
  • Single linkage as anomaly detector
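
The Week 1 topics can be illustrated with a small sketch in Python (an assumption for illustration; the course solutions use other software). It implements naive agglomerative clustering with single linkage, where the distance between two clusters is the smallest distance between any pair of their members. An isolated point tends to stay a singleton until very late in the merge sequence, which is what makes single linkage usable as an anomaly detector. The function names, the toy data, and the flagging rule (a singleton that merges at more than three times the median merge height) are all hypothetical choices, not the course's method.

```python
import math
import statistics

def single_linkage(points):
    """Naive agglomerative clustering with single linkage.

    Returns the merge history as (left_members, right_members, height),
    where members are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of members across the two clusters.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

def late_singletons(merges, factor=3.0):
    """Flag points that join the tree as singletons at an unusually large height.

    The threshold (factor * median merge height) is an illustrative assumption.
    """
    threshold = factor * statistics.median(h for _, _, h in merges)
    flagged = []
    for left, right, height in merges:
        if height > threshold:
            for side in (left, right):
                if len(side) == 1:
                    flagged.append(side[0])
    return flagged

# Toy data (hypothetical): two tight groups plus one isolated point at index 7.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (20, 20)]
merges = single_linkage(points)
flagged = late_singletons(merges)
print("flagged indices:", flagged)
```

The merge heights recorded here are the same quantities a dendrogram displays on its vertical axis. This brute-force version is only suitable for small data sets.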

Week 2

K-means Clustering

  • K-means Clustering
  • Choosing number of clusters
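
As a rough illustration of the Week 2 topics, the sketch below runs a basic k-means (Lloyd's algorithm) in Python for several values of k and reports the within-cluster sum of squares (WCSS). Python, the function name `kmeans`, the toy data, and the deterministic farthest-first initialization are assumptions made for this example. One common heuristic for choosing the number of clusters, often called the elbow method, looks for the value of k after which WCSS stops dropping sharply.

```python
import math

def kmeans(points, k, n_iter=100):
    """Basic k-means. Returns (labels, centers, wcss)."""
    # Farthest-first initialization (an illustrative, deterministic choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, wcss

# Toy data (hypothetical): three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]

wcss_by_k = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

On data like this, WCSS falls steeply up to k = 3 and only slightly afterwards. Real data rarely shows such a clean elbow, so the heuristic is usually combined with subject-matter judgment.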

Week 3

Normal Mixture Model

  • Finite mixture model
  • Statistical models to identify constituent groups
  • K-means clustering as a special case
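
The Week 3 topics can be sketched with a one-dimensional normal mixture fitted by the EM algorithm. This is a minimal Python example under stated assumptions: Python itself, the function names, the toy data, the starting means, and the small variance floor are all hypothetical choices. Each case receives a probability of membership in every component (its "responsibilities"). Turning those probabilities into hard labels, with equal weights and equal variances across components, gives the nearest-mean assignment rule that k-means uses, which is the sense in which k-means can be viewed as a special case.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mixture(xs, init_means, n_iter=50):
    """EM for a univariate normal mixture. Returns (weights, means, variances, resp)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each case.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor is an assumption to avoid collapse
    return weights, means, variances, resp

# Toy data (hypothetical): two groups centered near 1 and 5.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = fit_mixture(xs, init_means=[0.0, 6.0])

# Hard assignment: pick the most probable component for each case.
hard_labels = [max(range(len(means)), key=lambda j: r[j]) for r in resp]
print("weights:", [round(w, 3) for w in weights])
print("means:", [round(m, 3) for m in means])
print("hard labels:", hard_labels)
```

Unlike k-means, the mixture model also estimates a weight and a variance for each component, so it can describe groups of different sizes and spreads.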

Week 4

Practical Considerations

  • Using subsets of variables
  • Different data types
  • Cluster quality and robustness
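
One widely used measure of cluster quality is the silhouette coefficient, which compares each point's average distance to its own cluster (a) with its average distance to the nearest other cluster (b), giving (b - a) / max(a, b). The sketch below is a minimal Python version under stated assumptions: Python, the function name `silhouette_scores`, the toy data, and the convention of scoring singleton clusters as 0 are illustrative choices, and the function expects at least two clusters. Whether the course uses this particular measure is not stated in the syllabus.

```python
import math

def silhouette_scores(points, labels):
    """Silhouette value for each point; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for a singleton cluster
            continue
        a = sum(own) / len(own)
        # Average distance to each other cluster; keep the smallest.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy data (hypothetical): two compact groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]   # ignores the visible grouping

good_mean = sum(silhouette_scores(points, good_labels)) / len(points)
bad_mean = sum(silhouette_scores(points, bad_labels)) / len(points)
print("good partition:", round(good_mean, 3))
print("bad partition:", round(bad_mean, 3))
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below suggest that points sit between clusters or are assigned to the wrong one.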

Class Dates


11/11/2022 to 12/09/2022


05/26/2023 to 06/23/2023
11/10/2023 to 12/08/2023


We assume you are versed in statistics. This course also assumes knowledge of supervised learning and some familiarity with multivariate data, such as that provided in the following courses.

Predictive Analytics 1 – Machine Learning Tools

This online course introduces the basic paradigm of predictive modeling: classification and prediction.
  • Skill: Intermediate
  • Credit Options: ACE, CAP, CEU

Predictive Analytics 2 – Neural Nets and Regression

As a continuation of Predictive Analytics 1, this course introduces the basic concepts in predictive analytics used to visualize and explore predictive modeling.
  • Skill: Intermediate
  • Credit Options: ACE, CAP, CEU

Additional Information


Homework in this course consists of short-answer questions that test concepts and guided data analysis problems using software. In addition to assigned readings, the course has an end-of-course data modeling project.

Course Text

This course will use papers that will be made available electronically, and will also refer to sections from the book Cluster Analysis, 5th Edition, by Brian S. Everitt, Sabine Landau, Morven Leese, and Daniel Stahl.


This is a hands-on course. Participants will apply clustering algorithms to real data and interpret the results, so software capable of doing cluster analysis is required. The model solutions for the assignments were developed in IBM SPSS Statistics and Latent Gold. We also provide solutions using R. Other possible choices include XLStat and Analytic Solver Data Mining.

Options for Credit and Recognition

ACE CREDIT | College Credit
This course has been evaluated by the American Council on Education (ACE) and is recommended for Graduate credit, 3 semester hours in statistics. Please note that the decision to accept specific credit recommendations is up to the academic institution accepting the credit.

Supplemental Information

Literacy, Accessibility, and Dyslexia

