Introduction to Interpretable Clustering


Clustering is a popular unsupervised learning task that groups similar data points. Despite being such a common machine learning task, most clustering algorithms do not explain the characteristics of each cluster or why a point belongs to a cluster, forcing users to do extensive cluster profiling. This time-consuming process becomes increasingly difficult as datasets grow larger, which is ironic given that one of the main uses for clustering is to discover trends and patterns in the data.

With these considerations in mind, wouldn't it be nice to have an approach that not only clustered the data but also provided innate profiles of each cluster? Well, that's where the field of interpretable clustering comes into play. These approaches construct a model that maps points to clusters, and users can analyze this model to understand the qualities of each cluster. In this article, I aim to discuss why this field is important and cover some of the main tracks of interpretable clustering.

If you're interested in interpretable machine learning and other aspects of ethical AI, consider checking out some of my other articles and following me!

Interpretable and Ethical AI

Motivation

Before we talk about interpretable clustering, let's go through a motivating example:

You are a data scientist working for a large clothing retailer. Currently, the company is looking to develop a few new brands to stock in stores but needs to know what direction each brand should take in terms of style and price points. As a result, the clothing design team has asked you to analyze their database of customer information to generate customer profiles they can use to drive the fashion design process. This database contains a large amount of information that characterizes each previous customer, such as age, gender, race, income, n_items_purchased, avg_money_spent_per_purchase, and so on.

You, like many other data scientists, immediately think to group these customers and decide to develop a clustering model such as K-Means or Hierarchical Clustering. You get clustering assignments for each customer in the database, but is this what the clothing design team wanted? No! They want clear customer profiles they can use to make decisions. As a result, you need to assign profiles for each cluster. There are a few ways people traditionally do this:

Centroids/Statistical Summaries

One common way people are taught to profile clusters (especially ones generated by K-Means) is to look at the centroids of each cluster. However, this comes with a few problems.

First of all, centroids do not always make sense, especially for categorical variables. If I have one-hot encoded variables for race, the values they take in the centroid may look something like asian=0.43, african_american=0.55, white=0.54. What would this mean? We can't create a good profile for the cluster based on this.

Additionally, centroids remove a lot of information about the cluster. If the average value of age in a cluster is 45, we might say that this cluster represents middle-aged customers… or the cluster could be bimodal, with many young and old customers but few middle-aged individuals. We can't accurately profile a cluster by simply looking at the value of a feature in the centroid.

As a result, we would often need more complex statistical measures to profile the cluster, such as the median and variance of the features in the cluster. However, this dramatically increases the number of quantities one would need to examine, making it difficult to form a concise profile.

Furthermore, utilizing the centroid to explain a cluster makes it difficult to identify which features are important for a given cluster. This becomes especially problematic with extremely high-dimensional datasets, where you face the curse of dimensionality. As a result, accurately creating clean cluster profiles this way is almost impossible.

In summary: Centroids and Statistical Summaries of clusters are not an ideal way to profile clusters.

Fitting a Second Model

One way to create a mapping for a given clustering is to train an interpretable classifier on the labels generated by a clustering. The idea behind this is that while the algorithm used to cluster our data (such as K-Means) may not be interpretable, the classification model could be. One common approach for this is to fit a decision tree such as CART on the results of the clustering:

(Figure: a decision tree fit on the labels produced by a black-box clustering)
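As a concrete sketch of this two-stage approach (my own illustration using scikit-learn, not code from any of the referenced papers):

```python
# A minimal two-stage sketch: cluster with K-Means, then fit an
# interpretable CART tree on the resulting cluster labels.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# The tree, not K-Means, now defines the cluster boundaries.
tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree))  # human-readable splits, e.g. "feature_0 <= ..."
```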

The tree is then used to determine cluster assignments rather than the original clustering algorithm. Since the decision tree is interpretable, you can examine it to gain insights into the clusters. For example, one node might split on age while another splits on gender. While this is better than interpreting centroids, there are three main problems:

  1. Faithfulness: Since the classifier is not the algorithm used to cluster the data, the explanations it provides may just be coincidental rather than the true reasons behind the clustering.
  2. Complexity: When using the two-stage approach, the model often becomes unnecessarily complex [1]. In the example of a decision tree, a depth of 3 might be sufficient to fully represent the cluster boundaries, but CART might only achieve 100% accuracy at a depth of 5 or 6. This is due to the greedy nature of decision tree algorithms.
  3. Sub-Optimality: If we limit the model complexity, fitting a classifier after clustering often results in sub-optimal cluster labels [1].

Ideally, instead of fitting a classifier in a supervised manner, we would directly construct the clusters using an interpretable model, thereby avoiding these issues. This task is known as Interpretable Clustering.

Interpretable Clustering Approaches

As mentioned before, interpretable clustering approaches directly partition a dataset while simultaneously providing explanations of the obtained clusters. While there are a variety of methods to achieve this, the most successful method is unsupervised clustering trees [1,2,3,4].

Example of a clustering tree (Image from Bertsimas et al. 2021 [1])

Unlike CART, which optimizes the impurity of known labels when choosing splits, unsupervised clustering trees are constructed by optimizing a variety of clustering metrics through various optimization algorithms. The general idea behind all of these algorithms is the same: by directly constructing clusters via a tree (in contrast to fitting a tree classifier after the fact), the resulting clusters can be as high-quality as those of a black-box clustering algorithm, and the explanations will be faithful and relatively concise.

Apart from clustering trees, other approaches include Specialized Neural Networks [5,8], Polytope Machines [6], and Rectangular Rules [7].

All the approaches for interpretable clustering are constructed via three main paradigms that I will cover here:

  1. Greedy Construction + Refinement
  2. Mathematical Programming
  3. Neural Clustering

Greedy Construction + Refinement

The first paradigm for interpretable clustering is an approach I call Greedy Construction and Refinement. In this approach, the algorithm first runs a standard black-box clustering and fits a classifier onto the generated labels (the greedy construction). It then refines the resulting tree until various conditions on tree quality are met. This is the most popular method for constructing interpretable clustering trees.

The most well-known method that follows this paradigm is Explainable K-Means Clustering (ExKMC) [2]. This algorithm has two main phases.

In the first phase, ExKMC fits a normal K-Means clustering, saves the cluster centers, and then fits a tree with K leaves (one leaf per cluster) onto the clustering labels. The result of this first phase is a sparse tree with a small number of decisions. However, this tree is significantly constrained, a problem the second phase fixes.

In the second phase, ExKMC grows the tree to K′ leaves (where, understandably, K′ > K) by expanding the leaves of the tree constructed in the first phase. The idea is to find a tree with K′ leaves that minimizes the following cost:

cost(T) = \sum_{i=1}^{n} \| x_i - \mu_{\ell(x_i)} \|^2

Here, μ denotes the cluster centers discovered by the initial run of K-Means, and ℓ(x_i) is the center assigned to the leaf containing point x_i. This should be a familiar equation: it's the same cost function that K-Means clustering uses, also known as inertia. At each iteration of the algorithm, the tree adds the split that reduces this cost the most, similar to how CART minimizes impurity.
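As a quick illustration, this cost is easy to compute given the tree's leaf-to-center assignments and the fixed initial centers (a sketch of my own, not the paper's implementation):

```python
import numpy as np

def tree_cost(X, assigned_center, centers):
    """ExKMC-style cost: X is (n, d), assigned_center is (n,) holding the
    index of the initial K-Means center assigned to each point's leaf,
    and centers is (k, d) holding the fixed initial centers."""
    return float(((X - centers[assigned_center]) ** 2).sum())
```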

The result of the ExKMC algorithm is an explainable tree that achieves a clustering cost close to that of K-Means clustering. An example of an ExKMC tree can be seen below:

Example of an ExKMC clustering tree trained on the 20newsgroups dataset (Image from [2])

This algorithm can be installed simply by running pip install ExKMC.
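Basic usage, as I recall from the package's README (double-check the project page for the current API), looks roughly like this:

```python
from ExKMC.Tree import Tree

# Build a tree that predicts k=3 clusters using up to 6 leaves.
tree = Tree(k=3, max_leaves=6)
labels = tree.fit_predict(X)  # X is a (n_samples, n_features) array
```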

The main benefit of algorithms that follow the Greedy Construction + Refinement paradigm is that they are often very fast compared to the other paradigms, especially on small datasets. This is largely because they start from a black-box clustering initialization and have low overhead.

Mathematical Programming

The second paradigm for constructing clustering trees is mathematical programming. While algorithms like ExKMC rely on iterative refinement of an initialized model, mathematical programming approaches instead build the clustering from the ground up by formulating it as a constrained optimization problem. These algorithms use linear/integer/constraint programming or Boolean satisfiability to define the model and the metric to optimize (e.g., inertia, silhouette score).

While these take longer to run than the previous paradigm, they have a key quality: flexibility. Because they are formulated as mathematical programs, it is significantly easier for users to customize these algorithms for their specific use case by adding constraints and regularization terms. Want to ensure two points end up in different clusters? You can add that as a constraint (see the sketch below). Would you like clusters to have approximately equal sizes? You can modify the objective to include a regularization term that penalizes imbalanced cluster sizes.
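To make this flexibility concrete, here is a minimal toy sketch (my own formulation, not one of the cited methods) of a cannot-link constraint added to a cluster-assignment mixed-integer program using PuLP, with candidate centers fixed so the model stays linear:

```python
import numpy as np
import pulp

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))   # toy data
centers = X[:3]                # three fixed candidate centers
dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
n, k = dist.shape

prob = pulp.LpProblem("constrained_clustering", pulp.LpMinimize)
a = [[pulp.LpVariable(f"a_{i}_{c}", cat="Binary") for c in range(k)]
     for i in range(n)]

# Objective: total squared distance of points to their assigned centers.
prob += pulp.lpSum(float(dist[i][c]) * a[i][c]
                   for i in range(n) for c in range(k))

# Each point belongs to exactly one cluster.
for i in range(n):
    prob += pulp.lpSum(a[i][c] for c in range(k)) == 1

# Cannot-link: points 0 and 1 may never share a cluster.
for c in range(k):
    prob += a[0][c] + a[1][c] <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
labels = [max(range(k), key=lambda c: a[i][c].value()) for i in range(n)]
```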

Some examples of approaches that perform interpretable clustering through optimization problems include:

  1. ICOT [1]: This method defines a tree through mixed-integer programming and aims to optimize the silhouette score of the discovered clusters.
  2. SAT Clustering Tree [4]: This method defines the clustering tree as a Boolean satisfiability (SAT) problem and also provides guidance on how to incorporate subject-matter expertise by adding extra SAT clauses.
  3. Polytope Machine Clustering [6]: This approach uses mixed-integer linear programming to identify a set of polytopes, and users can understand the characteristics of each cluster by examining how the polytopes separate points.

Neural Clustering

The last and newest paradigm for interpretable clustering is gradient-based clustering. The idea behind this paradigm is to use architectures that can be trained through GPU-friendly mini-batch gradient descent. This makes the paradigm suitable for arbitrarily large datasets, since the model can be updated one batch at a time. However, this naturally comes with a runtime trade-off on small datasets due to the overhead of mini-batching. Additionally, neural approaches offer the ability to use or learn alternate representations of the data.

There are two main streams of neural clustering. The first stream uses a specialized neural network to represent the K-Means problem. These approaches represent the centroids as the weights of a single-layer network and generate soft clustering assignments by taking the distances of points to the centroids and applying a softmin (i.e., a softmax over negated distances). Additionally, these works often incorporate an auto-encoder mechanism into the network, allowing it not just to cluster the data but also to learn representations useful for other downstream tasks.
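A minimal sketch of this idea (my own simplification of the works below; the beta temperature parameter is an assumption of mine) might look like this in PyTorch:

```python
import torch

class SoftKMeans(torch.nn.Module):
    """Centroids as the weights of a single layer; assignments via softmin."""
    def __init__(self, n_clusters, dim, beta=1.0):
        super().__init__()
        self.centroids = torch.nn.Parameter(torch.randn(n_clusters, dim))
        self.beta = beta  # inverse temperature: higher = harder assignments

    def forward(self, x):
        d2 = torch.cdist(x, self.centroids) ** 2        # (batch, k) squared distances
        assign = torch.softmax(-self.beta * d2, dim=1)  # softmin over distances
        loss = (assign * d2).sum(dim=1).mean()          # differentiable K-Means-style loss
        return assign, loss

x = torch.randn(256, 8)
model = SoftKMeans(n_clusters=4, dim=8)
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(200):  # mini-batch training would iterate a DataLoader instead
    opt.zero_grad()
    _, loss = model(x)
    loss.backward()
    opt.step()
```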

Some notable works following this stream include:

  1. Kauffmann et al. [5]: This work "neuralizes" the K-Means problem and explains the important features for each cluster using Layerwise Relevance Propagation.
  2. TELL [8]: This work constructs a network extremely similar to that of Kauffmann et al. [5]; however, the authors demonstrate the ability to use an image auto-encoder to perform image clustering. Additionally, the learned centroids can be visualized when clustering images.

The other, newer stream of interpretable clustering instead uses soft/fuzzy versions of traditionally interpretable architectures, namely Soft/Neural Decision Trees [3]. For more information on Soft Decision Trees, I encourage reading my previous article on how they work:

Neural Networks as Decision Trees

TL;DR: Unlike nodes in normal decision trees, which send points either right or left, nodes in Soft/Neural Decision Trees assign fuzzy weights between zero and one to the right and left branches. This softness makes them differentiable, so they can be trained through gradient descent. Unlike a soft classification tree, which is trained using cross-entropy, the Soft Clustering Tree (SCT) is trained by optimizing a fuzzy clustering loss:

L = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{m} \| x_i - z_j \|^2   (equation from [3])

Here, w_ij is the fuzzy assignment weight of point x_i to cluster j, m is a softness factor (m > 1) that controls how fuzzy the clustering is (higher = fuzzier), and z_j is the centroid of cluster j.
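For concreteness, here is how that loss could be computed given the tree's fuzzy assignments (a sketch of mine, not the authors' code):

```python
import torch

def fuzzy_clustering_loss(x, w, z, m=2.0):
    """x: (n, d) points; w: (n, k) fuzzy assignments from the soft tree;
    z: (k, d) centroids; m > 1 is the softness factor."""
    d2 = torch.cdist(x, z) ** 2   # squared distances, shape (n, k)
    return ((w ** m) * d2).sum()
```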

Conclusion

Clustering is an important machine learning task, and being able to understand the qualities of each cluster is critical to gain useful insights into your dataset. Interpretable clustering approaches combine the two steps of clustering and profiling by providing a model that intrinsically explains how points are routed to each cluster.

While not as mainstream as other machine learning fields, interpretable clustering has been gaining a lot of traction in recent years. There are now many approaches that aim to tackle this task, only a handful of which were covered here. I hope this article provides a good starting point for exploring them.

Resources and References

  1. If you are interested in Interpretable Machine Learning or Time Series Analysis, consider following me: https://medium.com/@upadhyan.
  2. See my other articles on interpretable machine learning: https://medium.com/@upadhyan/list/interpretable-and-ethical-ai-f6ee1f0b476d.

References

[1] Bertsimas, D., Orfanoudaki, A., & Wiberg, H. (2021). Interpretable clustering: an optimization approach. Machine Learning, 110(1), 89–138.

[2] Frost, N., Moshkovitz, M., & Rashtchian, C. (2020). ExKMC: Expanding Explainable k-Means Clustering. arXiv preprint arXiv:2006.02399.

[3] Cohen, E. (2023, May). Interpretable Clustering via Soft Clustering Trees. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research (pp. 281–298). Cham: Springer Nature Switzerland.

[4] Shati, P., Cohen, E., & McIlraith, S. (2023). Optimal decision trees for interpretable clustering with constraints (extended version). arXiv preprint arXiv:2301.12671.

[5] Kauffmann, J., Esders, M., Ruff, L., Montavon, G., Samek, W., & Müller, K. R. (2022). From clustering to cluster explanations via neural networks. IEEE Transactions on Neural Networks and Learning Systems, 35(2), 1926–1940.

[6] Lawless, C., Kalagnanam, J., Nguyen, L. M., Phan, D., & Reddy, C. (2022, June). Interpretable clustering via multi-polytope machines. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 7, pp. 7309–7316).

[7] Pelleg, D., & Moore, A. (2001, June). Mixtures of rectangles: Interpretable soft clustering. In ICML (Vol. 2001, pp. 401–408).

[8] Peng, X., Li, Y., Tsang, I. W., Zhu, H., Lv, J., & Zhou, J. T. (2022). XAI beyond classification: Interpretable neural clustering. Journal of Machine Learning Research, 23(6), 1–28.

