Unsupervised Learning Series – Exploring Self-Organizing Maps

Image by teckhonc @Unsplash.com

Self-Organizing Maps (SOMs) are a type of unsupervised neural network utilized for clustering and visualization of high-dimensional data. SOMs are trained using a competitive learning algorithm, in which nodes (also known as neurons) in the network compete for the right to represent input data.

The SOM architecture consists of a 2D grid of nodes, where each node is associated with a weight vector that plays the role of a centroid mean in the SOM solution. During training, the nodes arrange themselves around similar data points, producing a map that represents the underlying data.

SOMs are commonly used for a wide array of tasks such as:

  • data visualization
  • anomaly detection
  • feature extraction
  • clustering

We can also think of SOMs as one of the simplest neural network approaches to unsupervised learning!

While they seem confusing at first, Self-Organizing Maps (or Kohonen Maps, named after their inventor) are one interesting type of algorithm that is able to map the underlying structure from the data. They can be described as follows:

  • a one-layer unsupervised neural network, without backpropagation.
  • a restricted k-means solution, where nodes have the ability to influence the movement of other nodes (in the context of k-means, the nodes are known as centroids).

In this blog post, we'll do a couple of experiments on the SOM model. Later, we'll apply a Self-Organizing Map to a real use case, where we will be able to see the main features and shortcomings of the algorithm.


Understanding how SOMs Learn

To understand how SOMs learn, let's start by plotting a toy dataset in 2 dimensions.

We'll create a numpy array with the following dataset and plot it afterwards:

import numpy as np
X = np.array([[1, 2], [2, 1], [1, 3], [1, 2.5], [3.1, 5], [4, 10], [3.6, 5.4], [2, 3]])
Toy Dataset Plotting – Image by Author

These 8 data points are plotted along the X and Y axes, which represent two arbitrary variables.

How do we start to train a Self-Organizing Map on this dataset? The first important thing to know about SOMs is that the algorithm relies on several hyper-parameters. It's very important to understand how they work and how they impact the learning process.

The first set of hyper-parameters defines the size of the SOM, commonly called the Lattice. This value is similar to k-means' number of centroids and to the number of neurons in a neural network layer. It defines the number of elements that we are going to use to summarize our data points.

For this example, we are going to build a simple 2 by 2 lattice. This means that we will have 4 centroids (neurons) that we will use to represent our full dataset.

Learning 1: The size of the lattice is one of SOM's hyper-parameters

Let's plot them:

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1])
neurons = [[2, 4], [2, 8], [3, 4], [3, 8]]

for neuron in neurons:
    plt.scatter(neuron[0], neuron[1], c='orange', s=400, edgecolor='black')
plt.show()
2 by 2 Lattice – Image by Author

In orange, we can see our Lattice represented as centroids in our data. We can also represent these nodes in "Hidden Layer" format, as if we were speaking of a Neural Network, where each neuron holds one value (weight) for the X feature and one for the Y feature:

2 by 2 Lattice in Network Format – Image by Author

Black lines represent the weight (value) of the x-axis and the red dashed lines represent the weight (value) of the y-axis.

Learning 2: We can speak about the SOM's units as nodes (neural network terminology) or centroids (clustering terminology)

Step 1 of SOM Training: pick a random data point and calculate the Euclidean distance to each Neuron/Centroid:

First Step of SOM Training – Image by Author

This data point will choose the neuron with the shortest distance as the Best Matching Unit (BMU).

Learning 3: Best Matching Unit is the unit that our data point chooses as the most similar to itself

This unit is the neuron that is going to be dragged towards our data point.
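As a minimal sketch of this first step (assuming Euclidean distance and the toy lattice defined above, and picking the first data point purely for illustration):

import numpy as np

neurons = np.array([[2, 4], [2, 8], [3, 4], [3, 8]])  # the 2 by 2 lattice from above
data_point = X[0]  # a data point picked for illustration

# Euclidean distance between the data point and every node
distances = np.linalg.norm(neurons - data_point, axis=1)

# The node with the smallest distance is the Best Matching Unit (BMU)
bmu = neurons[np.argmin(distances)]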

But... is it the only node that is going to move in this training iteration? No! All other centroids are also going to move, depending on how far they are from the BMU. This is significantly different from K-Means, where centroids don't have the ability to drag other centroids in each iteration of the algorithm.

The BMU and the other nodes will be dragged towards the data point. This movement happens through a proximity function that works as the "strength" that we apply when pushing the nodes of our SOM.

One of the most usual ways to define this "proximity function" is a distance-based Gaussian approach. We start by calculating the distance between the nodes and the BMU:

Distance between nodes and BMU – Image by Author

In this case, we're using the Euclidean distance, where distance = sqrt((x2-x1)² + (y2-y1)²).

Now, we need to plug this distance into a Gaussian kernel to compute the influence that the BMU will have on the other nodes.

This Gaussian Approximation will be based on the following function:

Gaussian Kernel example – Image by Author

The numerator contains the Euclidean distance between each node and the BMU, and the denominator multiplies the standard deviation (σ) by 2. Keep in mind that this standard deviation is another hyperparameter of the algorithm!
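As a sketch, here is one common way to write this kernel in code (assuming the usual form exp(-d²/(2σ²)); the exact expression in the figure above may be written slightly differently). It reuses the neurons and bmu variables from the sketch above:

def gaussian_influence(node, bmu, sigma=1.0):
    """Influence of the BMU on a node: 1 at the BMU, decaying with distance."""
    d = np.linalg.norm(np.array(node) - np.array(bmu))
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

# Influence of the BMU on every node of the toy lattice (sigma = 1)
for node in neurons:
    print(node, round(gaussian_influence(node, bmu, sigma=1.0), 4))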

Let's see this visually. If we consider a smaller standard deviation, a gaussian kernel similar to this one will "draw" the following circle around the BMU:

Gaussian Kernel around BMU – Image by Author

The opacity of the shade on the image above is a proxy for how strong the nodes will be pulled towards the data point.

If we choose a higher standard deviation, the influence of our BMU is larger and more nodes will be affected:

Gaussian Kernel around BMU, larger standard deviation – Image by Author

Learning 4: The Standard Deviation is an Hyper-Parameter that we can configure on the SOM training

I'll call the result of the Gaussian formula howclose – for example, the howclose between our BMU (remember that it was chosen by the data point) and the other nodes is the following (using σ=1):

howclose between the BMU and each node – Image by Author

We now have the strength that we will apply to each node when pulling it towards the data point. Among the neighboring nodes, the one that receives the most strength is the node in the bottom-right corner. Conversely, we will barely move the nodes in the top-right and top-left.

We know the strength that we will use to update our SOM but we still need the direction where we will move each node. That can be achieved with the update formula:

Node Value Update Formula – Image by Author

Where:

  • w_i,k is the current value of variable i for node k.
  • howclose is the strength that we've calculated before.
  • learningrate is another hyper-parameter of the SOM algorithm, similar to the learning rate used when training other neural networks.
  • (x_i - w_i,k) is the direction in which we will push the node. In our example, all nodes will be dragged down, as the data point sits below every node (on the y variable).

The result of the calculation above gives us the new coordinates of the existing nodes in the SOM! Another important detail: we don't have a calculation for the strength to apply to the BMU (only to neighboring nodes). A common way to go around this is to apply maximum strength when dragging the BMU (output of the Gaussian function = 1).
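Putting the pieces together, here is a hedged sketch of a single update step, reusing the variables and the influence function from the sketches above (the learning rate value is illustrative, not necessarily the one used to produce the table below):

lr = 1  # illustrative learning rate; another hyper-parameter of the SOM

def update_node(node, data_point, bmu, sigma=1.0, learning_rate=lr):
    """Drag a node towards the data point, scaled by the BMU's influence."""
    node = np.array(node, dtype=float)
    # the BMU itself gets maximum strength (influence = 1)
    strength = 1.0 if np.array_equal(node, bmu) else gaussian_influence(node, bmu, sigma)
    return node + learning_rate * strength * (np.array(data_point) - node)

new_neurons = np.array([update_node(n, data_point, bmu) for n in neurons])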

What are the new coordinates for our nodes if we apply this logic? The calculations are available in the table below:

Node Value Update Summary – Image by Author

Let's see the example for the bottom-right node:

  • On the x-axis (first row of the table above), the new coordinate of this point will be 2.53, instead of 3. This is the result of applying the strength times the learning rate and updating the node towards the data point on the x variable.
  • On the y-axis (fourth row of the table above), the new coordinate of this point will be 3.377, instead of 4. This is the result of applying the strength times the learning rate and updating the node towards the data point on the y variable.

Additionally, notice that the movement of the nodes in the top-right and top-left is negligible. Why? Because howclose applies a very low strength when pushing these nodes towards the data point.

How will our BMU move? Naturally, it will be dragged almost on top of our data point:

Updating BMU – Image by Author

Based on the coordinates we've obtained in the Updated Value table above, let's view our "new" SOM:

New Nodes in the SOM – Image by Author

The purple nodes are the updated SOM nodes. Our self-organizing map is trying to mimic the underlying data points by adjusting itself based on the random point we chose. Let's see what happens in the 3-dimensional view:

New Nodes in the SOM – 3D View – Image by Author

Notice that our SOM moved a bit compared with the original random network; the new black and red dashed lines represent the new weights of the nodes.

Original Random SOM – 3D View – Image by Author

Our SOM repeats this process for all the data points available in the dataset, adjusting itself at each pass. To recap, each step is:

  • Calculate the BMU (best matching unit) using a distance metric.
  • Calculate the kernel around the BMU to know how to pull other nodes in the SOM toward the data point.
  • Pull the nodes using a combination of strength, learning rate and direction.
  • Move to the next point and repeat.

Just like in other neural network models, the number of epochs in a SOM is another hyperparameter we need to define. An epoch consists of a single pass through the entire dataset.

Learning 5: The SOM will continue to adjust itself until it reaches the number of epochs defined by the user.

In Python, we can use the minisom package to train a Self-Organizing-Map:

from minisom import MiniSom

som = MiniSom(
    2, 2, 2, sigma=1,
    learning_rate=1, neighborhood_function='gaussian', random_seed=42)
som.train(X, 100)

The parameters of the Minisom function are the following:

  • The first three numbers represent the dimensions of the SOM: number of nodes on the x-axis, number of nodes on the y-axis, and number of variables (input length).
  • Sigma contains the starting standard deviation.
  • Then we have the starting learning rate of the algorithm (in this case, set to 1).
  • neighborhood_function defines the gaussian kernel for our neighborhood.
  • random_seed ensures reproducibility of our SOM.

After fitting and training this SOM (for 100 epochs, as specified in the som.train function), where will the nodes end up? Let's see:

nodes = som.get_weights().reshape(4, 2)
plt.scatter(nodes[:,0], nodes[:,1], c='purple', s=400, edgecolor='black')
plt.show()
Trained SOM – Image by Author

Notice that our SOM tried to mimic the underlying data points as best as it could!

Although you may think that this is very similar to k-means, let's see the result from fitting a 4 centroid k-means solution on this dataset:
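As a reference sketch of that comparison (the exact k-means settings used in the post aren't specified, so treat these parameters as assumptions):

from sklearn.cluster import KMeans

# Fit a 4-centroid k-means solution on the same toy dataset
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

plt.scatter(X[:, 0], X[:, 1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='orange', s=400, edgecolor='black')
plt.show()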

Notice that the node in the top right corner was drawn toward the outlier in this k-means solution. The fact that nodes are tied to each other in the SOM through the neighborhood function is one of the main features of the algorithm, preventing nodes from being drawn away by outliers.

Ok, cool! We've had a nice introduction to SOMs, let's now see an application of SOM with real data!


Applying SOM to real data

For this part of the blog post, I'll use a dataset with information about S&P 500 companies:

import pandas as pd

financial_valuation_data = pd.read_csv('/data/constituents-financials_csv.csv')

Here are the top 5 rows of the dataframe:

financial_valuation_data top 5 rows – Image by Author

In this dataset, we can see the following columns:

  • Symbol: Ticker of the company.
  • Name: Name of the company.
  • Sector: Sector of the company.
  • Price: Current stock price of the company (as of 31st Dec 2014).
  • Price/Earnings: Ratio between price of the stock and earning-per-share.
  • Earnings/Share: Amount of earnings per each share for the company.
  • 52 Week Low: Minimum stock price in the last year.
  • 52 Week High: Maximum stock price in the last year.
  • Market Cap: Total value of the company, equals to the price of each share times the number of shares.
  • EBITDA: Earnings before interest, taxes, depreciation and amortization.
  • Price/Sales: Price divided by the revenue per share.
  • Price/Book: Financial valuation metric that divides the Price by the Book value of the company (assets-liabilities).

Additionally, I'll create a variable that emulates the volatility of the stock price (although this is not the theoretical formula of stock price volatility, it's a simple proxy):

financial_valuation_data['Volatility'] = (
    financial_valuation_data['52 Week High'] - financial_valuation_data['52 Week Low']
) / financial_valuation_data['52 Week High']

Cool! So let's start by subsetting a few columns to make our SOM simpler:

financial_valuation_data_subset = (
    financial_valuation_data[['Price/Earnings', 'Dividend Yield', 'Earnings/Share',
                              'Price/Sales', 'Price/Book', 'Volatility']]
)

Can you determine how many dimensions our SOM will have based on the subset above?

6, the number of variables we've just extracted! Our SOM from the toy example only had two dimensions (for visualization purposes). This new SOM will be harder to visualize, but luckily there are a few plots that will give us some insights on the trained model.

Another step that we didn't go through in the toy example is the process of standardization. Remember that Self-Organizing Maps start by calculating distances between the random data point and each node? Because of that (and as we normally use euclidean distance), we need to standardize our data! Let's use sklearn's StandardScaler to achieve that:

from sklearn.preprocessing import StandardScaler

scaled_data = StandardScaler().fit_transform(financial_valuation_data_subset)
scaled_data = np.nan_to_num(scaled_data, nan=0)

I'll also impute the NaNs to 0. As we have few missing values in our dataset, this won't hurt much.

With our dataset ready, we can try to train our SOM using the MiniSom package. Let's define the following hyperparameters:

  • A 15 by 15 lattice (225 nodes) – this is a large network!
  • 6 features (dimensions of the SOM).
  • A sigma of 0.5.
  • Learning Rate of 1.
  • Gaussian approximation function.

som = MiniSom(
    15,
    15,
    6,
    sigma=0.5,
    learning_rate=1,
    neighborhood_function='gaussian',
    random_seed=42
    )

And you may be thinking... what's the number of epochs that I should use for my SOM?

A cool way to visualize that is to train our SOM in batches and calculate the quantization_error, a metric that measures the average distance between each data point and its BMU:

np.random.seed(42)

num_iterations = 1000

q_errors = []
for i in range(1, num_iterations):
    som.train_batch(scaled_data, i)
    q_errors.append(som.quantization_error(scaled_data))

This tells us how good our SOM is at replicating the underlying data points. Ideally, the quantization error would be zero (almost impossible, as the goal of a SOM is to reduce the dimensionality of our dataset, so some error is always involved).
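A quick sketch of how the curve below can be plotted (the axis labels are my own):

plt.plot(range(1, num_iterations), q_errors)
plt.xlabel('Iterations')
plt.ylabel('Quantization error')
plt.show()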

SOM Training Process – Image by Author

Interestingly, after ~500 epochs, our quantization error is pretty stable. This plot gives us an overview on how many epochs we may target for the training process.

To be on the safe side, we'll train our SOM for 1000 epochs, using the same parameters as before:

som = MiniSom(
    15, 15, 6, sigma=0.5,
    learning_rate=1, neighborhood_function='gaussian', random_seed=42)
som.train(scaled_data, 1000)

We've just trained our SOM on this data! With that taken care of, we'll see how our data points relate to the map we've created.

The first row is represented by the following array:

scaled_data[0]
First Row Array – Image by Author

With a trained model, we can call the som.winner() function to retrieve the BMU (nearest SOM node):

som.winner(scaled_data[0])

The output of the code above is (2, 11) – this means that the node closest to this data point (at the end of the SOM training) is the one that sits at coordinates (2, 11) (remember that Python indices start at 0):

BMU of our first Data Point – Image by Author

Each node in the SOM above is characterized by an array of 6 values (the dimensions or number of variables). Continuing this process, let's see what the BMU is for the second data point:

som.winner(scaled_data[1])
BMU of our Second Data Point – Image by Author

We can generalize this behavior for all data points in our dataframe:

financial_valuation_data['winner_node'] = (
    [som.winner(scaled_data[i]) for i in range(0, len(scaled_data))]
)

A natural follow-up question is: "can we see which nodes have more data points near them?" (this is similar to finding the most representative clusters in k-means)

financial_valuation_data.groupby('winner_node').size().sort_values().tail(10)
Top 10 Nodes according to Data Point Representation – Image by Author

Another common plot we can do with trained SOMs is the distance map. This plot gives us the normalized average distance between each node and its neighbours:
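A minimal sketch of how this plot can be produced (minisom's som.distance_map() returns the U-Matrix; the styling of the figure below may differ):

plt.figure(figsize=(9, 9))
plt.pcolor(som.distance_map().T, cmap='bone_r')  # normalized mean distance to neighbours
plt.colorbar()
plt.show()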

SOM Distance Map (U-Matrix) – Image by Author

In this distance map, we understand that there are 7 or 8 nodes that are relatively distant (red ones) from the rest of the SOM. If those nodes have data points attached to them, there may be something special about them – for example:

Distant Node in the SOM – Image by Author
(
    financial_valuation_data
    .loc[financial_valuation_data['winner_node']
         .apply(lambda x: (x[0] == 4) & (x[1] == 0))]
)

Filtering by the companies that have this node as the BMU, we obtain the following data:

Companies attached to BMU node – Image by Author

The next question is: why are these companies tied together in this node? Probably because one of the 6 dimensions we are analyzing has a very peculiar value in this node.

We can check this by looking at the value of the weight of the variables for each node:

Weight for Each variable by Node – Image by Author

This is another useful plot to visualize trained SOMs as it shows the influence of variables in the SOM.
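A rough sketch of how per-variable weight planes like the ones shown above can be drawn from the trained SOM (the variable names are taken from the subset defined earlier; the figure's exact styling may differ):

weights = som.get_weights()  # shape: (15, 15, 6)
feature_names = financial_valuation_data_subset.columns

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for i, ax in enumerate(axes.flat):
    im = ax.pcolor(weights[:, :, i].T, cmap='coolwarm')
    ax.set_title(feature_names[i])
    fig.colorbar(im, ax=ax)
plt.show()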

For L Brands and Philip Morris, it seems that they have very high values of the Price to Book Ratio (> 1000) – this can be seen in the influence of Price/Book in the node. This may be something that requires further investigation, particularly because the average Price/Book in our dataset is 14.

Let's investigate our SOM weights further by looking at other variables, for example by checking the node areas with unusual price-to-earnings ratios:

PE Ratio Weights on the SOM Node

Starting with the high values of the nodes in the red region:

(
    financial_valuation_data 
    .loc[financial_valuation_data['winner_node']
         .apply(lambda x: (x[0] >= 9) & (x[0] <= 11) & (x[1] == 1))]
)
Companies with High P/E Ratios – Image by Author

On the other hand, let's see the blue shaded area, where we expect to have companies with small P/E ratios:

(
    financial_valuation_data
    .loc[financial_valuation_data['winner_node']
         .apply(lambda x: (x[0] >= 7) & (x[0] <= 8) & (x[1] == 12))]
)
Companies with Low P/E Ratios – Image by Author

We only have one company associated with these nodes, with a very low P/E ratio! One interesting thing is that the node holding this data point is also flagged as distant from the other nodes in the distance_map.

Having a trained SOM helps us understand the underlying structure of the data, particularly how data points cluster around these centroids and which variables relate to them. Another common thing to do with SOMs is to understand how different categories cluster around nodes. For example, are we able to capture any industry trend on the trained SOM?

We can plot the industry of every data point inside each node to assess that:
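One possible way to build such a plot (a sketch only: here each company is drawn as a jittered, sector-coloured point inside its winner node, which is a simplification of the figure below):

sector_codes = financial_valuation_data['Sector'].astype('category').cat.codes

plt.figure(figsize=(12, 12))
for (x_node, y_node), code in zip(financial_valuation_data['winner_node'], sector_codes):
    # small jitter so companies sharing a node don't fully overlap
    jx, jy = np.random.rand(2) * 0.8
    plt.scatter(x_node + jx, y_node + jy, color=plt.cm.tab20(code % 20), s=25)
plt.xlim(0, 15)
plt.ylim(0, 15)
plt.show()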

Data Points by Industry in the SOM – Image by Author

Each square represents a SOM node and inside the square we plot each data point attached to that node. With the variables that we have chosen for the network, can we identify a trend on any industry?

(Spoiler: there are actually two industries that our SOM may be able to identify and cluster together very well.)

Let's subset "Real Estate" and "Utilities" companies on the plot above:

Real Estate and Utilities Companies in the SOM – Image by Author

Look how nicely our data points cluster together in a few nodes! Let's compare this with the "Information Technology" industry:

Information Technology Companies in the SOM – Image by Author

If we wanted to build a supervised model on top of this data (predicting industry), this SOM would give us some good insight into why we would have trouble predicting the "Information Technology" class with the variables we are using.

For more examples, check the awesome examples repo by JustGlowing (the creator of the minisom package).


Conclusion

Thank you for taking the time to read this post!

SOMs are very interesting, often overlooked algorithms that are able to reveal the structure of our dataset. They can help in a variety of tasks, such as detecting outliers (with large SOMs) or building clustering solutions that are robust to them (with smaller SOMs). SOMs have historically been used in diverse problems such as image compression, detection systems and problems related to the geosciences.

When compared with K-Means and Hierarchical Clustering, Self-Organizing Maps tend to be more robust to outliers, particularly for the same number of clusters (don't confuse this with the fact that large SOMs are able to identify edge cases in the data).

In a nutshell, you can use them for a variety of Data Science tasks, such as:

  • Pre-processing steps (identifying outliers using large SOMs, understanding high correlation between features).
  • Understanding your features' ability to predict categories.
  • Building clustering algorithms.

If you enjoyed this post, make sure to read my other posts in the Unsupervised Learning series.

The dataset we've used in this post is under Open Data Commons License and is available here

Tags: Artificial Intelligence Data Science Editors Pick Machine Learning Self Organizing Map
