Mastering t-SNE: A Comprehensive Guide to Understanding and Implementation in Python


Training robust machine learning models requires large datasets with many dimensions so that sufficient structure can be recognized and the best possible predictions delivered. However, such high-dimensional data is difficult to visualize and understand. This is why dimensionality reduction methods are needed to visualize complex data structures and analyze them.

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction method that is based on the distances between data points and attempts to preserve these distances in lower dimensions. It is a method from the field of unsupervised learning and is also able to separate non-linear data, i.e. data that cannot be separated by a straight line.
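To make this concrete before diving into the theory, here is a minimal sketch of how t-SNE is typically run in Python with scikit-learn; the dataset and parameter values are only illustrative:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional images of handwritten digits as a toy dataset.
X, y = load_digits(return_X_y=True)

# Reduce the 64 dimensions to 2 for visualization.
# perplexity controls the effective number of neighbors per point.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (1797, 2)
```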

Why is dimension reduction needed?

Various algorithms, such as linear regression, have problems if the dataset contains variables that are correlated, i.e. dependent on each other. To avoid this problem, it can make sense to remove the variables from the dataset that correlate with another variable. At the same time, however, the data should retain as much of its original information content as possible.

Another application is cluster analysis, such as k-means clustering, where the number of clusters has to be defined in advance. Reducing the dimensionality of the dataset helps to get a first impression of the data and, for example, to estimate which variables are most important and how many clusters the dataset might contain. For example, if we manage to reduce the dataset to three dimensions, we can visualize the data points in a diagram, from which the number of clusters can often be read directly.

In addition, large datasets with many variables pose the risk of the model overfitting. In simple terms, this means that the model adapts too closely to the training data during training and therefore delivers poor results on new, unseen data. For neural networks, for example, it can therefore make sense to first train the model with the most important variables and then gradually add new variables that may further increase the performance of the model without overfitting.

What are the basic principles of t-SNE?

t-Distributed Stochastic Neighbor Embedding is based on several fundamental concepts, which are explained in more detail below to help you understand the basic features of the algorithm:

  • Similarity measure: The similarity between data points is quantified using distance measures that capture how far apart two points are. The Euclidean distance is typically used for this: the differences between the coordinates of two points are formed in all dimensions, squared, summed, and the square root of the sum is taken. In addition to distance measures, other similarity measures can also be used, such as cosine similarity, where the angle between two vectors determines how similar they are: the smaller the included angle, the more similar the data points. (A short code sketch follows this list.)

  • Probability distribution: t-SNE uses two probability distributions. In high-dimensional space, the data points are modeled with a Gaussian normal distribution: a data point is positioned at the center of the bell curve, and the remaining points of the dataset are placed on the curve according to their distance from this point. The closer a point is to the selected data point, the closer it lies to the center of the bell curve.

In low-dimensional space, on the other hand, the t-distribution is used, which is similar to the normal distribution except that its tails are heavier and its peak is correspondingly lower. The t-SNE algorithm aims to convert the Gaussian distribution in high-dimensional space into a t-distribution in low-dimensional space while losing as little information as possible.

Difference between t-distribution and Gaussian normal distribution | Source: Author
  • Embedding function: To complete the algorithm, an embedding function is created that enables data points in high-dimensional space to be mapped to points in low-dimensional space. Each point in the high-dimensional space is assigned exactly one point in the low-dimensional space.
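To make these building blocks tangible, here is a small sketch that computes both similarity measures and compares an (unnormalized) Gaussian kernel with the heavier-tailed Student-t kernel on the same distance; the vectors and the bandwidth are arbitrary values chosen for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])

# Euclidean distance: coordinate differences, squared, summed, then rooted.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Cosine similarity: cosine of the angle between the two vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Unnormalized similarity kernels on the squared distance, illustrating
# the two distributions: the Gaussian falls off quickly with distance,
# while the Student-t kernel has heavier tails.
sq_dist = euclidean ** 2
sigma = 2.0  # illustrative bandwidth, not a tuned value
gaussian = np.exp(-sq_dist / (2 * sigma ** 2))  # used in high-dimensional space
student_t = 1.0 / (1.0 + sq_dist)               # used in low-dimensional space

print(euclidean, cosine, gaussian, student_t)
```

The heavier tails of the t-distribution are what later allow moderately distant points to be placed further apart in the low-dimensional embedding without distorting the local neighborhoods.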

These basic principles form the basis for a good understanding of the t-SNE algorithm.

What is the idea behind t-SNE?

It is not easy to convert a dataset from a high to a low dimensionality while retaining as much of its information as possible. The following figure shows a simple, two-dimensional dataset with a total of 50 data points. Three different clusters can be identified, which are also well separated from each other. The yellow cluster is furthest away from the other two clusters, while the purple and blue data points are closer to each other.

Two-dimensional data for t-SNE | Source: Author

The aim now is to convert this two-dimensional dataset into a lower dimension, i.e. into one dimension. The simplest approach for this would be to represent the data either only by its X or Y coordinate.

Dimensionality reduction by X- or Y-coordinate | Source: Author

However, it is clear that this simple transformation has lost much of the information in the dataset and gives a different picture than the original two-dimensional data.

If only the X coordinates are used, it looks as if the yellow and purple clusters overlap and that all three clusters are roughly equidistant from each other. If, on the other hand, only the Y coordinates are used for dimensionality reduction, the yellow cluster is much better separated from the other clusters, but it looks as if the purple and blue clusters overlap.
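This comparison can be reproduced on synthetic data. The following sketch reduces three well-separated clusters to one dimension, once by simply dropping a coordinate and once with t-SNE; the cluster layout and parameters are illustrative, not the exact data from the figures above:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Three well-separated two-dimensional clusters with 50 points in total.
X, labels = make_blobs(n_samples=50, centers=3, random_state=0)

# Naive one-dimensional reductions: keep only one coordinate.
x_only = X[:, 0]  # clusters may overlap along this axis
y_only = X[:, 1]  # ... or along this one

# t-SNE reduction to one dimension, preserving neighborhood structure.
x_tsne = TSNE(n_components=1, perplexity=15, random_state=0).fit_transform(X).ravel()
```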

The basic idea of t-SNE is that the distances from the high-dimensional space are preserved in the low-dimensional space as far as possible. To do this, it uses a stochastic approach and converts the distances between the points into probabilities that indicate how likely it is that two randomly selected points are neighbors.

More precisely, it is a conditional probability that indicates how likely it is that one point would choose the other point as a neighbor. Hence the name "Stochastic Neighbor Embedding".

Dimensionality reduction according to t-SNE | Source: Author

As you can see, this approach leads to a much better result, in which the three different clusters can be clearly distinguished from one another. It is also clear that the yellow data points are significantly further away from the other data points and that the blue and purple clusters are somewhat closer together.

How does the t-SNE algorithm work?

In the first step, the t-SNE algorithm calculates a Gaussian probability distribution for all points. This indicates how likely it is that two randomly selected points are neighbors. A high probability indicates that the two points are close to each other, while a low probability indicates a large distance between the points.

For any given yellow data point, this distribution might look like the diagram below. The other yellow data points are very close to the selected point and therefore have a high probability of being its neighbor. The blue and purple points lie at partly similar distances and therefore overlap on the curve. However, all of them are significantly less likely to be neighbors and are therefore placed further out on the normal distribution.

Gaussian distribution of data points at t-SNE | Source: Author

The paper in which t-SNE was first described also defines the conditional probability of a point with respect to itself as zero. The selected data point therefore lies at the origin of the graph, as a point cannot be its own neighbor.

Mathematically, this curve is expressed as the conditional probability that a point $x_i$ would pick another point $x_j$ as its neighbor. In the original paper by van der Maaten and Hinton, it is defined as:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

where $\sigma_i$ is the bandwidth of the Gaussian centered on $x_i$, and $p_{i|i}$ is set to zero.
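A direct NumPy translation of this definition might look like the following sketch. Note that a single global bandwidth is assumed here for simplicity, whereas the real algorithm tunes one sigma per point via a binary search so that each distribution matches a user-defined perplexity:

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Gaussian conditional probabilities p(j|i), following the definition above.

    A single global bandwidth sigma is assumed for simplicity; the real
    algorithm determines one sigma_i per point from the perplexity.
    """
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian affinities.
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))
    # A point is never its own neighbor: p(i|i) = 0.
    np.fill_diagonal(affinities, 0.0)
    # Normalize each row so the probabilities for point i sum to 1.
    return affinities / affinities.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
P = conditional_probabilities(rng.rand(6, 2), sigma=0.5)
print(P.sum(axis=1))  # each row sums to 1
```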
