How t-SNE Outperforms PCA in Dimensionality Reduction

In machine learning, dimensionality reduction refers to reducing the number of input variables in a dataset; that number of input variables is what we call the dimensionality of the dataset.
Dimensionality reduction techniques are divided into two main categories: Linear and Non-linear (Manifold).
Under linear methods, we have discussed Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis (LDA) and Non-Negative Matrix Factorization (NMF).
Under non-linear methods, we have discussed Autoencoders (AEs) and Kernel PCA.
t-Distributed Stochastic Neighbor Embedding (t-SNE) is also a non-linear dimensionality reduction method used for visualizing high-dimensional data in a lower-dimensional space to find important clusters or groups in the data.
All dimensionality reduction techniques fall under the category of unsupervised machine learning, in which we reveal hidden patterns and important relationships in the data without requiring labels.
So, dimensionality reduction algorithms deal with unlabeled data. When training such algorithms, the fit() method only needs the feature matrix, X, as the input and does not require the label column, y – Source: Non-Negative Matrix Factorization (NMF) for Dimensionality Reduction in Image Data
What you will learn:
---------------------------------------------------
01. Advantages of t-SNE over PCA
02. Disadvantages of t-SNE
03. How t-SNE works
04. Python implementation of t-SNE - TSNE() class
05. Important arguments of TSNE() class
06. KL divergence
07. The MNIST data in tabular format
08. Visualizing MNIST data using PCA
09. Visualizing MNIST data using t-SNE
10. PCA before t-SNE (Very special trick)
11. Choosing the right value for perplexity
12. Choosing the right number of iterations
13. Randomness of initialization
14. PCA initialization
15. Using a random state in random initialization
Other dimensionality reduction methods
---------------------------------------------------
1. Principal Component Analysis (PCA)
2. Factor Analysis (FA)
3. Linear Discriminant Analysis (LDA)
4. Non-Negative Matrix Factorization (NMF)
5. Autoencoders (AEs)
6. Kernel PCA
Advantages of t-SNE over PCA
There are mainly two advantages of t-SNE over PCA:
- t-SNE can preserve the spatial relationship between data points after reducing the dimensionality of the data. It means that the nearby data (points with similar characteristics) in the original dimension will still be nearby in the lower dimension! That is why t-SNE is mostly used to find clusters in the data.
- t-SNE can handle non-linear data which is very common in real-world applications.
PCA tries to reduce dimensionality by maximizing variance in the data while t-SNE tries to do the same by keeping similar data points together (and dissimilar data points apart) in both higher and lower dimensions.
Because of these reasons, t-SNE can easily outperform PCA in dimensionality reduction. Today, you will see this in action with an actual implementation of both t-SNE and PCA on the same dataset. You can compare the outputs of both methods and verify the fact!
Disadvantages of t-SNE
- The major disadvantage of t-SNE is that it requires a lot of computational resources to execute the algorithm on large datasets. So, it is time-consuming to run t-SNE when the dimensionality of the data is very high. To avoid this, we will discuss a very special trick with which you can get almost the same results in significantly less time!
- Another disadvantage is that you will get unstable (different) results with t-SNE if you use random initialization even with the same hyperparameter values. You will learn more about this at the end.
You will also learn how to tune some of the most important hyperparameters that come with Scikit-learn's t-SNE algorithm for obtaining even better results!
So, just continue reading!
How t-SNE works
There are two probability distributions involved in t-SNE calculations. So, the algorithm is stochastic by nature as its name implies.
- In the high-dimensional space, we use a Gaussian (normal) distribution to convert pairwise distances between data points into conditional probabilities.
- In the low-dimensional space, we use a Student's t-distribution to convert pairwise distances between data points into conditional probabilities.
The KL divergence measures the divergence (difference) between these two probability distributions. It is the cost function that we try to minimize during the training of the algorithm.
So, t-SNE aims to place the data points in the low-dimensional space in such a way that the KL divergence between the two probability distributions is minimized as much as possible.
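To make the cost function concrete, here is a minimal NumPy sketch of the KL divergence between two illustrative pairwise-similarity matrices P and Q. These toy matrices are placeholders for illustration only, not Scikit-learn's internal implementation.
import numpy as np

# Toy pairwise-similarity matrices (each row roughly sums to 1).
# In t-SNE, P comes from Gaussian similarities in the high-dimensional space
# and Q from Student's t-distribution similarities in the low-dimensional space.
P = np.array([[0.0, 0.7, 0.3],
              [0.6, 0.0, 0.4],
              [0.5, 0.5, 0.0]])
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

eps = 1e-12  # guard against log(0)
kl_divergence = np.sum(P * np.log((P + eps) / (Q + eps)))
print("KL divergence:", kl_divergence)  # the quantity t-SNE's optimizer minimizes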
t-SNE requires heavy computational resources and too much time to do these probability calculations, especially with large datasets. That's why t-SNE is extremely slow when compared to PCA. As a solution for this, we will discuss a very special trick.
Python implementation of t-SNE
In Python, t-SNE is implemented in Scikit-learn's TSNE() class. Scikit-learn is a popular Python machine-learning library.
First, you need to import and create an instance of the TSNE() class.
# Import
from sklearn.manifold import TSNE
# Create an instance
TSNE_model = TSNE(n_components=2, perplexity=30.0, learning_rate='auto',
                  n_iter=1000, init='pca', random_state=None)
Important arguments of TSNE() class
There are many arguments in the TSNE() class. If we do not specify them directly, they take their default values when we create a TSNE() instance. To learn more about those arguments, refer to the Scikit-learn documentation.
The following list contains detailed explanations of the most important arguments in the TSNE() class.
- n_components: An integer that defines the number of dimensions of the embedded space. The default is 2. The most common values are 2 and 3 which are used to visualize data in 2D and 3D spaces respectively as t-SNE is mostly used for data visualization.
- perplexity: The number of nearest neighbors to consider when visualizing the data. This is the most important argument in the TSNE() class. The default is 30.0, but trying out values between 5 and 50 is highly recommended as different values can produce significantly different outputs. You need to plot the KL divergence for different perplexity values and choose the right one. This is technically called hyperparameter tuning. In general, larger datasets require larger perplexity values.
- learning_rate: The TSNE() class uses stochastic gradient descent to minimize the KL divergence cost function. The stochastic gradient descent optimizer requires a proper learning rate value to execute the process. The learning rate determines how fast or slow the optimizer minimizes the loss function. Too large a value may cause the model to fail to converge on a solution, while too small a value may take too long to converge.
- n_iter: The maximum number of iterations you set for gradient descent. The default is 1000.
- init: The initialization method. There are two options: "random" or "pca". The default is PCA initialization, which is more stable than random initialization and produces the same results across different executions. If we choose random initialization, the algorithm will generate different results at different executions because t-SNE has a non-convex cost function and the GD optimizer may get stuck in a local minimum.
- random_state: An integer used to get the same results across different executions when using random initialization, which otherwise generates significantly different results each time we run the algorithm. Popular integers are 0, 1, 2 and 42.
Important methods of TSNE() class
- fit(X): Learns a TSNE model from the feature matrix, X. No transformation is applied here.
- fit_transform(X): Learns a TSNE model from the feature matrix, X and returns the TSNE transformed data.
TSNE_transformed_data = TSNE_model.fit_transform(X)
Important attributes of TSNE() class
- kl_divergence_: Returns the Kullback-Leibler divergence after optimization. The GD optimizer tries to minimize the KL divergence during training. Analyzing this divergence for different perplexity values is a great way of choosing the right value for perplexity (more on this shortly) – see the short example below.
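As a quick illustration (a minimal sketch, assuming a feature matrix X such as the one used with fit_transform() above), the attribute can be read from a fitted model like this:
TSNE_model = TSNE(n_components=2, perplexity=30.0)
TSNE_model.fit(X)  # X is the feature matrix (unlabeled data)
print("KL divergence after optimization:", TSNE_model.kl_divergence_)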
Python implementation of PCA (Optional)
This article also includes the use of the PCA algorithm for visualizing the MNIST data in a lower-dimensional space. Therefore, it is worth discussing the Python implementation of the PCA algorithm, although it is optional here. I have already published in-depth articles on PCA, which can be found at the following links.
PCA articles: 3 Easy Steps to Perform Dimensionality Reduction Using Principal Component Analysis (PCA), Principal Component Analysis (PCA) with Scikit-learn, An In-depth Guide to PCA with NumPy.
The MNIST data in tabular format
There are different ways of loading the MNIST dataset which contains 70,000 grayscale images of handwritten digits under 10 categories (0 to 9).
- Using Scikit-learn API: We get the shape of (70000, 784), which is the required tabular format for the TSNE and PCA algorithms. The dataset will be loaded as a Pandas data frame. There are 70,000 observations (images) in the dataset, and each observation has 784 (28 x 28) features (pixel values), since the size of an image is 28 x 28.
- Using Keras API: We get the shape of (60000, 28, 28) for the training set and (10000, 28, 28) for the test set. The data will be loaded as three-dimensional NumPy arrays. This format cannot be used directly with the TSNE and PCA algorithms; we need to reshape the MNIST data, as sketched below.
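For reference, a minimal sketch of that reshaping when loading through Keras could look like this (the import path assumes TensorFlow's bundled Keras; the variable names are illustrative):
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()  # (60000, 28, 28) and (10000, 28, 28)
x_train = x_train.reshape(-1, 784)  # flatten each 28 x 28 image into 784 pixel values
x_test = x_test.reshape(-1, 784)    # now both sets are in the tabular format TSNE and PCA expect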
Note: To learn more about the differences between the two APIs, click here.
For now, we will load the MNIST dataset using the Scikit-learn API. To speed up the computation process when using TSNE and PCA, we only load a part (the first 10,000 instances) of the MNIST dataset.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
image_data = mnist['data'][0:10000]
labels = mnist['target'][0:10000]
print("Data shape:", image_data.shape)
print("Data type:", type(image_data))
print()
print("Label shape:", labels.shape)
print("Label type:", type(labels))

We also normalize the pixel values to the range [0, 1] before using them with PCA and t-SNE.
# Normalize the pixel values
image_data = image_data.astype('float32') / 255
Visualizing MNIST data using PCA
As you can see in the above output, the original dimensionality of the MNIST data is 784, which cannot be plotted in a 2D plot. So, we need to reduce the number of dimensions to 2 by applying PCA.
Let's see the output.
from sklearn.decomposition import PCA
PCA_model = PCA(n_components=2)
PCA_transformed_data = PCA_model.fit_transform(image_data)
print("PCA transformed data shape:", PCA_transformed_data.shape)
print("PCA transformed data type:", type(PCA_transformed_data))

The PCA-transformed MNIST data has the shape of (10000, 2) which can now be plotted in a 2D plot.
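The plotting code for this figure is not shown above; a minimal Matplotlib sketch (reusing the labels variable loaded earlier) could look like the following. The same few lines can be reused later for the t-SNE output by swapping in the transformed data.
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=[7, 4.9])
plt.scatter(PCA_transformed_data[:, 0], PCA_transformed_data[:, 1],
            c=np.array(labels).astype('int32'), s=5, cmap='tab10')
plt.title('Lower dimensional representation of MNIST data - PCA')
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.show()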

It is clear that all data points are in a single cluster and we cannot see different clusters for each class label. This is not the representation we need.
Let's fix this by applying TSNE to the MNIST data.
Visualizing MNIST data using t-SNE
Now, we will apply t-SNE on the same dataset. When applying t-SNE, we will use default values for all arguments (hyperparameters) in the TSNE() class.
from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=2, perplexity=30.0)
TSNE_transformed_data = TSNE_model.fit_transform(image_data)
print("TSNE transformed data shape:", TSNE_transformed_data.shape)
print("TSNE transformed data type:", type(TSNE_transformed_data))

The TSNE-transformed MNIST data has the shape of (10000, 2), which can now be plotted in a 2D plot just as before.

It is clear that data points are separated as clusters according to their class labels. The nearby data points of the same class in the original dimension will still be nearby in the lower dimension!
t-SNE is more effective than PCA for dimensionality reduction as it keeps similar data points together (and dissimilar data points apart) in both the higher and lower dimensions and works well with non-linear data.
PCA before t-SNE – Combining both methods (Very special trick)
PCA cannot maintain the spatial relationship between data points after dimensionality reduction, although it executes very fast compared to t-SNE.
On the other hand, t-SNE is really slow with larger datasets, but it can preserve the spatial relationship between data points after dimensionality reduction.
To speed up the computation process of t-SNE, we apply PCA before t-SNE and combine both methods as in the following diagram.

First, we apply PCA to the MNIST data and reduce the dimensionality to 100 (keep only 100 dimensions/components/features). Then, we apply t-SNE to the PCA-transformed MNIST data. This time, t-SNE only sees 100 features instead of 784 and does not need to perform nearly as much computation. Now, t-SNE executes really fast but still manages to generate the same or even better results!
By applying PCA before t-SNE, you will get the following benefits.
- PCA removes noise in the data and keeps only the most important features in the data. By feeding PCA-transformed data into t-SNE, you will get an even better output!
- PCA removes multicollinearity between the input features. The PCA-transformed data has uncorrelated variables which are fed into t-SNE.
- As I already said, PCA reduces the number of features significantly. The PCA-transformed data will be fed into t-SNE and you will get the results very fast!
import numpy as np
import matplotlib.pyplot as plt

# Step 1: PCA reduces the 784 features to 100 components
PCA_model = PCA(n_components=100)
PCA_transformed_data = PCA_model.fit_transform(image_data)

# Step 2: t-SNE runs on the 100-dimensional PCA output
TSNE_model = TSNE(n_components=2, perplexity=30.0)
PCA_TSNE_transformed_data = TSNE_model.fit_transform(PCA_transformed_data)

# Step 3: Plot the 2D embedding, colored by digit label
plt.figure(figsize=[7, 4.9])
plt.scatter(PCA_TSNE_transformed_data[:, 0], PCA_TSNE_transformed_data[:, 1],
            c=np.array(labels).astype('int32'), s=5, cmap='tab10')
plt.title('Lower dimensional representation of MNIST data - TSNE after PCA')
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.savefig("PCA_TSNE.png")
plt.show()

It took 100 seconds to run TSNE with all 784 features. After applying PCA, it only took 20 seconds to run TSNE with PCA-transformed data which has 100 features.
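These timings will, of course, vary with your hardware. If you want to measure them yourself, one simple approach (a sketch, not part of the original code) is to wrap the fit_transform() call with time.perf_counter():
import time

start = time.perf_counter()
TSNE_model.fit_transform(PCA_transformed_data)  # or image_data for all 784 features
print(f"t-SNE took {time.perf_counter() - start:.1f} seconds")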
The PCA-transformed data accurately represents the original MNIST data, as the first 100 components capture about 90% of the variance in the original data. We can confirm this by looking at the following plot. Therefore, it is reasonable to feed the PCA-transformed data to t-SNE in place of the original data.
# Fit PCA with all 784 components and plot the cumulative explained variance
pca_all = PCA(n_components=784)
pca_all.fit(image_data)

plt.figure(figsize=[5, 3.5])
plt.grid()
plt.plot(np.cumsum(pca_all.explained_variance_ratio_ * 100))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance (%)')
plt.show()

Choosing the right value for perplexity
Perplexity determines the number of nearest neighbors to consider when visualizing data. It is the most important hyperparameter in t-SNE. Therefore, you need to tune it properly.
One option is to plot the KL divergences for different perplexity values and analyze how the KL divergence behaves as the perplexity increases.
perplexity_vals = np.arange(10, 220, 10)
KL_divergences = []

# Fit a t-SNE model for each perplexity value and record the final KL divergence
for i in perplexity_vals:
    TSNE_model = TSNE(n_components=2, perplexity=i, n_iter=500).fit(PCA_transformed_data)
    KL_divergences.append(TSNE_model.kl_divergence_)

plt.style.use("ggplot")
plt.figure(figsize=[5, 3.5])
plt.plot(perplexity_vals, KL_divergences, marker='o', color='blue')
plt.xlabel("Perplexity values")
plt.ylabel("KL divergence")
plt.show()

The KL divergence decreases continuously as the perplexity value increases! Therefore, we cannot decide the right value for perplexity by analyzing this plot alone.
As the second option, we need to run the TSNE algorithm several times with different perplexity values and visualize the results.
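The code that produced those comparison plots is not shown here; a minimal sketch of the idea, looping over a few perplexity values and drawing one subplot per value (reusing PCA_transformed_data and labels from above), might look like this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

perplexity_values = [5, 30, 50, 100, 210, 500]
fig, axes = plt.subplots(2, 3, figsize=[12, 7])

for ax, p in zip(axes.ravel(), perplexity_values):
    # Embed the data with the current perplexity and plot it, colored by digit label
    embedding = TSNE(n_components=2, perplexity=p).fit_transform(PCA_transformed_data)
    ax.scatter(embedding[:, 0], embedding[:, 1],
               c=np.array(labels).astype('int32'), s=2, cmap='tab10')
    ax.set_title(f"Perplexity = {p}")

plt.tight_layout()
plt.show()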

At perplexity = 5 (a low value), clear clusters are formed, but there is not enough space between the clusters and the points within each cluster are spread apart rather than tightly packed.
At perplexity = 30, 50 and 100 (suitable values), the space between the clusters increases and the points within each cluster are well concentrated.
At perplexity = 210 and 500 (excessive values), clusters tend to overlap each other, which is not what we want.
For the MNIST dataset, any perplexity value between 30 and 50 will be good enough.
Note: The optimal value of perplexity highly depends on the nature and volume of the data. In general, larger datasets require larger perplexity values. It is always good practice to start with the default value which is 30.
Choosing the right number of iterations
Now, we will visualize the effect of the number of iterations used in the gradient descent optimization algorithm. TSNE uses stochastic gradient descent to minimize the KL divergence cost function.
The number of iterations determines how many times the optimizer performs gradient updates during optimization. The default is 1000, and it should be at least 250.
The number of iterations is specified in the n_iter argument.
from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=2, perplexity=30.0, n_iter=250)
In this experiment, we will try out three different values which are 250, 500 and 1000.
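The plotting code for the three figures below is omitted; a minimal sketch of the experiment (reusing PCA_transformed_data and labels from above) could be:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

for n in [250, 500, 1000]:
    # Embed the data with the current iteration budget and plot it
    embedding = TSNE(n_components=2, perplexity=30.0,
                     n_iter=n).fit_transform(PCA_transformed_data)
    plt.figure(figsize=[5, 3.5])
    plt.scatter(embedding[:, 0], embedding[:, 1],
                c=np.array(labels).astype('int32'), s=2, cmap='tab10')
    plt.title(f"{n} iterations")
    plt.show()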
250 iterations

500 iterations

1000 iterations

After running for 250 iterations, clusters are not clearly formed because the cost function is not fully minimized. So, we need to give the algorithm more time to run for further optimization. After running for 500 iterations, the separation of clusters is very clear. Once optimized, increasing the number of iterations will have little to no effect on changing the cluster formation! It will only increase the training time! In this case, 500 iterations are sufficient to optimize the algorithm.
Note: The selection of the number of iterations highly depends on the nature and volume of the data.
Randomness of initialization
There are two types of initialization for TSNE:
- Random initialization
from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=2, perplexity=30.0, init='random')
- PCA initialization (default)
from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=2, perplexity=30.0, init='pca')
# or
TSNE_model = TSNE(n_components=2, perplexity=30.0) # default
PCA initialization is more stable than random initialization and produces the same results across different executions.
In contrast, random initialization will generate different results at different executions because t-SNE has a non-convex cost function and the GD optimizer may get stuck in a local minimum, as in the following diagram.

If the initialization occurs near point A, the optimizer may get stuck in the local minimum and the cost function will not be fully minimized.
If the initialization occurs near point B, the optimizer may reach the global minimum and, this time, t-SNE will generate a completely different output than in the previous case.
There may be many local minima in the cost function. Therefore, different initializations may cause the algorithm to get stuck in different local minima. This is the case for random initialization; PCA initialization does not suffer from this!
Using a random state in random initialization
If you still want to get stable results at different executions of the TSNE algorithm with the random initialization, specify the random_state argument with an integer like 0, 1, or 42.
from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=2, perplexity=30.0, init='random',
                  random_state=42)
If you run this many times, you will get the same results. However, if you specify 0 for random_state, you will get different results than in the previous case.
from sklearn.manifold import TSNE
TSNE_model = TSNE(n_components=2, perplexity=30.0, init='random',
                  random_state=0)
Let's see two outputs:
Random state = 42

Random state = 0

You will get completely different outputs with different random state values!
Note: Specifying a random state will not prevent the algorithm from being stuck at a local minimum.
This is the end of today's article.
Please let me know if you have any questions or feedback.
Other dimensionality reduction methods you might be interested in
- Principal Component Analysis (PCA)
- Factor Analysis (FA)
- Linear Discriminant Analysis (LDA)
- Non-Negative Matrix Factorization (NMF)
- Autoencoders (AEs)
- Kernel PCA
Thank you so much for your continuous support! See you in the next article. Happy learning to everyone!
MNIST dataset info
- Citation: Deng, L., 2012. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), pp. 141–142.
- Source: http://yann.lecun.com/exdb/mnist/
- License: Yann LeCun (Courant Institute, NYU) and Corinna Cortes (Google Labs, New York) hold the copyright of the MNIST dataset which is available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA). You can learn more about different dataset license types here.
Designed and written by: Rukshan Pramoditha
2023–05–23