Advanced Dimensionality Reduction Models Made Simple


When approaching a Machine Learning task, have you ever felt stunned by the massive number of features?

Most Data Scientists experience this overwhelming challenge on a daily basis. While adding features enriches data, it often slows the training process and makes it harder to detect hidden patterns, resulting in the (in)famous Curse of Dimensionality.

Moreover, surprising phenomena occur in high-dimensional spaces. To depict this concept with an analogy, think of the novel Flatland, where characters living in a flat (2-dimensional) world are stunned when they encounter a 3-dimensional being. In the same way, we struggle to comprehend that, in high-dimensional spaces, most points are outliers, and distances between points are usually larger than expected. All these phenomena, if not handled correctly, may have disastrous implications for our Machine Learning models.

Image by the author.

In this post, I will explain some advanced Dimensionality Reduction techniques used to mitigate this issue.

In my previous post, I introduced the relevance of Dimensionality Reduction in Machine Learning problems and how to tame the Curse of Dimensionality, covering both the theory and the Scikit-Learn implementation of the Principal Component Analysis algorithm.

In this follow-up, I will dive into additional Dimensionality Reduction algorithms, such as Kernel PCA (kPCA) and Locally Linear Embedding (LLE), that overcome some of the limitations of PCA.

Do not worry if you haven't read my introduction to Dimensionality Reduction yet: this post is a stand-alone guide, and I will explain each concept in simple terms. However, if you'd like to learn more about PCA first, I'm confident this guide will serve you well:

Dimensionality Reduction with Scikit-Learn: PCA Theory and Implementation

Curse of Dimensionality

The Curse of Dimensionality is a significant challenge in Data Science, making algorithms computationally intensive. As the number of dimensions in a dataset increases, data points become sparse, leading to non-trivial computational issues.

Humans struggle to conceptualize anything beyond 3 (or sometimes even 2) dimensions, which makes the behavior of high-dimensional spaces deeply counterintuitive.

Consider a 2D square: a random point inside it is unlikely to be near a border. In higher-dimensional spaces, however, the probability that a random point lies close to a border grows exponentially: in a 10,000-dimensional hypercube, it exceeds 99.99999%. Similarly, the average distance between two random points in a 100,000-dimensional unit hypercube is about 129.09, highlighting the vastness of high-dimensional spaces. I will leave you with the code to test these statements yourself.

articles/dimensionality-reduction/curse-of-dimensionality.ipynb at main · andreoniriccardo/articles
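If you want a quick sanity check without opening the notebook, here is a minimal NumPy sketch of the same experiments (the sample sizes and the 0.001 border threshold are my own illustrative choices):

import numpy as np

rng = np.random.default_rng(42)

# Estimate the probability that a random point in a 10,000-dimensional
# unit hypercube lies within 0.001 of a border
points = rng.random((1_000, 10_000))
near_border = np.any((points < 0.001) | (points > 0.999), axis=1)
print(f"Fraction of points near a border: {near_border.mean():.4f}")  # ~1.0

# Estimate the average distance between two random points
# in a 100,000-dimensional unit hypercube
a = rng.random((100, 100_000))
b = rng.random((100, 100_000))
print(f"Average distance: {np.linalg.norm(a - b, axis=1).mean():.2f}")  # ~129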

For Data Science practitioners, this implies that high-dimensional datasets are incredibly sparse, with training instances far apart, raising the risk of overfitting Machine Learning models.

Adding more training examples isn't practical due to the exponential increase in required points as the number of features grows: to maintain a distance of 0.1 between neighbouring training points in a 100-feature dataset, you would need more points than there are atoms in the observable universe. The only viable solution is Dimensionality Reduction, and I will explore techniques to achieve this without significant information loss.
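To get a feeling for the numbers, here is a back-of-the-envelope check, assuming each of the 100 features is scaled to the interval [0, 1] and we want a regular grid with spacing 0.1 along every axis:

# 11 grid values per axis (0.0, 0.1, ..., 1.0) in each of the 100 dimensions
grid_points = 11 ** 100
atoms_in_universe = 10 ** 80  # commonly cited order-of-magnitude estimate
print(f"{grid_points:.2e} grid points vs ~{atoms_in_universe:.2e} atoms")
print(grid_points > atoms_in_universe)  # True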

Image by the author.

For a broader introduction to the Curse of Dimensionality, I recommend this article.

Now it's time to present some of the most advanced Dimensionality Reduction algorithms!

Kernel PCA

Principal Component Analysis (PCA), by far the most widely used Dimensionality Reduction model, projects the data onto a lower-dimensional hyperplane, reducing the number of features.

This simple approach struggles with datasets that can't be effectively projected onto a hyperplane without losing a considerable portion of the original information.

Take the synthetic Swiss-roll dataset as an example:

Image by the author.

Projecting this data onto a plane is meaningless: the non-linear structure of the original data is lost.

The most straightforward way to apply the PCA algorithm to a wider variety of datasets is to exploit the Kernel Trick. I presented the Kernel Trick in detail in this post:

Support Vector Machine with Scikit-Learn: A Friendly Introduction

The Kernel Trick implicitly maps data points from a lower-dimensional space to a higher-dimensional one, without ever computing their coordinates in that space. Support Vector Machines exploit it to draw linear decision boundaries in the high-dimensional space, which correspond to complex non-linear boundaries in the original feature space.
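As a small, self-contained illustration (not taken from the post above), the RBF kernel computes the similarity two points would have in that higher-dimensional space directly from their original coordinates:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two points in the original (low-dimensional) space
x = np.array([[1.0, 2.0]])
y = np.array([[2.0, 0.5]])

gamma = 0.1
# Manual RBF kernel: exp(-gamma * ||x - y||^2)
manual = np.exp(-gamma * np.sum((x - y) ** 2))
print(manual, rbf_kernel(x, y, gamma=gamma)[0, 0])  # the two values match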

The same technique can be applied to the PCA algorithm, resulting in the Kernel PCA (kPCA).

Applying Kernel PCA in Python is easy thanks to the Scikit-learn (sklearn) library. In its simplest form, we just need to instantiate a KernelPCA() object and specify the type of kernel to use and the number of output dimensions.

# Import required libraries
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Generate synthetic data
X_swiss, t = make_swiss_roll(n_samples=1500, noise=0.3, random_state=2)

# Instantiate a KernelPCA object, specifying the kernel type 
# and the output's dimensions
pca_swiss = KernelPCA(n_components=2, kernel='sigmoid', gamma=1e-3, coef0=1, fit_inverse_transform=True)
# Transform the original data
X_pca_swiss = pca_swiss.fit_transform(X_swiss)

# Plot the reduced data (minimal plotting sketch; the article's original
# plotting code is not shown here)
import matplotlib.pyplot as plt
plt.scatter(X_pca_swiss[:, 0], X_pca_swiss[:, 1], c=t, s=5)
plt.title('Swiss roll reduced to 2D with sigmoid kPCA')
plt.show()

As a result, we obtain the following reduced dataset:

Image by the author.

Based on the type of Kernel we pick, kPCA outputs completely different results:

Image by the author.

In this simple example, I experimented with 4 different types of Kernels: Linear, Sigmoid, Polynomial, and Cosine. We can immediately appreciate how the Linear and Sigmoid Kernels produce acceptable results with the provided dataset. The Polynomial and Cosine Kernels appear to be poorly suited for this dataset.
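For reference, the four embeddings compared above can be generated with a loop along these lines (gamma and coef0 mirror the earlier snippet and are simply ignored by the kernels that don't use them):

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X_swiss, t = make_swiss_roll(n_samples=1500, noise=0.3, random_state=2)

# Fit one Kernel PCA per kernel type and store the 2D embeddings
embeddings = {}
for kernel in ['linear', 'sigmoid', 'poly', 'cosine']:
    kpca = KernelPCA(n_components=2, kernel=kernel, gamma=1e-3, coef0=1)
    embeddings[kernel] = kpca.fit_transform(X_swiss)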

At this point, you might be asking yourself the following question:

Is there a robust method for choosing the right Kernel?

The next section aims to solve this issue!

How to Select the Kernel

Since Kernel PCA is an unsupervised learning model, we can't simply run a Grid Search and pick the best hyperparameters based on a performance measure. As a consequence, selecting the type of Kernel to use is tricky.

If kPCA is simply a preprocessing step in your Machine Learning pipeline, followed by a classification or regression model, we can indeed opt for a Grid Search approach. The idea is to treat the kernel type (and its parameters) as hyperparameters of the downstream model and pick the combination that yields the best performance measure.
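As a sketch of this idea, we can wrap kPCA and a classifier in a Pipeline and let GridSearchCV select the kernel and gamma. The downstream task here is hypothetical: I simply derive a binary label by thresholding the Swiss Roll's t coordinate for illustration.

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_swiss, t = make_swiss_roll(n_samples=1500, noise=0.3, random_state=2)
y = (t > t.mean()).astype(int)  # hypothetical binary target

# kPCA as a preprocessing step, followed by a simple classifier
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

# Treat the kernel type and gamma as hyperparameters of the whole pipeline
param_grid = {
    "kpca__kernel": ["linear", "rbf", "sigmoid"],
    "kpca__gamma": np.linspace(0.001, 0.1, 5),
}

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X_swiss, y)
print(grid_search.best_params_)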

If we are treating the kPCA algorithm as a stand-alone model, an alternative consists of choosing the Kernel that yields the lowest reconstruction error.

To do that with Scikit-learn, we can exploit the inverse_transform() method:

from sklearn.datasets import make_swiss_roll
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import KernelPCA

X_swiss, t = make_swiss_roll(n_samples=1500, noise=0.3, random_state=2)

kPCA_dict = {
    "linear": KernelPCA(n_components=2, kernel='linear', fit_inverse_transform=True),
    "sigmoid": KernelPCA(n_components=2, kernel='sigmoid', gamma=1e-3, coef0=1, fit_inverse_transform=True)
}

X_pca_swiss = {}
X_preimage = {}

for kernel_type in kPCA_dict.keys():
    X_pca_swiss[kernel_type] = kPCA_dict[kernel_type].fit_transform(X_swiss)
    X_preimage[kernel_type] = kPCA_dict[kernel_type].inverse_transform(X_pca_swiss[kernel_type])
    mse = mean_squared_error(X_swiss, X_preimage[kernel_type])
    print('{} kernel -> MSE: {}'.format(kernel_type, mse))

We can see that the kPCA with the Sigmoid kernel achieves a slightly lower reconstruction error than the one with the Linear kernel.

LLE

A powerful Dimensionality Reduction technique that is completely different from what we have seen so far, and that doesn't rely on projections, is Locally Linear Embedding (LLE). LLE focuses on preserving local relationships within the data.

In simpler terms, imagine you have a set of data points in a high-dimensional space, where each dimension represents a different feature. Rather than looking at the entire dataset at once, LLE focuses on small groups of neighbouring points and tries to represent them in a lower-dimensional space while maintaining their relationships.

The consequent question you might ask now is:

What does it mean to maintain the relationships between datapoints?

Without involving complex math: for LLE, preserving the relationships between data points means maintaining the intrinsic geometric structure of the data in a lower-dimensional space. Imagine you have a high-dimensional dataset where certain points are close to each other and share some kind of meaningful relationship. The relationship may depend on several factors like proximity, similarity, connectivity, etc. What matters is that LLE aims to keep these relationships in place when reducing the dimensionality of the data.

As a result, LLE captures the underlying structure of the data, especially when the relationships between points are complex and (or) non-linear.

To compare the LLE algorithm with kPCA, I will apply LLE to the same dataset and plot the results:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.decomposition import KernelPCA

X_swiss, t = make_swiss_roll(n_samples=1500, noise=0.3, random_state=2)

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=15)
X_pca_swiss_lle = lle.fit_transform(X_swiss)

pca_swiss = KernelPCA(n_components=2, kernel='sigmoid', gamma=1e-3, coef0=1, fit_inverse_transform=True)
X_pca_swiss_pca = pca_swiss.fit_transform(X_swiss)

# Plot the two embeddings side by side (minimal plotting sketch)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca_swiss_lle[:, 0], X_pca_swiss_lle[:, 1], c=t, s=5)
axes[0].set_title('LLE')
axes[1].scatter(X_pca_swiss_pca[:, 0], X_pca_swiss_pca[:, 1], c=t, s=5)
axes[1].set_title('Kernel PCA (sigmoid)')
plt.show()

Image by the author.

Unlike kPCA, LLE appears to better "understand" the patterns and relationships in the data and is able to unroll the Swiss Roll dataset.
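To go beyond visual inspection, one rough way to quantify "preserving local relationships" is to check how many of each point's nearest neighbors in the original space remain among its nearest neighbors in the embedding. This is an illustrative heuristic of my own, not part of the LLE algorithm:

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

X_swiss, t = make_swiss_roll(n_samples=1500, noise=0.3, random_state=2)
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=15).fit_transform(X_swiss)

def neighbor_overlap(X_high, X_low, k=10):
    # For each point, compare its k nearest neighbors before and after the embedding
    idx_high = NearestNeighbors(n_neighbors=k + 1).fit(X_high).kneighbors(X_high, return_distance=False)[:, 1:]
    idx_low = NearestNeighbors(n_neighbors=k + 1).fit(X_low).kneighbors(X_low, return_distance=False)[:, 1:]
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_high, idx_low)])

print(f"Average neighborhood overlap: {neighbor_overlap(X_swiss, X_lle):.2f}")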

PCA vs LLE

I want to conclude this post about advanced Dimensionality Reduction techniques by delineating the differences between the most commonly used algorithms and explaining when it is preferable to apply one over the other.

PCA and LLE first diverge in the approach they take. While LLE looks at the dataset from a local perspective, focusing on preserving the relationships only between neighbouring points, PCA takes a global viewpoint, capturing the overall variance of the dataset.

Their range of application also differs. By focusing on small neighbourhoods of data, LLE is able to capture complex, non-linear patterns. PCA, even in its kPCA variant, may instead fall short at capturing more intricate structures, as the Swiss Roll example showed.

Finally, the two algorithms also differ in how they are affected by noise. By considering all data points at once, PCA tends to be more heavily impacted by noisy datasets, potentially leading to accuracy losses.

Image by the author.
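If you want to explore this noise sensitivity yourself, a simple experiment is to re-fit both models on increasingly noisy versions of the Swiss Roll and compare the embeddings visually. The noise levels below are arbitrary choices, and the sketch makes no claim about which model comes out ahead:

import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

noise_levels = [0.3, 1.0, 2.0]
fig, axes = plt.subplots(len(noise_levels), 2, figsize=(8, 10))

for row, noise in enumerate(noise_levels):
    # Regenerate the dataset with a different amount of noise
    X_noisy, t_noisy = make_swiss_roll(n_samples=1500, noise=noise, random_state=2)
    X_kpca = KernelPCA(n_components=2, kernel='sigmoid', gamma=1e-3, coef0=1).fit_transform(X_noisy)
    X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=15).fit_transform(X_noisy)
    axes[row, 0].scatter(X_kpca[:, 0], X_kpca[:, 1], c=t_noisy, s=5)
    axes[row, 0].set_title(f'kPCA, noise={noise}')
    axes[row, 1].scatter(X_lle[:, 0], X_lle[:, 1], c=t_noisy, s=5)
    axes[row, 1].set_title(f'LLE, noise={noise}')

plt.tight_layout()
plt.show()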

These distinct features make LLE and PCA suited for different tasks and types of datasets. LLE excels at capturing intricate local structures. PCA, on the other hand, is preferred when a global representation of the dataset and the preservation of its overall variance are crucial.

Finally, as the topic of Dimensionality Reduction goes far beyond what a single Medium post can explain, I recommend diving deeper into this fundamental Machine Learning topic by reading the material provided in the "References" section.


If you liked this story, consider following me to be notified of my upcoming projects and articles!

Here are some of my past projects:

Outlier Detection with Scikit-Learn and Matplotlib: a Practical Guide

Social Network Analysis with NetworkX: A Gentle Introduction

Use Deep Learning to Generate Fantasy Names: Build a Language Model from Scratch

References

