Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 1)


In the last installment of the ‘Courage to Learn ML' series, our learner and mentor focused on two essential pillars of DNN training: gradient descent and backpropagation.

Their journey began with a look at how gradient descent is pivotal in minimizing the loss function. Curious about the complexities of computing gradients in deep neural networks with multiple hidden layers, the learner then turned to backpropagation. By decomposing backpropagation into three components, the learner saw how it uses the chain rule to calculate gradients efficiently across these layers. During this Q&A session, the learner also questioned the importance of understanding these complex processes in an era of advanced, automated deep learning frameworks such as PyTorch and TensorFlow.


This is the first post of our deep dive into Deep Learning, guided by the interactions between a learner and a mentor. To keep things digestible, I've decided to break down my DNN series into more manageable pieces. This way, I can explore each concept thoroughly without overwhelming you.

Today's discussion addresses that question by focusing on the challenge of unstable gradients, a major factor making DNN training difficult. We'll explore various strategies to tackle this issue, using the analogy of running a miniature ice cream factory, aptly named DNN (short for Delicious Nutritious Nibbles), to illustrate effective solutions. In subsequent posts, the mentor will cover each solution in detail, showing how it is implemented in the PyTorch framework.

Diving into the world of DNNs, we're going to use a unique analogy that I've been fond of – envisioning a DNN as an ice cream factory. Curiously, I once asked ChatGPT what ‘DNN' might stand for in the realm of ice cream, and after 5 minutes of thinking, it suggested "Delicious Nutritious Nibbles." I loved it! So, I've decided to embrace this playful analogy to help demystify those daunting DNN concepts with a dash of sweetness and fun. As we delve into the depths of deep learning, imagine we're managers running a mini ice cream factory called DNN. Who knows, maybe one day DNN ice cream will become a reality. It would be a real treat for ML/DL enthusiasts to enjoy!

Let's begin this journey by learning the basic structure of training a NN in PyTorch. As always, let's rephrase it as a fundamental question –

Can you illustrate backpropagation and gradient descent in PyTorch?

The basic code for training a NN in PyTorch helps clarify the relationship between, and roles of, gradient descent and backpropagation.

import torch

# Define the model (CustomizedModel, X_train, and y_train are assumed to be defined elsewhere)
model = CustomizedModel()
# Define the loss function
loss_fn = torch.nn.L1Loss()
# Define the optimizer
optimizer = torch.optim.SGD(params=model.parameters(),
                            lr=0.01, momentum=0.9)

epochs = 10
for epoch in range(epochs):
    # Step 1: Set the model to training mode
    model.train()

    # Step 2: Forward pass - make predictions
    y_pred = model(X_train)

    # Step 3: Calculate the loss
    loss = loss_fn(y_pred, y_train)

    # Step 4: Backpropagation - calculate gradients
    optimizer.zero_grad()  # clears old gradients
    loss.backward()  # performs backpropagation to compute the gradients of the loss w.r.t. model parameters

    # Step 5: Gradient descent - update parameters
    optimizer.step()

In this code snippet, loss.backward() is what performs backpropagation. The process begins from the loss, not the optimizer, because backpropagation's purpose is to compute the gradients of the loss with respect to each parameter. Once these gradients are determined, the optimizer uses them to update each parameter with a gradient descent step. The optimizer.step() method, as I view it, is appropriately named ‘step' to indicate a single update of all the model parameters. This can be thought of as taking a step along the loss surface during the optimization process.
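To make this concrete, here's a minimal sketch (using a toy torch.nn.Linear model and made-up data, not the CustomizedModel above) showing that loss.backward() is what fills each parameter's .grad, and optimizer.step() is what consumes it:

import torch

model = torch.nn.Linear(3, 1)          # toy stand-in for CustomizedModel
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.L1Loss()

x, y = torch.randn(4, 3), torch.randn(4, 1)   # made-up data

optimizer.zero_grad()
loss = loss_fn(model(x), y)
print(model.weight.grad)               # None - nothing has been backpropagated yet
loss.backward()                        # backpropagation populates .grad for every parameter
print(model.weight.grad.shape)         # torch.Size([1, 3])
optimizer.step()                       # one gradient descent step using those .grad values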

For a better understanding of what "a step along the loss surface" means, I strongly suggest reading my post on gradient descent. In it, I use a video game analogy to vividly demonstrate how we navigate the loss surface. Also, you will love the title picture drawn by ChatGPT.

Can you explain the concepts of vanishing and exploding gradients in neural networks? What are the primary causes of these issues?

Computing gradients in a deep neural network (DNN) is complex, primarily due to the extensive number of parameters spread across many hidden layers. Backpropagation simplifies this calculation by computing the gradient of the loss with respect to each parameter using the chain rule. Because of the chain rule, this process involves derivatives computed through successive multiplications, where values can change drastically. The result is gradients that may become extremely small or extremely large as we move toward the lower layers (those close to the input layer). It's akin to tracing back through a series of intricate water filters that enhance water quality layer by layer: the early layers' influence is hard to evaluate because it gets altered by every subsequent layer. Such unstable gradients introduce significant challenges when training deep neural networks, leading to the phenomena known as the "vanishing" (extremely small) and "exploding" (excessively large) gradient problems. This instability makes DNN training challenging and restricts architectures to fewer layers. In other words, without addressing this issue, DNNs cannot achieve the desired depth.
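To see why successive multiplications are the culprit, here's a toy numeric sketch (not a real network): treating each layer as contributing a single derivative factor, a chain of factors below 1 shrinks toward zero, while a chain of factors above 1 blows up.

depth = 30
small_factor, large_factor = 0.25, 4.0     # hypothetical per-layer derivative magnitudes

vanishing = small_factor ** depth          # ~8.7e-19 - effectively zero by the lowest layers
exploding = large_factor ** depth          # ~1.2e+18 - numerically enormous

print(vanishing, exploding)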

If you wonder why backpropagation involves successive multiplications, I suggest checking out my earlier article on backpropagation. In that post, I explain the calculation by breaking it down into three components and use code examples to make it easier to understand. I also explore why researchers generally favor deeper and narrower DNNs over wider and shallower ones.

I get that vanishing gradients can be an issue because the lower layers' parameters barely update, making learning difficult with such small gradients. But why are exploding gradients problematic too? Wouldn't they provide substantial updates for the later layers?

You're correct in noting that large gradients can lead to significant updates in parameters. However, this isn't always the case, and a large update is not always beneficial. Let me explain why exploding gradients prevent effective training in deep neural networks:

  1. Exploding gradients don't always mean large updates; they can cause numerical instability and overflow. When training on computers, excessively large gradients can cause numerical overflow. For example, consider a network where the gradient at a certain layer is 10⁷, a large but feasible number. During backpropagation, this value, multiplied by other large values due to the chain rule, can exceed the limits of standard floating-point representation, resulting in overflow and the gradient becoming NaN (Not a Number). When updating parameters using parameter = parameter - learning_rate * gradient, if the gradient is NaN, the parameters also become NaN. This corrupts the forward pass of the network, making it incapable of generating useful predictions (see the short sketch after this list).
  2. Large updates aren't always beneficial. Large gradient values can lead to dramatic changes in parameters, causing oscillations and instability in the parameter updates. This can result in longer training times, difficulty converging, and the potential to overshoot the global minimum of the loss function. Coupled with a relatively large learning rate, these oscillations can significantly slow down the training process.
  3. Large gradients aren't necessarily informative. It's a misunderstanding that larger gradients are always more informative and lead to more meaningful updates. In fact, if gradients become excessively large, they can overshadow the contributions of smaller, yet more meaningful, gradients. Imagine navigating the loss landscape, where large gradients, often influenced by outliers or noise, can misguide us in choosing our next step. Additionally, large gradients may benefit some layers but not others. For instance, large gradients in upper layers might lead to extremely small derivatives in layers using sigmoid activation functions, resulting in an imbalanced learning process within the network.
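Here's the short sketch mentioned in point 1: a toy illustration (not taken from a real training run) of how float32 overflow turns a gradient, and then the parameter it updates, into inf/NaN.

import torch

g = torch.tensor(1e7, dtype=torch.float32)
for _ in range(6):
    g = g * 1e7                        # repeated multiplication, as in the chain rule
print(g)                               # tensor(inf): exceeded float32's ~3.4e38 limit

nan_grad = g - g                       # inf - inf is NaN
param = torch.tensor(0.5) - 0.01 * nan_grad
print(param)                           # tensor(nan): the parameter is now corrupted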

Why is it that the oscillation in weight updates caused by exploding gradients turns out to be harmful? I understand that Stochastic Gradient Descent (SGD) with small batch sizes also leads to oscillation, but why doesn't that cause similar issues?

You're right in noting that Stochastic Gradient Descent (SGD) with small batch sizes introduces some instability into the weight updates. However, this type of oscillation is both controllable and relatively minor, primarily because it originates from the data itself. A small subset of training data typically won't deviate too far from the behavior of the full dataset, so the noise is manageable. Additionally, this manageable level of noise can enhance the model's generalizability by making it less sensitive to minor data variations. Essentially, while navigating the loss surface, SGD might make decisions that slightly deviate from the most optimal path to the global minimum, but these suboptimal steps aren't drastically far from the ideal ones. This also reduces the likelihood of getting stuck in plateaus and saddle points, enabling movement away from these areas towards potentially better local minima or even the global minimum in the complex loss landscape of a DNN.

On the other hand, the oscillation caused by exploding gradients is of a different nature. Due to exploding gradients, we might make substantially large updates to the parameters, leading to significant movements on the loss surface. These updates can be so large that they catapult the model to a position quite distant from where it was initially, or even farther from the global minimum. Unlike with SGD, where the steps are small and keep us close to our original position, exploding gradients can negate our previous progress, forcing us to redo all the hard work to find the global minimum.

To visualize it, think of it like playing an RPG video game, where we use magic (akin to an optimizer) to guide our movements towards the treasure located at the lowest point of the map. With the magic of SGD, we might stray from the best route, but we generally head towards the treasure; moving fast enough, we'll likely reach it. But with exploding gradients, it's like being thrown randomly to a new, unfamiliar place on the map, requiring us to restart our exploration. In the worst-case scenario, we might end up at the farthest point from the treasure, making it nearly impossible to reach our goal with limited time and computational resources.

So how do we know whether a gradient is too large or too small? How do we detect these problems?

Detecting unstable gradients is important for ensuring that different layers learn effectively and at a consistent pace. If you suspect your model is suffering from exploding or vanishing gradients, consider these methods to identify the problem:

  • Tracking the training loss. Observing the training loss across epochs is a straightforward and valuable method. A sudden spike in the loss, or its transition to NaN, might indicate that some weights have been updated too aggressively, potentially due to large gradients causing numerical overflow. This scenario often points to exploding gradients. Conversely, a plateau in the loss curve, or minuscule decreases over several epochs, could be a sign of vanishing gradients, suggesting that the weights are barely updating due to very small gradients. However, interpreting these signs isn't always straightforward and requires ongoing evaluation to determine whether training is progressing or not.
  • Monitoring gradients directly. A more direct approach involves keeping an eye on the gradients themselves. Since inspecting each gradient individually is impractical in large, deep networks, calculating the gradient norm offers a simplified yet effective alternative. The norm, which can be computed for all layers collectively or for each layer individually, focuses on the magnitude of the gradients and provides a single metric to compare over time. This method is particularly useful for identifying exploding gradients, as excessively large values stand out immediately. For vanishing gradients, small norms might hint at an issue, but they're less definitive. Visualizing gradient distributions through histograms can also highlight outliers and extreme values (see the monitoring sketch after this list).
  • Monitoring weight updates and activation outputs. Since unstable gradients directly impact weight updates, tracking changes in the weights is a logical step. Sudden, significantly large changes or shifts to NaN in the weights indicate exploding gradients; if the weights remain static over time, vanishing gradients could be the reason. Similarly, analyzing the activation outputs of each layer can offer additional insight into how weights and gradients are behaving.
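As a concrete example of the second bullet, here's a minimal monitoring sketch. It assumes model and loss come from a training loop like the one shown earlier; after loss.backward(), each parameter's gradient norm can be inspected per layer and combined into a total norm.

total_sq = 0.0
for name, p in model.named_parameters():
    if p.grad is not None:
        layer_norm = p.grad.norm(2).item()
        total_sq += layer_norm ** 2
        print(f"{name}: grad norm = {layer_norm:.3e}")   # huge or near-zero values are red flags
print(f"total grad norm = {total_sq ** 0.5:.3e}")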

For practical implementation, TensorBoard stands out as a popular tool for visualizing training progress. It supports both TensorFlow and PyTorch and allows for detailed tracking of losses, gradients and more. It can be a great tool to identify gradient-related issues.
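For instance, a sketch of logging these statistics with PyTorch's built-in TensorBoard writer might look like this; it assumes the epoch, loss, and model variables from the training loop above, and the log directory name is just a placeholder.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/dnn_gradients")     # hypothetical log directory

# Inside the training loop, after loss.backward():
writer.add_scalar("train/loss", loss.item(), epoch)
for name, p in model.named_parameters():
    if p.grad is not None:
        writer.add_scalar(f"grad_norm/{name}", p.grad.norm(2).item(), epoch)
        writer.add_histogram(f"grad_hist/{name}", p.grad, epoch)

writer.close()                                           # call once, after training finishes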

Are you suggesting that, in DNNs, we aim for all layers to learn at the same pace?

We don't actually expect every layer in a DNN to learn at the same rate. Our goal is to have consistent and balanced weight updates across the entire network. Vanishing gradients, for instance, pose a problem because they lead to the upper layers (those closer to the output) learning more quickly and converging earlier due to larger gradients, while the lower layers (those closer to the input) lag behind with almost random weight adjustments due to small gradients. This can result in the network converging to a less-than-ideal local minimum, with only the upper layers properly trained and the lower layers nearly unchanged. In general, we want to make sure each layer's update rate aligns with its impact on the final prediction. The problem with both vanishing and exploding gradients is that they break this balance, causing layers to be updated disproportionately. It's similar to trying to walk left by only turning your body without moving your feet: you won't get very far.

The objective, then, is to achieve stable and uniform weight updates throughout the network. This is where adaptive learning rate optimizers come into play, offering significant benefits. They dynamically adjust the learning rate based on historical gradients, aiding in more efficient loss reduction.

Practically, frameworks like PyTorch and TensorFlow allow you to specify different learning rates for each layer. This capability is particularly beneficial when fine-tuning pretrained models or during transfer learning, as it allows customized learning rate adjustments per layer to suit specific requirements. Here are some useful discussions on Stack Overflow and the PyTorch forums.
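As a quick illustration, here's a sketch of per-layer learning rates via PyTorch parameter groups; the submodule names model.features and model.classifier are hypothetical and depend on how your model is defined.

import torch

optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": 1e-4},    # pretrained lower layers: small steps
        {"params": model.classifier.parameters(), "lr": 1e-2},  # newly added upper layers: larger steps
    ],
    momentum=0.9,                                               # applies to both groups
)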

How do we generally address the issue of unstable gradients when training DNNs?

Imagine our DNN model as a mini-ice cream factory, where ingredients flow through various departments to produce ice cream that's then rated by customers at the end of the production line. Each department in this mini-factory represents a layer in the DNN, and the customer feedback on the ice cream's taste represents the loss gradient used to improve the recipe through backpropagation.

However, our factory faces a challenge: the customers' feedback isn't being communicated back effectively through the departments. Some departments overreact to the feedback, making drastic changes (akin to exploding gradients), while others overlook it, barely making any adjustments (akin to vanishing gradients).

As the factory managers, our goal is to ensure that feedback is processed appropriately at each stage to consistently produce ice cream that delights our customers. Here are the strategies we can employ:

  • Set up the factory properly (Weight Initialization). The first step is ensuring the factory operates smoothly and produces quality ice cream, and setting it up correctly is quite important. We need to ensure that the initial settings of our production line (akin to the weights in a DNN) are just right, neither too strong nor too weak, to produce a base flavor that meets general customer preferences. This is like choosing a proper weight initialization in DNNs to avoid excessively small or large gradients, ensuring a stable flow of adjustments based on feedback. (A short PyTorch sketch after this list illustrates this and several of the other strategies.)
  • Quality control at the beginning and during production (Batch Normalization). As the ingredients mix and progress through the production line (the layers in a DNN), we introduce quality checkpoints to standardize the mix at various stages. This ensures that each batch of semi-finished product remains consistent, preventing any department from producing ice cream that is too varied, which would make our customer feedback misleading and our adjustments less effective. This mirrors batch normalization in DNNs, where layer outputs are normalized to maintain a stable distribution of activations, aiding a smoother gradient flow.
  • Adjust the feedback system to avoid overreactions (Gradient Clipping). When customer feedback arrives, it's crucial that no department overreacts and makes drastic changes based on one batch of feedback, which could throw off the entire production line. By implementing a system where extreme feedback (either too positive or too negative) is moderated, or clipped, you ensure that changes are gradual and controlled, akin to gradient clipping in DNNs, which prevents gradients from exploding. In a similar spirit, special network architectures such as ResNet, with its skip connections, can also mitigate vanishing gradients.
  • Optimizing workflow and feedback paths (Activation Functions). In our DNN ice cream factory, some departments act as messengers, shuttling semi-finished products forward and customer feedback backward. They're key to ensuring the final product turns out correctly: if they don't transfer the intermediate products accurately, the end result could be a batch of ice cream that misses the mark. The way they handle customer feedback and pass it between departments matters just as much. If feedback is overlooked, important information might be missed, while overly amplified feedback can cause overreactions and dramatic changes in the factory. So, choosing the right communicators (activation functions) ensures the smooth transfer of both products and feedback and keeps the production line efficient and responsive. Likewise, choosing the right activation functions in a DNN gives us accurate predictions and stable gradients, avoiding the extremes of vanishing or exploding during backpropagation and ensuring effective updates for all parameters.
  • Tailor feedback response intensity by department (Adaptive Learning Rates). Finally, not all departments should react to feedback with the same intensity. For example, the flavor department might need to be more sensitive to taste feedback than the packaging department. By adjusting how much each department learns from feedback (akin to setting adaptive learning rates in different layers of a DNN), you can fine-tune the factory's response to customer preferences, ensuring more targeted and efficient improvements.
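To ground the analogy, here's a minimal PyTorch sketch (a made-up regression model, not a recommended recipe) showing how several of these strategies look in code: He (Kaiming) weight initialization, batch normalization, a ReLU activation, gradient clipping, and an adaptive optimizer.

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),            # quality checkpoint: normalize the layer's outputs
    nn.ReLU(),                     # activation chosen to keep gradients from saturating
    nn.Linear(32, 1),
)

# Set up the factory properly: He initialization pairs well with ReLU layers.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # adaptive learning rates
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 16), torch.randn(8, 1)                   # made-up data
optimizer.zero_grad()
loss_fn(model(x), y).backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # moderate extreme feedback
optimizer.step()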

Next time, I'll cover the topic of activation functions and weight initialization methods in detail. Rather than simply running through the formulas and listing their pros and cons, I plan to share the essential questions that have shaped my understanding – questions that frequently go unasked but are vital for grasping why each method and function is shaped the way it is. These discussions might even empower us to innovate our own functions and weight initialization methods.

For those keen to journey with me through this series, feel free to follow along. Your engagement – through claps, comments, and follows – fuels this endeavor. It's not just encouragement; it's the very heartbeat of this educational series. As I continue to refine my grasp on these topics, I often revisit and update my past posts, enriching them with new insights. So, stay tuned for more!

(Unless otherwise noted, all images are by the author)


If you liked the article, you can find me on LinkedIn.
