How does optimization in deep learning work?

Tavish Aggarwal

March 15, 2025

Mathematical optimization is an extremely powerful field of mathematics that underpins much of what data scientists, implicitly or explicitly, use regularly. Nearly all machine learning algorithms make use of optimization theory to obtain model convergence.

Take, for example, a classification problem: we seek to minimize the "log loss" by choosing the optimal parameters (weights) of the model. In general, mathematical optimization can be thought of as the mechanism by which algorithms learn. A robust understanding of mathematical optimization is an extremely beneficial skill to have in the data scientist's toolbox.

It enables the data scientist to understand many of the algorithms used today more deeply and to solve a vast array of unique optimization problems, which makes this topic crucial to understand.

Optimization methods used in deep learning

Almost all machine learning algorithms, with exceptions such as decision trees and random forests, essentially boil down to an optimization problem. An optimization problem involves defining an objective function that needs to be minimized to get the best results.

For this blog, let's take the example of training a neural network, since neural network training is an optimization problem where we learn the set of weights and biases that minimizes the loss function. It is an "unconstrained optimization problem" as we place no constraints on the values of the weights and biases.

Issues with the objective function in neural networks

The loss function in neural networks is defined in terms of the network's prediction \(F(x_i)\) (which depends on the weights and biases) and the actual output \(y_i\). The average loss across all data points is represented as \(G(w,b)\).

$$G(w,b) = {1 \over n}\sum^n_{i=1}L(F(x_i), y_i)$$
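To make the formula concrete, here is a minimal NumPy sketch; the squared-error per-example loss is just an illustrative stand-in for \(L\):

```python
import numpy as np

def per_example_loss(prediction, target):
    # Illustrative stand-in for L(F(x_i), y_i); any differentiable loss would do.
    return (prediction - target) ** 2

def empirical_loss(predictions, targets):
    # G(w, b) = (1/n) * sum over the n training points of L(F(x_i), y_i)
    return np.mean([per_example_loss(p, y) for p, y in zip(predictions, targets)])

# Toy usage: predictions F(x_i) from some network vs. the actual outputs y_i.
print(empirical_loss(np.array([0.9, 0.2, 0.7]), np.array([1.0, 0.0, 1.0])))
```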

Refer to the post, In-depth understanding of Neural Networks and its working, if you want to understand in depth how a loss function is defined.

Empirical loss vs Theoretical loss

This is called the empirical loss, which is different from the theoretical loss. The theoretical loss is the loss we would incur if we considered every possible data point, whether or not it is part of the training dataset. However, we are not entirely wrong in using the empirical loss, as studies show that the empirical loss approaches the theoretical loss when the number of data points 'n' is large.

Okay, so we are almost good here. However, there is another issue in choosing the loss function \(L\) inside the objective function \(G(w,b)\) that we have defined above for neural networks. We cannot define the loss as a simple misclassification count \(\sum_{i=1}^N(y_i \ne p_i)\), as it doesn't tell us how the loss changes with a small change in the network's output. Also, since neural networks train using backpropagation, we need a differentiable loss function.

Surrogate functions

Therefore we use surrogate functions, which are functions that mimic the quantity we actually want to measure. For example, the cross-entropy loss for a classification scenario is defined as \(-\sum_{i=1}^C y_i \log(p_i)\). Though the \(p_i\) are not discrete values, we use them to find the discrete predicted class. Let's delve more into the benefits of using such functions.

Cross-entropy loss function

There are many advantages to using surrogate functions like cross-entropy. It helps in differentiating between the wrong decisions that the model makes. If the loss function were a simple comparison between the predicted and actual outputs, all wrong decisions would be on the same pedestal.

Cross-entropy loss assigns a score (called "loss") that reflects how far off the predicted probabilities \(p_i\) are from the actual correct label \(y_i\).

Now, when calculating cross-entropy loss, there's a mathematical operation called a "logarithm" involved. This causes some interesting behavior:

  1. If the predicted probability \(p_i\) is very small (close to zero), the "push" to correct it becomes much stronger because the loss increases rapidly. It's like saying, "This prediction is way off, so we need to work hard to fix it."
     
  2. The "push" is represented by the differential (rate of change) of the loss. If \(p_i\) decreases (gets further from \(y_i\), the differential grows dramatically (For example, it increases by 100 times if \(p_i\) reduces by 10 times). This ensures the model adjusts more strongly for highly incorrect predictions.

The farther the predicted probability \(p_i\) is from what it should be (\(y_i\)), the bigger the adjustment the model makes during training to correct it. This helps in measuring how wrong or right the model is for a discrete-class classification problem.
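A small numerical sketch (plain Python/NumPy, made-up probabilities) shows both effects: as the predicted probability for the true class shrinks, the loss \(-\log(p_i)\) and the gradient magnitude \(1/p_i\) both grow rapidly:

```python
import numpy as np

def cross_entropy_true_class(p):
    # Contribution of the true class to the cross-entropy loss: -log(p_i).
    return -np.log(p)

for p in [0.9, 0.5, 0.1, 0.01]:
    loss = cross_entropy_true_class(p)
    push = 1.0 / p  # |d/dp (-log p)| = 1/p: the "push" to correct the prediction
    print(f"p = {p:<5} loss = {loss:.2f}  gradient magnitude = {push:.1f}")
```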

Another advantage of the cross-entropy loss function is generalizability. When we use cross-entropy loss, the model learns to assign probabilities to different outcomes rather than just saying "right" or "wrong." This is helpful because real-world data often includes situations the model hasn't seen before (data outside the training set). By focusing on probabilities, the model can make more flexible and nuanced predictions for these unfamiliar cases.

On the other hand, a "true loss function" that only considers whether something is correctly or incorrectly classified is stricter and less adaptable. It doesn't teach the model much about how confident or uncertain it should be in its predictions. This lack of flexibility can make it harder for the model to handle new, unseen data points effectively.

Training a neural network with the true loss often leads to overfitting. Overfitting can also happen with the cross-entropy loss, but there it can be addressed using an early stopping criterion: we stop training when the validation error starts increasing, since the training (empirical) error will keep decreasing as we increase the number of epochs.
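A minimal sketch of the early-stopping rule (the training step and the validation-error curve below are hypothetical placeholders; in practice these would come from your own training loop):

```python
def train_with_early_stopping(train_step, validation_error, max_epochs=100):
    # Stop as soon as the validation error turns upward, even though the
    # training (empirical) error would keep decreasing with more epochs.
    best = float("inf")
    for epoch in range(max_epochs):
        train_step()
        val = validation_error()
        if val > best:
            return epoch      # validation error started increasing: stop here
        best = val
    return max_epochs

# Toy usage with a hypothetical U-shaped validation-error curve (minimum around epoch 5).
curve = iter([0.9, 0.7, 0.55, 0.5, 0.48, 0.47, 0.49, 0.6])
print(train_with_early_stopping(lambda: None, lambda: next(curve)))  # stops at epoch 6
```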


Alright. Having understood the issues with the objective function, let's move forward and discuss the different optimization techniques available.

Iterative Optimization Schemes

In general, there are three main categories of iterative optimization schemes. Namely, zero-order, first-order, and second-order, which make use of local information about the function from no derivatives, first derivatives, or second derivatives, respectively. To use each iterative scheme, the function \(f(x)\) must be a continuously differentiable function to the respective degree.

Zero-order iterative schemes

Zero-order iterative schemes are closely aligned with grid search, where a search over a range of possible values of \(x\) is performed to obtain the minimum function value. These methods tend to be much more computationally expensive than methods that utilize higher-order information.
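As a minimal sketch (NumPy, a toy one-dimensional function), a zero-order scheme simply evaluates the function over a grid of candidate values and keeps the best one, using no derivative information at all:

```python
import numpy as np

def f(x):
    # Toy objective; only function evaluations are used, never derivatives.
    return x**2 + 3*x + 2

candidates = np.linspace(-5.0, 5.0, 1001)      # grid over a chosen range of x
best_x = candidates[np.argmin(f(candidates))]  # keep the candidate with the smallest f(x)
print(best_x)                                  # close to the true minimiser x = -1.5
```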

First-order iterative schemes

First-order iterative schemes are iterative schemes that utilize local information of the first derivatives of the objective function. Most notably, gradient descent methods fall under this category. I have discussed gradient descent in the post, Simple Linear Regression detailed Explanation. Let's delve further by briefly defining how gradient descent works from an optimization perspective.

Gradient Descent

The deep learning optimization problem is an unconstrained optimization problem where the objective is to find the minimum of the function.

Definition of Gradient Descent

The gradient measures the rate of change of the function value; its negative gives the direction in which x and y (considering a two-dimensional function) should move to bring down the value of the function.

In other words, for a function \(f(\theta_1, \theta_2, \theta_3, . . . . \theta_n)\) in n dimensions, the gradient direction is defined as follows:

$$-({\partial f \over \partial \theta_1}, {\partial f \over \partial \theta_2}, . . . ., {\partial f \over \partial \theta_n})$$

This means we'll move in the directions \(-{\partial f \over \partial \theta_1}, -{\partial f \over \partial \theta_2}, . . . ., -{\partial f \over \partial \theta_n}\) for the individual variables \(\theta_1, \theta_2, . . . ., \theta_n\) respectively. Hence, the update step becomes:

$$(\theta_i)_{new} = (\theta_i)_{old} - \alpha.{\partial f \over \partial\theta_i}$$

where \(\alpha\) is the learning rate. It is an important parameter as it decides how big the step should be in the direction where the function value is decreasing.

We take the learning rate \(\alpha\) into consideration because it decides how big a step we take in the negative gradient direction. As we move a finite step from our current point to a new point, the gradient changes; \(\alpha\) helps control the step size so that we don't keep jumping back and forth.
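Here is a minimal sketch of this update rule on a toy two-dimensional function \(f(\theta_1, \theta_2) = \theta_1^2 + \theta_2^2\) (the function and the learning rate are illustrative assumptions):

```python
import numpy as np

def grad_f(theta):
    # Gradient of f(theta_1, theta_2) = theta_1^2 + theta_2^2.
    return 2.0 * theta

theta = np.array([4.0, -3.0])   # arbitrary starting point
alpha = 0.1                     # learning rate: too large overshoots, too small crawls

for _ in range(100):
    theta = theta - alpha * grad_f(theta)   # theta_new = theta_old - alpha * gradient

print(theta)  # approaches the minimum at (0, 0)
```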

Gradient Descent Process

Let's say we have a one-dimensional function such as \(f(x) = x^2 + 3x + 2\). Finding the minimum/maximum is easy: we differentiate the function with respect to x and equate the derivative to 0. Then, we take the second derivative to determine whether this extremum is a maximum or a minimum. A second derivative greater than 0 implies a minimum and vice-versa.

NOTE: When multiple extrema of a function exist (i.e., multiple minima or maxima), care must be taken to determine which is the global extremum.

In this case, we get x = -3/2. But to do the same for a function with more than one argument, \(f(x_1, x_2, x_3, . . ., x_d)\), we need to find the partial derivatives \({\partial f \over\partial x_1}\), \({\partial f \over\partial x_2}\), . . ., \({\partial f \over\partial x_d}\) and equate them all to 0.

This will give us a local minimum but we are not guaranteed a global minimum. Hence, \({\partial f \over\partial x_i} = 0 \) for i from 1 to \(d\) is a necessary but insufficient condition to find a global minimum.
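For the one-dimensional example above, a quick symbolic check (using SymPy) of the critical point and the second derivative looks like this:

```python
import sympy as sp

x = sp.symbols('x')
f = x**2 + 3*x + 2

critical_points = sp.solve(sp.diff(f, x), x)  # solve df/dx = 0
print(critical_points)                         # [-3/2]
print(sp.diff(f, x, 2))                        # 2, which is > 0, so x = -3/2 is a minimum
```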

So how can we find the global minimum using the partial derivatives of the function? Let's first understand why even finding a local minimum is a relatively difficult problem.

Critical and saddle points: how can gradient descent be trapped in saddle points?

The condition \({\partial f \over\partial x_i} = 0 \) is necessary to get a local minimum, but at the same time, it is also a necessary condition for a maximum. All points where \({\partial f \over\partial x_i} = 0 \) for all \(x_i\), i.e., in every direction, are known as critical points, which can be either local minima or local maxima of the function.

When we have a d-dimensional feature vector, a critical point can behave like a maximum or a minimum independently along each of the 'd' directions \(x_1, x_2, ......, x_d\). Let's denote behaving like a maximum along a direction by '+1' and like a minimum by '-1'.

Hence, a local minimum is a place where the pattern across the directions is (-1, -1,........., -1). In simple words, a local minimum is defined as a point that is a minimum in every direction.

Now, there is only a 1 in \(2^d\) chance that a critical point is a minimum along all dimensions. All the remaining points, which are minima along some dimensions and maxima along others, are called saddle points. The name is inspired by the horse saddle.

Saddle point

As we can see in the diagram shown above, the point is minimum in one direction (A - B) and maximum in another direction (D - C).

NOTE: For a high-dimensional function, it becomes exponentially unlikely for a critical point to be a local minimum, as there is a high probability of the point turning out to be a saddle point.

How can we use saddle points to our advantage?

In the gradient descent scheme of things, there is a very high chance of the critical point being a saddle point. This is good as the critical point being a saddle point ensures that there is a way out. It ensures that we can move further down to a minimum by changing the direction of descent and hence not be stuck.

Now, there can be good and bad local minima in an optimization problem. In case we get stuck in a local minimum, the chance that it is a bad local minimum is very low. The reason is that, due to the high chance of meeting a saddle point (which always offers a way out), most of the time we will keep descending and hence will stop at a minimum close to the global minimum, i.e., at a good local minimum.

Next, let's roll back a bit and understand how the most commonly used version of gradient descent, i.e., minibatch gradient descent, helps optimize the minimization process.

Minibatch Gradient Descent

Earlier, we developed an understanding of the problems with the objective function and how the cross-entropy loss function can help. Now, let's understand a simple way to optimize the minimization process: minibatch gradient descent.

The minibatch gradient descent involves training in minibatches. In other words, if we have a dataset of 20,000 data points, we train the network in minibatches of 50 data points.

The loss function also changes from \({1 \over n}\sum^n_{i=1}L(F(x_i), y_i)\) to \({1 \over m}\sum^m_{i=1}L(F(x_i), y_i)\), where 'm' is the size of the minibatch and 'n' is the total number of data points, with the assumption that both loss functions give approximately the same loss. Here, n = 20,000 and m = 50. A weight update is performed after the feedforward and backpropagation of each single batch. Hence, there will be 20,000/50 = 400 weight updates per epoch.

This helps us in the following ways:

  1. The minibatch can be processed in one go on the GPU, as the batch size is small. Hence, the time required for a single update becomes far less.
  2. The minibatch can further be divided into smaller minibatches, and the feedforward and backpropagation can be performed on these smaller minibatches in parallel. The gradient values for a minibatch's update are then the average of the gradients calculated on these smaller minibatches.

Though minibatch gradient descent is a great approach and provides computational advantages, it can introduce bias.

For example, the first minibatch may contain taller people, the second minibatch may contain shorter people, and so on. This would hamper the training process. To overcome this problem, we reshuffle the entire training dataset once and then perform minibatch gradient descent, so that each minibatch represents the entire dataset as well as possible and does not show bias.
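Here is a minimal sketch of one epoch of minibatch gradient descent on toy linear-regression data (the data, model, and learning rate are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 20_000, 50                        # dataset size and minibatch size
x = rng.normal(size=n)
y = 2.0 * x + 0.1 * rng.normal(size=n)   # toy data: y = 2x + noise

w, alpha = 0.0, 0.01                     # single weight and learning rate

order = rng.permutation(n)               # reshuffle the dataset once
for start in range(0, n, m):             # n / m = 400 weight updates in this epoch
    batch = order[start:start + m]
    xb, yb = x[batch], y[batch]
    grad = np.mean(2.0 * (w * xb - yb) * xb)   # gradient of the mean squared error on the batch
    w = w - alpha * grad                       # one update per minibatch

print(w)  # close to the true slope 2.0
```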

So far so good. Next, let's discuss how we can choose the batch size.

Choosing the batch size

The batch size that we choose is typically a power of 2 that can easily be accommodated in the available hardware memory. Since the gradient computed on a batch is only a proxy for the gradient computed on all of the training data points, how much error do we introduce when we calculate the gradient on batches rather than on the entire training dataset?

To answer this, we use the fact that the standard deviation of the sample mean is the standard deviation of the data divided by the square root of the sample size.

$$\sigma(\hat\mu) = {\sigma(x) \over \sqrt m}$$

where \(\sigma(\hat\mu)\) is the standard deviation of the sample mean, \(\sigma(x)\) is the standard deviation of the entire dataset, and \(m\) is the sample size.

We want \(\sigma(\hat\mu)\) to be small, which in turn means that \(m\) should be large; hence, there is a trade-off. It is also evident that as the batch size \(m\) increases by 100 times, \(\sigma(\hat\mu)\) decreases by only 10 times. This is the law of diminishing returns. Hence, to exploit GPU parallelism as well as to reduce \(\sigma(\hat\mu)\), the batch sizes generally used are 32, 64, 128, etc.
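A quick numerical sketch of this diminishing-returns effect, estimating \(\sigma(\hat\mu)\) empirically for two batch sizes that differ by a factor of 100:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # sigma(x) = 1

for m in [32, 3200]:   # increasing the batch size 100 times...
    sample_means = [rng.choice(population, size=m).mean() for _ in range(2000)]
    # ...shrinks the standard deviation of the sample mean only ~10 times (1/sqrt(m)).
    print(m, np.std(sample_means))
```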

NOTE: The batch size is also adjusted by considering the number of features in the input data point because of memory constraints on the hardware.

Momentum-based Gradient Descent

Let's start by understanding one more type of gradient descent optimization i.e., the momentum-based gradient descent method. This is also a first-order optimization scheme.

In vanilla gradient descent, to make an update to the parameter \(\theta\), we were doing:

$$\theta_t = \theta_{t-1} - \alpha g$$

where \(g\) is the gradient \(\partial f \over \partial \theta\) evaluated at the current parameters. In momentum-based gradient descent, we do

$$\theta_t = \theta_{t-1} - \alpha \mu_t$$

$$where, \mu_t = v.\mu_{t-1} + g$$

\(\mu\) is known as the momentum. It is a decaying sum of the gradients of the previous iterations, where the decay factor \(v\) becomes a hyperparameter. The subscript \(t\) refers to the iteration number, unlike in the previous segments. Also, \(\theta\), \(\mu\), and \(g\) are vectors holding the corresponding values for all the parameters.
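A minimal sketch of these updates on a toy two-dimensional bowl that is steep in one direction and shallow in the other (the function and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

def grad(theta):
    # Gradient of f(theta) = 0.5 * theta_1^2 + 10 * theta_2^2: shallow along theta_1, steep along theta_2.
    return np.array([theta[0], 20.0 * theta[1]])

theta = np.array([5.0, 1.0])
mu = np.zeros(2)              # momentum vector, one entry per parameter
alpha, v = 0.01, 0.9          # learning rate and momentum factor

for _ in range(200):
    g = grad(theta)
    mu = v * mu + g           # oscillating components cancel out, consistent ones accumulate
    theta = theta - alpha * mu

print(theta)  # approaches the minimum at (0, 0)
```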

There are some benefits to momentum-based gradient descent. Even if the learning rate is big and plain gradient descent would jump back and forth, this jumping does not affect us as much when we apply momentum-based gradient descent.

The reason is that the unstable gradient components, which cause the jumping back and forth, keep cancelling each other out, while the other components, though small, keep accumulating and ultimately take us to the minimum. Hence, this helps in reaching the minimum faster. This can be visualized better using the following figure:

Gradient descent

We can see in the diagram shown above that, upon addition, a and b tend to cancel each other while c and d accumulate.

On a side note, there is also another reason why Newton's method (which we will discuss later) is not a preferred choice in neural network optimization. For now, it is enough to know that Newton's method moves towards the nearest critical point while minimizing a function. Since a critical point is far more likely to be a saddle point than a local minimum, Newton's method keeps getting attracted towards saddle points, which is not the case with momentum-based gradient descent.

Adagrad

This is a variant of the momentum-based method. It does what the momentum-based gradient method does, but goes one step further and adds two extra things:

  • Adaptive Learning Rate: \(\alpha\) adapts to the shape of the function
  • Differential Learning Rate: \(\alpha\) is now a vector with different values for different features

It will automatically adjust the learning rates based on the shape of the function. Let's understand a mathematical explanation of how Adagrad helps achieve this.

Adagrad decides the learning rate by looking at the shape of the function. It considers \(g^2\) instead of \(g\) in the momentum equation.

$$r_t = r_{t-1} + \rho.g^2$$

$$\theta_t = \theta_{t-1} - {\epsilon \over \delta + \sqrt{r_t}}\otimes g$$

\(g^2\) tends to capture the curvature of the function, giving an adaptive learning rate. Note that \(\epsilon \over \delta + \sqrt{r_t}\) is the new learning rate vector, where \(\epsilon\) and \(\delta\) are scalar hyperparameters. Since there is an element-wise multiplication (\(\otimes\)) between the learning rate vector \(\epsilon \over \delta + \sqrt{r_t}\) and the gradient vector \(g\), each parameter gets its own, differential learning rate.

We have now introduced two more hyperparameters, \(\epsilon\) and \(\delta\), instead of the single \(\alpha\) present in momentum gradient descent. This might look like a demerit, but the values of these hyperparameters do not affect training as much as \(\alpha\) did earlier.

There are problems with Adagrad. Because \(r_t\) only accumulates, the squared gradients \(g^2\) keep adding up, so the effective learning rate \(\epsilon \over \delta + \sqrt{r_t}\) keeps shrinking. For parameters that have accumulated large values of \(g^2\), there is a high chance that the updates effectively stop before a good minimum is reached.
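A minimal sketch of the Adagrad updates on the same kind of toy bowl (taking the factor \(\rho = 1\) and using illustrative values for \(\epsilon\) and \(\delta\)):

```python
import numpy as np

def grad(theta):
    # Gradient of f(theta) = 0.5 * theta_1^2 + 10 * theta_2^2.
    return np.array([theta[0], 20.0 * theta[1]])

theta = np.array([5.0, 1.0])
r = np.zeros(2)                 # per-parameter accumulation of squared gradients
rho = 1.0                       # accumulation factor from the equation above
epsilon, delta = 1.0, 1e-8      # step-size numerator and numerical-stability constant

for _ in range(500):
    g = grad(theta)
    r = r + rho * g * g                                   # r_t = r_{t-1} + rho * g^2 (element-wise)
    theta = theta - (epsilon / (delta + np.sqrt(r))) * g  # a different learning rate per parameter

print(theta)  # both parameters move towards the minimum at (0, 0), each at its own rate
```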

Next, let's understand how we can modify Adagrad to tackle these problems.

RMSProp and Adam

To overcome the problem discussed above, we take the exponentially weighted average in RMSProp as follows:

$$r_t = \rho r_{t-1} + (1-\rho).g^2$$

$$r_t = {r_t \over 1 - \rho^t}$$

$$\theta_t = \theta_{t-1} - {\epsilon \over \delta + \sqrt{r_t}}\otimes g$$

If, for example, we take \(\rho=0.1\), we can see that RMSProp gives a higher weightage to the current gradient and less to the older accumulated gradients. This helps mitigate the problem faced by Adagrad. There is also a bias correction, \(r_t = {r_t \over 1 - \rho^t}\), which needs to be applied when using the exponentially weighted average.

Adam is yet another gradient descent optimization method; it combines momentum and the Adagrad-style scaling, both computed as exponentially weighted averages with the corresponding bias corrections.

$$\mu_t = \rho_{\mu}\mu_{t-1} + (1-\rho_{\mu}).g$$

$$\mu_t = {\mu_t \over 1 - \rho^t_{\mu}}$$

$$r_t = \rho_r r_{t-1} + (1-\rho_r).g^2$$

$$r_t = {r_t \over 1 - \rho^t_r}$$

$$\theta_t = \theta_{t-1} - {\epsilon \over \delta + \sqrt{r_t}}\otimes \mu_t$$

We can see that there are quite a few hyperparameters involved in Adam, such as \(\epsilon\), \(\delta\), \(\rho_{\mu}\), and \(\rho_r\), which we need to be mindful of for the problem statement we are solving.
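Putting the equations together, here is a minimal sketch of the Adam updates on the same toy bowl (all hyperparameter values are illustrative assumptions):

```python
import numpy as np

def grad(theta):
    # Gradient of f(theta) = 0.5 * theta_1^2 + 10 * theta_2^2.
    return np.array([theta[0], 20.0 * theta[1]])

theta = np.array([5.0, 1.0])
mu, r = np.zeros(2), np.zeros(2)
rho_mu, rho_r = 0.9, 0.999       # decay rates of the two exponentially weighted averages
epsilon, delta = 0.1, 1e-8       # step size and numerical-stability constant

for t in range(1, 1001):
    g = grad(theta)
    mu = rho_mu * mu + (1 - rho_mu) * g        # momentum (first moment)
    r = rho_r * r + (1 - rho_r) * g * g        # squared gradients (second moment)
    mu_hat = mu / (1 - rho_mu ** t)            # bias corrections
    r_hat = r / (1 - rho_r ** t)
    theta = theta - (epsilon / (delta + np.sqrt(r_hat))) * mu_hat

print(theta)  # ends up close to the minimum at (0, 0)
```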

Also, we might want to know which of the optimization methods studied above is the best. To answer that, check whether someone has already solved a problem similar to the one we are attempting, see which optimizer was used, start with that optimizer, and keep modifying from there. It's all experimental, after all.

Second-order iterative schemes

Second-order iterative schemes are iterative schemes that utilize local information of the first derivatives and the second derivatives of the objective function. Most notably, we have the Newton method (NM), which makes use of the Hessian of the objective function.

Calculating Hessian

Just as the gradient \(({\partial f \over \partial x_1}, . . ., {\partial f \over \partial x_n})\) measures the rate of change of the function value, the curvature measures the rate of change of the gradient and is a second-order differential. In an n-dimensional parameter space, this second-order differential is the Hessian, defined as a matrix whose elements are as follows:

$$Hessian = [{\partial^2f \over \partial x_i \partial x_j}]$$

The Hessian is a symmetric matrix of second derivatives that tells you about the curvature of a function. When we calculate the eigenvalues and eigenvectors of the Hessian:

The eigenvalues describe the magnitude of the curvature in certain directions. Big eigenvalues mean the curve is steep in that direction and small eigenvalues mean the curve is relatively flat in that direction. And, eigenvectors represent the directions in which the curvature occurs.
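A small NumPy sketch of this, for a toy quadratic whose curvature is very different in the two directions:

```python
import numpy as np

# Hessian of the toy function f(x, y) = 50 * x^2 + 0.5 * y^2: steep along x, flat along y.
H = np.array([[100.0, 0.0],
              [0.0,   1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(H)  # eigh: the Hessian is symmetric
print(eigenvalues)    # [1., 100.]: magnitudes of curvature along the eigen-directions
print(eigenvectors)   # columns are the directions in which that curvature occurs
```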

The curvature, i.e., the Hessian defined above, helps in minimizing the objective function if we are not happy with first-order gradient optimization. But this turns out to be problematic as well if the curvature is very large in one direction and small in another direction.

Imagine trying to walk down a narrow and steep valley. In one direction, it's extremely steep, so you have to carefully take tiny steps. In the other direction, it's flat, so your progress is slow. This mismatch makes your descent frustrating and inefficient.

Similarly, a large Hessian eigenvalue in one direction means the curvature is steep (the function changes rapidly), while a small Hessian eigenvalue in another direction means the curvature is relatively flat (the function changes slowly). This makes it difficult to calculate the inverse of the Hessian, which is needed when minimizing the Taylor series expansion. Let's understand this concept next.

Taylor series expansion

The Taylor series is a local approximation of a function value in terms of polynomials as our overall function is very complex and difficult to understand. So we attempt to understand the function locally. It is a widely used approximation technique. 

Newton's method of optimizing (sometimes referred to as the Newton-Raphson method) uses Taylor series expansion for the minimization. Any function \(f\) can be written in the form of Taylor series as:

$$f(x) = f(x_0) + {\partial f^T \over \partial x} (x-x_0) + {1 \over 2}(x-x_0)^TH(x-x_0)$$

where \(\partial f \over \partial x\) and \(H\) are calculated at \(x_0\). It has a first-order term which is a gradient and a second-order term which is a Hessian.

There are a few key things to note here. The Taylor series approximation is valid locally. Also, we generally don't solve the terms of order higher than two as there is a trade-off. Though the approximation will be better, it becomes computationally very expensive.

We can see from the above Taylor series expansion that we are trying to estimate the step we need to take for the new point by looking at the shape of the function.

The eigenvalues of the Hessian correspond to the curvature along certain directions. Consider two directions: one along a ridge, where the function doesn't change rapidly, so the gradient is small and the corresponding eigenvalue is small; and another towards the centre (which takes the point down), where the function changes much faster, so the gradient is big and the corresponding eigenvalue is also big. This creates a problem in calculating the inverse of the Hessian matrix, as we have discussed earlier.

Now if we want to find the minima of the Taylor series expansion approximation, we differentiate w.r.t \(x\) and we get:

$$g + H(x-x_0) = 0$$

$$x = x_0 - H^{-1}g$$

where \({\partial f \over \partial x} = g\) and we get the minimum of the 'approximation function'. 

Please note that the problem of highly different eigenvalues (i.e., one eigenvalue very large and another very small) shows up here as we invert the Hessian. When computing the Hessian inverse, we use numerical methods that effectively divide by the eigenvalues. If one eigenvalue is very large and another very small:

  • The division by the small eigenvalue causes instability (values blow up).
  • The Taylor series approximation relies on the accurate computation of the Hessian inverse, and these numerical instabilities make the process unreliable.

Such matrices are known as ill-conditioned matrices.
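To make this concrete, here is a minimal sketch of a single Newton step on a toy quadratic, together with the condition number that quantifies how ill-conditioned the Hessian is:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T H x: its gradient is H @ x and its minimum is at the origin.
H = np.array([[100.0, 0.0],
              [0.0,   1.0]])
x0 = np.array([1.0, 1.0])
g = H @ x0

x_new = x0 - np.linalg.solve(H, g)   # Newton step: x = x0 - H^{-1} g
print(x_new)                         # lands exactly on the minimum for a quadratic

# The condition number (largest eigenvalue / smallest eigenvalue) measures ill-conditioning;
# very large values make computing H^{-1} numerically unreliable.
print(np.linalg.cond(H))             # 100 for this toy Hessian
```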

Alternate approach to minimize Taylor series expansion

Let's look at an alternative approach to minimizing the Taylor series approximation, which will help in deciding the correct learning rate. We know from gradient descent that to find the minimum we move in the negative gradient direction. Hence,

$$x = x_0 - \alpha g$$

$$x - x_0= - \alpha g$$

Substituting this in the Taylor series expansion above, we get

$$f(x) = f(x_0) - \alpha g^Tg + {1 \over 2}\alpha^2g^THg$$

Minimizing this expression with respect to \(\alpha\) by differentiating it with respect to \(\alpha\) and setting the derivative to 0, we get

$$\alpha = {g^Tg \over g^THg}$$

This is how we can get the learning rate with the help of the gradient and the curvature. This method is also known as line search as we move along the line described by the gradient.

Let's further decode the formula that we have to get the \(\alpha\).

In the denominator, if \(g\) happens to be an eigenvector of the Hessian with eigenvalue \(v\), then \(g^THg = v\,g^Tg\), so the learning rate becomes \(\alpha = {g^Tg \over v\,g^Tg} = {1 \over v}\).

Hence, in directions where the curvature (eigenvalue) is large, the learning rate/step size should be small, and vice versa. In cases where the objective function value is not decreasing, one should check the value of the gradient norm \(g^Tg\) to see whether we are stuck in a flat region, and then proceed accordingly.
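Here is a minimal sketch of this line-search rule on a toy quadratic (the Hessian and the starting point are illustrative assumptions):

```python
import numpy as np

H = np.array([[10.0, 0.0],
              [0.0,  1.0]])        # Hessian of f(x) = 0.5 * x^T H x
x = np.array([1.0, 1.0])

for _ in range(100):
    g = H @ x                       # gradient of the toy quadratic
    alpha = (g @ g) / (g @ H @ g)   # step size chosen from the gradient and the curvature
    x = x - alpha * g               # move along the negative gradient by that step

print(x)  # converges to the minimum at (0, 0)
```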

This is how a local approximation can be performed for a given complex function.

Bonus: Problem of Vanishing and Exploding Gradients in neural networks

One of the major problems we face while building deep neural networks is vanishing and exploding gradients, which appears as we train deeper and deeper networks. Since it is a problem that occurs during backward propagation, let's briefly summarize the steps involved in backpropagation before understanding why the gradients vanish or explode.

NOTE: For a detailed explanation, refer to the post, In-depth understanding of Neural Networks and its working.

For every layer \(l\) in the network's layers \(L, L-1, . . . ., 1\), the following steps are performed during backward propagation:

$$dH^l = (W^{l+1})^T.DZ^{l+1}$$

$$DZ^l = dH^l \otimes\sigma^{'}(Z^l)$$

$$DW^l = DZ^l.(H^{l-1})^T$$

$$Db^l = DZ^l$$

We can see that matrix \(DZ^{l+1}\) is used in the calculation of \(dH^{l}\), \(dH^{l}\) is used in the calculation of \(DZ^{l}\), \(DZ^{l}\) is used in the calculation of \(DW^{l}\) and so on from layer L to 1. In other words, \(DZ^{l}\) can be expressed in terms of the matrix \(DZ^{l+1}\).

In the post, Principal Component Analysis Detailed Explanation, we have covered SVD. I would recommend going through it if you want to learn/revise how SVD works. 

Using SVD, we can express a matrix M as:

$$M=UDV^T$$

where \(D\) is a diagonal matrix and \(U\) and \(V\) are orthonormal matrices, which means that \(U^TU = I\) and \(V^TV = I\), where \(I\) is the identity matrix. Note that \(U\) and \(V\) need not be of the same dimensions, which means that \(D\) is not necessarily a square matrix.

Now, for simplification purposes, let's take a network where all layers have the same number of neurons and, further, assume that all the weight matrices are the same, meaning \(W^1\) = \(W^2\) =....= \(W^L\), where L is the number of layers. Moving forward, we can write:

$$DZ^1 = (DZ^{L+1})^L.F$$

Where \(F\) is the multiplication of the rest of the matrices. We shall ignore it and concentrate on \((DZ^{L+1})^L\) henceforth. Upon applying SVD, if we consider \(DZ^{L+1}\) to be some matrix \(M =UDV^T\), we have

$$DZ^1 = (UDV^T)^L=(D)^L$$

Hence, \(DZ^{1}\) is a diagonal matrix with all elements of \(D\) raised to a power \(L\), the number of layers.

Consider that the layers use a sigmoid activation. The gradient of the sigmoid activation function is \({e^{-x} \over (1+e^{-x})^2}\), which is always less than 1. Hence, if a singular value \(v_1\) computed during backpropagation through the sigmoid is less than 1, raising it to the power 'L' will drive the corresponding element of \(DZ^1\) towards 0, causing the vanishing gradient problem.
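A tiny numerical illustration of why this happens: the sigmoid's derivative is at most 0.25, and raising a factor below 1 to the power of the depth shrinks it exponentially (the depths below are arbitrary):

```python
import numpy as np

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = e^(-x) / (1 + e^(-x))^2, which is at most 0.25 (at x = 0).
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

per_layer_factor = sigmoid_derivative(0.0)   # 0.25, the largest value it can ever take
for L in [5, 10, 30]:
    print(L, per_layer_factor ** L)          # shrinks exponentially with the number of layers
```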

On the other hand, if the matrix is ill-conditioned (as defined above), an exploding gradient problem might occur. Some of the ways to mitigate the vanishing and exploding gradient problems are careful weight initialization, non-saturating activations such as ReLU, batch normalization, gradient clipping, and residual connections.

Summary

Optimization lies at the heart of deep learning, enabling models to improve their performance by minimizing loss functions. From gradient descent to advanced techniques like momentum-based methods, Adagrad, and Adam, optimization methods adapt to the diverse landscapes of objective functions.

Understanding the role of surrogate loss functions, iterative optimization schemes, and challenges like vanishing gradients provides key insights into how neural networks learn effectively. With robust optimization strategies, we can tackle problems like saddle points, improve learning rates, and enhance model generalization.

A strong grasp of these concepts is essential for building efficient and accurate machine-learning models.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Living in Hyderabad and working as a research-based Data Scientist specializing in improving key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with out-of-the-box data wrangling capabilities such as outlier treatment, data discovery, and data transformation, with a focus on yielding high-quality results.