What is the vanishing gradient problem, with an example?

A comparison across ANN, CNN and RNN

Let’s first understand what a gradient is. For a function of a single independent variable, y = f(x), the gradient is simply the derivative f'(x), i.e. the slope of the tangent drawn at a point (x0, y0) on the curve. For a function of multiple independent variables, z = f(x, y) (where x and y are the independent variables), the gradient is the vector ∇f(x, y) = [∂f/∂x, ∂f/∂y], made up of the partial derivatives of that function.

The gradient is a vector quantity. Its direction, ∇f(x, y), points in the direction of steepest ascent, i.e. the way to move in order to climb the surface fastest (picture the surface as a mountain). Its magnitude, ||∇f(x, y)||, tells how steep that ascent is: the rate at which the function increases in that direction.
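As a quick illustration (a minimal NumPy sketch of my own, not part of the original example), here is the gradient of f(x, y) = x² + y² at the point (1, 2): the vector gives the direction of steepest ascent and its norm gives the rate of that ascent.

```python
import numpy as np

# A minimal sketch: the gradient of f(x, y) = x**2 + y**2 at the point (1, 2).
def f(x, y):
    return x**2 + y**2

def gradient(x, y):
    # Analytic partial derivatives: df/dx = 2x, df/dy = 2y
    return np.array([2.0 * x, 2.0 * y])

g = gradient(1.0, 2.0)
print("gradient (direction of steepest ascent):", g)              # [2. 4.]
print("magnitude (rate of steepest ascent):", np.linalg.norm(g))  # ~4.472

# Numerical check with finite differences, which should match the analytic values.
eps = 1e-6
numeric = np.array([(f(1.0 + eps, 2.0) - f(1.0, 2.0)) / eps,
                    (f(1.0, 2.0 + eps) - f(1.0, 2.0)) / eps])
print("numerical check:", numeric)                                # close to [2, 4]
```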

The discussion above is just the mathematical way of explaining gradients. Now let’s understand the problem of vanishing gradients in ANNs (artificial neural networks). Take the example of handwritten digit recognition using an ANN. Suppose the data-set consists of input images of dimension 28×28, i.e. 784 pixels each. These 784 pixel values form the input layer of the network.

Now, the next step is to decide the number of hidden layers and the number of neurons in each hidden layer. This can only be decided using intuition and/or experimentation. Let’s say we have two hidden layers with five neurons in each layer.

Since we are doing a handwritten digit recognition task, the output layer is fixed: it has ten neurons (the digits can only be 0-9). Every layer is connected to its adjacent layer, with each neuron of the previous layer connected to every neuron of the next layer. Each of these connections has a weight associated with it. The network is trained to find the values of these weights that solve the problem.
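As a rough sketch of that architecture (the variable names and the random initialisation scale below are my own assumptions, not from the post), the layer sizes 784 → 5 → 5 → 10 give one weight matrix and one bias vector per pair of adjacent layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the example: 784 input pixels, two hidden layers of 5, 10 output digits.
layer_sizes = [784, 5, 5, 10]

# One weight matrix and one bias vector per connection between adjacent layers.
weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

for W in weights:
    print(W.shape)   # (5, 784), (5, 5), (10, 5)
```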

Every neuron is a combination of two functions: a summation and an activation function. The summation part is straightforward, whereas the activation function has to be chosen; the options include sigmoid, tanh, ReLU etc. Here we will use the sigmoid activation function to understand the problem of vanishing gradients, because the flat tails of its S-shaped curve are what can introduce the problem.
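A small sketch of the sigmoid and its derivative makes the issue visible: the derivative peaks at 0.25 when the input is 0 and falls towards zero as the neuron saturates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, round(sigmoid_derivative(x), 6))
# 0.0  0.25       <- the derivative never exceeds 0.25
# 2.0  0.104994
# 5.0  0.006648
# 10.0 4.5e-05    <- nearly zero once the neuron saturates
```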

After deciding the above parameters, our next task is to choose the loss function used to train the network. Here we take the cross-entropy loss, which is commonly used for classification problems.
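For reference, here is a minimal sketch of the cross-entropy between a one-hot actual label and a predicted probability vector (the example numbers are made up for illustration):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot actual label, y_pred: predicted probabilities (e.g. a softmax output)
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.zeros(10)
y_true[3] = 1.0                                   # the image is actually a "3"
y_pred = np.full(10, 0.05)
y_pred[3] = 0.55                                  # the network is 55% confident it is a "3"

print(cross_entropy(y_true, y_pred))              # ~0.598, i.e. -log(0.55)
```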

Our next task is to train the network. Initially, the weights are assigned at random. Training then alternates between a forward pass and a backward pass. During the forward pass, the network is exposed to the pixels of an input image, i.e. their intensity values. These values are passed through the network to calculate an output value, called the predicted output. Since classification is a supervised learning process, the data-set also has the actual output (label) associated with every input image.

Therefore, the cross-entropy loss is calculated between the predicted output and the actual output, and this loss is used to tune the weights of the network. Now the backward pass takes place, using the back-propagation algorithm. This algorithm applies the chain rule to compute the gradient of the loss with respect to every weight, working layer by layer from the output back towards the input, and each weight is then nudged in the direction that reduces the loss. Because every step backwards multiplies the error signal by the derivative of the sigmoid, which is never larger than 0.25, the gradients that reach the earliest layers can become vanishingly small, and those layers barely learn. This is the vanishing gradient problem.
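Putting the pieces together, here is a minimal NumPy sketch (an illustration under the same 784-5-5-10 layout, not the exact code behind the post) of one forward pass, the cross-entropy loss and one backward pass. It prints the average gradient magnitude per layer, which shrinks as we move back from the output layer towards the input layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
layer_sizes = [784, 5, 5, 10]              # input pixels, two hidden layers, ten digits
weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

# One (input, label) pair: stand-in pixel intensities and the one-hot digit "3".
x = rng.random(784)
y_true = np.zeros(10)
y_true[3] = 1.0

# Forward pass: sigmoid hidden layers, softmax output.
activations, zs = [x], []
for i, (W, b) in enumerate(zip(weights, biases)):
    z = W @ activations[-1] + b
    zs.append(z)
    activations.append(softmax(z) if i == len(weights) - 1 else sigmoid(z))

loss = -np.sum(y_true * np.log(activations[-1] + 1e-12))     # cross-entropy loss
print("loss:", loss)

# Backward pass: propagate the error signal layer by layer with the chain rule.
delta = activations[-1] - y_true           # output-layer error (softmax + cross-entropy)
for l in reversed(range(len(weights))):
    grad_W = np.outer(delta, activations[l])                  # dLoss/dW for this layer
    print(f"layer {l + 1}: average |dLoss/dW| = {np.abs(grad_W).mean():.6f}")
    if l > 0:
        s = sigmoid(zs[l - 1])
        # Each step backwards multiplies by sigmoid'(z) = s * (1 - s) <= 0.25,
        # which is what makes the gradient shrink towards the earlier layers.
        delta = (weights[l].T @ delta) * s * (1.0 - s)
```

In a real training loop each weight would then be updated as W = W - learning_rate * grad_W. Swapping the sigmoid for ReLU, whose derivative is 1 for positive inputs, is one common way to mitigate the vanishing gradient problem.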

To understand the concept of the vanishing gradient with the help of the sigmoid function in more detail, please refer to the link below:

https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484

