New Ways to Optimize Gradient Descent
These techniques can take your deep learning application to the next level of accuracy.
Deep learning defines the current era of machine learning and artificial intelligence. It delivers remarkable accuracy, but it also has a huge appetite for data. By employing neural networks, functions of ever-greater complexity can be fitted to given data points.
But there are a few precise techniques that make working with neural networks far more effective and insightful.
Xavier Initialization
Let us assume that we have trained a very deep neural network. For simplicity, the bias term is zero and the activation function is the identity.
Under these conditions, the prediction of the target variable can be written as the product of the weights of every layer applied to the input a[0]: ŷ = W[L] W[L-1] ... W[2] W[1] a[0].
For ease of understanding, let us consider all hidden-layer weights to be equal, i.e. W[1] = W[2] = ... = W[L-1] = W.
We treat the last weight W[L] separately because it produces the output value; in binary classification it may feed a sigmoid (or ReLU) activation.
Substituting the shared weight into the expression, we obtain a new expression for the prediction: ŷ = W[L] · W^(L-1) · a[0].
Let us consider two different situations for the weights.
In case 1, where the entries of W are greater than 1, raising W to the power L-1 in a very deep network makes ŷ extremely large. Likewise, in case 2, where the entries are smaller than 1, ŷ becomes exponentially small. These are the exploding and vanishing gradient problems. They hurt the accuracy of gradient descent and demand far more time to train the model.
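The two cases can be sketched numerically. Below is a minimal NumPy simulation of the setup above, with hypothetical choices of depth (L = 50), layer width (n = 100), and shared weight matrices W = 1.5·I and W = 0.5·I; the identity activation and zero bias match the simplifying assumptions made earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 50                          # hypothetical network depth
n = 100                         # hypothetical units per layer
a0 = rng.standard_normal(n)     # input activations a[0]

results = {}
for scale in (1.5, 0.5):        # case 1: entries > 1; case 2: entries < 1
    W = np.eye(n) * scale       # identical weights in every layer
    a = a0.copy()
    for _ in range(L - 1):      # identity activation, zero bias
        a = W @ a
    results[scale] = np.abs(a).mean()

print(results)  # scale 1.5 explodes to a huge value, scale 0.5 collapses toward zero
```

Even a modest deviation from 1 in the weights, compounded over 49 layers, sends the activations to astronomically large or vanishingly small values.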
To avoid these situations we need to initialize our weights more carefully and systematically. One way of doing this is Xavier initialization.
If we consider a single neuron, as in logistic regression, the dimension of the weight vector is determined by the dimension of a single input example. We can therefore set the variance of each weight to 1/n, where n is the number of input features: as the input dimension grows, each weight shrinks accordingly, keeping the neuron's output at a stable scale.
Applying this technique to deeper neural networks, the weight initialization for layer l can be expressed as W[l] = np.random.randn(shape) * sqrt(1 / n[l-1]), where n[l-1] is the number of units in the previous layer.
Similarly, other variance choices can be multiplied with the randomly initialized weights; for example, He initialization uses sqrt(2 / n[l-1]), which works well with ReLU activations.
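The scaling rule above can be sketched as a small helper. This is a minimal version, with hypothetical layer sizes (784 inputs, two hidden layers, 1 output); the function name xavier_init is my own.

```python
import numpy as np

def xavier_init(layer_dims, rng=None):
    """Draw weights with Var(W[l]) = 1 / n[l-1] (Xavier/Glorot scaling)."""
    rng = rng or np.random.default_rng(0)
    weights = []
    for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]):
        # standard normal, scaled so the variance is 1 / (fan-in)
        W = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
        weights.append(W)
    return weights

# hypothetical architecture: 784 inputs -> 256 -> 64 -> 1 output
weights = xavier_init([784, 256, 64, 1])
print([W.shape for W in weights])  # [(256, 784), (64, 256), (1, 64)]
```

For He initialization, the only change would be replacing sqrt(1 / n_in) with sqrt(2 / n_in).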
Improving Gradient Computation
Let us consider the function f(x) = x³ and compute its gradient at x = 1. This simple function makes the concept easy to follow and appreciate. By differentiation, f'(x) = 3x², so the slope at x = 1 is exactly 3.
Now, let us approximate the slope at x = 1 numerically. We evaluate the function at x = 1 + delta, where delta is a very small quantity (say 0.001). The approximate slope is then (f(1 + delta) − f(1)) / delta, the slope of the hypotenuse of a small triangle on the curve.
Hence the slope is 3.003, with an error of 0.003. Now, let us define the difference differently and calculate the slope again.
This time we take the slope of the wider triangle spanning 1 − delta to 1 + delta: (f(1 + delta) − f(1 − delta)) / (2 · delta). Calculating the slope with this two-sided (centered) difference reduces the error dramatically, to about 0.000001.
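The comparison above can be reproduced in a few lines; both estimates use the same delta, so the accuracy gap comes entirely from how the difference is taken.

```python
def f(x):
    return x ** 3

delta = 0.001

# one-sided (forward) difference: error shrinks linearly in delta
one_sided = (f(1 + delta) - f(1)) / delta

# two-sided (centered) difference: error shrinks quadratically in delta
centered = (f(1 + delta) - f(1 - delta)) / (2 * delta)

print(one_sided)  # ~3.003, error on the order of 1e-3
print(centered)   # ~3.000001, error on the order of 1e-6
```

The centered formula cancels the leading error term of the one-sided version, which is why the error drops from about delta to about delta squared.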
Hence, we can infer that defining the slope in this manner gives a much better numerical estimate of a function's gradient. This improves the accuracy of gradient computation, and hence of gradient descent.
One thing to note: the two-sided formula requires two function evaluations instead of one, so the extra accuracy comes at roughly double the computational cost per gradient.