# New Ways for Optimizing Gradient Descent

## These ways will take your deep learning application to the next level of accuracy.

Deep learning defines the current era of machine learning and artificial intelligence. Neural networks can reach remarkable accuracy, but they also have a huge hunger for data: by stacking layers, they can fit functions of far greater complexity to the given data points.

But there are a few precise techniques that make working with neural networks far more effective.

# Xavier Initialization

Let us assume that we are training a very deep neural network. For simplicity, assume the bias term is zero and the activation function is the identity.

Under these conditions, the gradient descent update is the usual $w := w - \alpha \, \partial J / \partial w$, and the prediction of the target variable can be written directly in terms of the weights of all layers and the input $x$:

$$\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x$$

For ease of understanding, let us consider all weights up to the last layer to be equal, i.e.

$$W^{[l]} = W \quad \text{for } l = 1, \dots, L-1$$

Here we have treated the last weight matrix $W^{[L]}$ differently because it produces the output value; in binary classification, the output layer typically applies a sigmoid activation.

Substituting the equal weights into the expression above, we obtain a new expression for the prediction of the target variable:

$$\hat{y} = W^{[L]} \, W^{L-1} x$$

Let us consider two different situations for the weights: entries slightly larger than 1 (e.g. $W = 1.5\,I$) and entries slightly smaller than 1 (e.g. $W = 0.5\,I$).

In case 1, raising $W$ to the power $L-1$ in a very deep network makes the value of $\hat{y}$ explode; likewise, in case 2, the value of $\hat{y}$ becomes exponentially small. These are the exploding and vanishing gradient problems, respectively. They hurt the accuracy of gradient descent and greatly increase training time.
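The effect of repeatedly multiplying by a matrix slightly above or below the identity can be seen numerically. This is a minimal sketch; the depth $L = 50$ and the factors 1.5 and 0.5 are illustrative choices, not prescribed values.

```python
import numpy as np

L = 50            # assumed network depth, for illustration
x = np.ones(2)    # toy input with two features

# Case 1: weights slightly above the identity -> activations explode
W_big = 1.5 * np.eye(2)
y_big = x
for _ in range(L):
    y_big = W_big @ y_big

# Case 2: weights slightly below the identity -> activations vanish
W_small = 0.5 * np.eye(2)
y_small = x
for _ in range(L):
    y_small = W_small @ y_small

print(y_big[0])    # 1.5**50: astronomically large
print(y_small[0])  # 0.5**50: essentially zero
```

Even at a modest depth of 50 layers, the two regimes differ by more than twenty orders of magnitude, which is exactly why gradients computed through such a product become useless.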

To avoid these problems we need to initialize our weights more carefully and systematically. One way of doing this is Xavier initialization.

If we consider a single neuron, as in logistic regression, its pre-activation is $z = w_1 x_1 + \dots + w_n x_n$, where $n$ is the dimension of a single input example. The more input features there are, the smaller each weight should be so that $z$ does not blow up. Setting the variance of each weight to $1/n$ keeps the variance of $z$ roughly constant as $n$ grows.
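We can verify this variance argument empirically. In this sketch the fan-in of 512 and the number of trials are arbitrary illustrative choices; inputs are drawn with unit variance, and the weights are scaled so that $\mathrm{Var}(w_i) = 1/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512           # fan-in of the neuron (illustrative)
trials = 10_000   # number of independent neurons to simulate

# Inputs with unit variance; weights scaled so Var(w_i) = 1/n
x = rng.standard_normal((trials, n))
w = rng.standard_normal((trials, n)) / np.sqrt(n)

# Pre-activation z = sum_i w_i * x_i for each trial
z = np.sum(w * x, axis=1)

print(np.var(z))  # close to 1.0, independent of n
```

Without the $1/\sqrt{n}$ scaling, the variance of $z$ would instead be $n$, growing linearly with the number of input features.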

Applying this idea layer by layer in a deeper network, the weight initialization for layer $l$ can be expressed as

$$\mathrm{Var}\left(W^{[l]}\right) = \frac{1}{n^{[l-1]}}$$

where $n^{[l-1]}$ is the number of units feeding into layer $l$.
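In code, this per-layer rule amounts to drawing standard normal weights and scaling them by $\sqrt{1/n^{[l-1]}}$. The layer sizes below are an assumed example architecture, not part of the method itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed layer sizes for illustration: 4 input features,
# two hidden layers of 8 units, scalar output
layer_dims = [4, 8, 8, 1]

params = {}
for l in range(1, len(layer_dims)):
    n_in, n_out = layer_dims[l - 1], layer_dims[l]
    # Xavier initialization: Var(W[l]) = 1 / n_in
    params[f"W{l}"] = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
    # Biases can safely start at zero
    params[f"b{l}"] = np.zeros((n_out, 1))

print(params["W1"].shape)  # (8, 4)
```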

Similarly, there are various other ways to define this variance and scale the randomly initialized weights; for example, He initialization uses a variance of $2/n^{[l-1]}$, which works better with ReLU activations.
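The two scalings differ only in the constant under the square root. A minimal comparison, with an assumed fan-in of 256:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in = 256  # assumed fan-in, for illustration

# Xavier (suits tanh/linear units): Var(W) = 1 / n_in
W_xavier = rng.standard_normal((128, n_in)) * np.sqrt(1.0 / n_in)

# He (suits ReLU units): Var(W) = 2 / n_in
W_he = rng.standard_normal((128, n_in)) * np.sqrt(2.0 / n_in)

print(W_xavier.var())  # ≈ 1/256
print(W_he.var())      # ≈ 2/256
```

The factor of 2 in He initialization compensates for ReLU zeroing out roughly half of the activations, which would otherwise halve the variance at every layer.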