A simplified description for Understanding the Mathematics of Deep Learning

Introduction

Understanding the significance of non-linearity

Training a Neural Network

Training steps

If you can work with a few simple mathematical concepts, such as partial derivatives and the chain rule, you can gain a much deeper understanding of how deep learning networks work.

A combination of the gradient descent and backpropagation algorithms is used to train a neural network, i.e. to minimise the total loss function.

The overall steps are:

1. Forward propagate the data points through the network to get the outputs
2. Use the loss function to calculate the total error
3. Use the backpropagation algorithm to calculate the gradient of the loss function with respect to each weight and bias
4. Use gradient descent to update the weights and biases at each layer
5. Repeat the above steps to minimize the total error
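
The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a full neural network: it assumes a single linear neuron with one weight `w` and one bias `b`, a mean squared error loss, and made-up data drawn from y = 2x + 1. The names `xs`, `ys`, `learning_rate` and `total_loss` are chosen here for illustration only.

```python
# Toy data generated from y = 2x + 1 (an assumption made for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.05   # step size (hypothetical value)


def total_loss(w, b):
    """Mean squared error of the neuron w * x + b over the whole data set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_loss = total_loss(w, b)

for epoch in range(2000):                       # 5. repeat
    # 1. Forward propagate every data point through the "network".
    preds = [w * x + b for x in xs]
    # 2. Use the loss function to calculate the total error.
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e ** 2 for e in errors) / len(xs)
    # 3. Backpropagation: the chain rule gives dL/dw and dL/db.
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(2 * e for e in errors) / len(xs)
    # 4. Gradient descent: move each parameter against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = total_loss(w, b)
print(w, b, final_loss)
```

With these assumed values the learned `w` and `b` approach 2 and 1, the parameters used to generate the toy data.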

Hence, in a single sentence: we propagate the total error backward through the connections in the network layer by layer, calculate the contribution (gradient) of each weight and bias to the total error in every layer, then use the gradient descent algorithm to optimize the weights and biases and so eventually minimize the total error of the neural network.

Explaining the forward pass and the backward pass

Forward Pass

The forward pass is the set of operations that transforms the network input into the output space. During inference, a neural network relies solely on the forward pass. Let’s consider a simple neural network with two hidden layers. Here we assume that every neuron, except those in the last layer, uses the ReLU activation function (the last layer uses softmax).

neuron 1 (top):

neuron 2 (bottom):

Rewriting this in matrix form, we get:
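
A minimal sketch of such a forward pass in matrix form, assuming two inputs, two neurons in each hidden layer and two outputs. The weight matrices `W1`, `W2`, `W3`, the bias vectors `b1`, `b2`, `b3` and all of their values are hypothetical. Each row of a weight matrix holds the weights of one neuron, so every layer computes W·x + b followed by its activation function.

```python
import math


def relu(v):
    """Element-wise ReLU: negative entries become 0."""
    return [max(0.0, x) for x in v]


def softmax(v):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(v)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def dense(W, b, x):
    """Compute W.x + b; each row of W holds the weights of one neuron."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


# Hypothetical parameters: 2 inputs -> 2 hidden -> 2 hidden -> 2 outputs.
W1, b1 = [[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]
W2, b2 = [[0.3, 0.8], [-0.6, 0.2]], [0.05, 0.0]
W3, b3 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]


def forward(x):
    h1 = relu(dense(W1, b1, x))         # first hidden layer (ReLU)
    h2 = relu(dense(W2, b2, h1))        # second hidden layer (ReLU)
    return softmax(dense(W3, b3, h2))   # output layer (softmax)


probs = forward([1.0, 2.0])
print(probs)
```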

Backward Pass

To calculate the error gradients, we first have to calculate the error (i.e. the overall loss). We can view the whole neural network as a composite function (a function composed of other functions). Using the chain rule, we can find the derivative of a composite function, which gives us the individual gradients. In other words, we can use the chain rule to apportion the total error to the various layers of the neural network. These gradients are what gradient descent then uses to minimise the loss.
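
To make the apportioning of the error concrete, here is a small sketch of a backward pass. It assumes a deliberately tiny network that is not taken from the figures in this article: one input `x`, one hidden ReLU neuron with weight `w1`, one linear output neuron with weight `w2`, no biases, and a squared error loss. All names and values are illustrative. The analytic gradients from the chain rule are compared against finite-difference estimates.

```python
def forward(w1, w2, x):
    z = w1 * x          # hidden pre-activation
    a = max(0.0, z)     # hidden activation (ReLU)
    y = w2 * a          # network output
    return z, a, y


def loss(w1, w2, x, target):
    _, _, y = forward(w1, w2, x)
    return (y - target) ** 2


def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    z, a, y = forward(w1, w2, x)
    dL_dy = 2.0 * (y - target)          # derivative of the squared error
    dL_dw2 = dL_dy * a                  # y = w2 * a, so dy/dw2 = a
    dL_da = dL_dy * w2                  # y = w2 * a, so dy/da = w2
    da_dz = 1.0 if z > 0 else 0.0       # derivative of ReLU
    dL_dw1 = dL_da * da_dz * x          # z = w1 * x, so dz/dw1 = x
    return dL_dw1, dL_dw2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0   # hypothetical values
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Finite-difference check of both gradients.
eps = 1e-6
num_w1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
num_w2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
print(grad_w1, num_w1)
print(grad_w2, num_w2)
```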

Reference and image source: under the hood of neural networks part 1 fully connected

A recap of the Chain Rule and Partial Derivatives

We can thus see the process of training a neural network as a combination of backpropagation and gradient descent. These two algorithms can be explained by understanding the chain rule and partial derivatives.

The Chain Rule

The chain rule is a formula for calculating the derivatives of composite functions. Composite functions are functions composed of other functions nested inside them. Given a composite function f(x) = h(g(x)), the derivative of f(x) is given by the chain rule as

f'(x) = h'(g(x)) · g'(x)

You can also extend this idea to more than two functions. For example, for a function f(x) composed of three functions A, B and C, we have

f(x) = A(B(C(x)))

The chain rule tells us that the derivative of this function equals:

f'(x) = A'(B(C(x))) · B'(C(x)) · C'(x)
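
As a worked check of the three-function case, the sketch below picks three concrete functions (an arbitrary choice made for illustration: A is the sine, B squares its input, and C(x) = 3x + 1) and compares the product of the three derivatives with a finite-difference estimate of the slope of f.

```python
import math


# Hypothetical choice of the three functions.
def C(x):
    return 3.0 * x + 1.0


def B(v):
    return v ** 2


def A(u):
    return math.sin(u)


def f(x):
    return A(B(C(x)))


def f_prime(x):
    """Chain rule: A'(B(C(x))) * B'(C(x)) * C'(x)."""
    dA = math.cos(B(C(x)))   # A'(u) = cos(u), evaluated at u = B(C(x))
    dB = 2.0 * C(x)          # B'(v) = 2v, evaluated at v = C(x)
    dC = 3.0                 # C'(x) = 3
    return dA * dB * dC


x0 = 0.5
eps = 1e-6
numeric = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)   # central difference
analytic = f_prime(x0)
print(analytic, numeric)
```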

Gradient descent is an iterative optimization algorithm used to find a local or global minimum of a function. The algorithm works using the following steps:

1. We start from a point on the graph of a function
2. We find the direction from that point, in which the function decreases fastest
3. We take a small step downhill in that direction to arrive at a new point, and then repeat from step 2
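
These steps can be sketched for a function of a single variable. The function f(w) = (w - 3)^2, the starting point and the step size below are all assumptions chosen for illustration; the minimum of this function is at w = 3.

```python
def f(w):
    return (w - 3.0) ** 2      # hypothetical function with its minimum at w = 3


def df(w):
    return 2.0 * (w - 3.0)     # its derivative


w = -4.0                        # 1. start from a point on the graph
step_size = 0.1                 # how far to travel at each step
history = [w]

for _ in range(100):
    direction = -df(w)                 # 2. the function decreases fastest against the slope
    w = w + step_size * direction      # 3. take a small step to a new point
    history.append(w)

print(w, f(w))
```

With a step size that is too large, the same loop can overshoot the minimum and diverge, which is why the step is kept small.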

The slope of a curve at a specific point is given by its derivative. However, since we are concerned with two or more variables (weights and biases), we need to consider partial derivatives. A gradient is a vector that stores the partial derivatives of a multivariable function. It gives the slope at a specific point for a function with multiple independent variables. We need partial derivatives because, for multivariable functions, we have to determine the impact of each individual variable on the output. Consider a function of two variables x and z. If we change x but hold z constant, we get one partial derivative. If we change z but hold x constant, we get another partial derivative. Together, these partial derivatives form the gradient of the multivariable function.
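
The two-variable case can be illustrated with a short sketch. The function f(x, z) = x^2 · z + 3z is an arbitrary example, not one used elsewhere in this article. Each partial derivative is estimated by nudging one variable while holding the other constant, and the result is compared with the gradient worked out by hand.

```python
def f(x, z):
    return x ** 2 * z + 3.0 * z     # hypothetical two-variable function


def analytic_gradient(x, z):
    """Partial derivatives worked out by hand: (df/dx, df/dz)."""
    return (2.0 * x * z, x ** 2 + 3.0)


def numerical_gradient(func, x, z, eps=1e-6):
    # Change x, hold z constant.
    df_dx = (func(x + eps, z) - func(x - eps, z)) / (2 * eps)
    # Change z, hold x constant.
    df_dz = (func(x, z + eps) - func(x, z - eps)) / (2 * eps)
    return (df_dx, df_dz)


grad_exact = analytic_gradient(2.0, 5.0)
grad_approx = numerical_gradient(f, 2.0, 5.0)
print(grad_exact, grad_approx)
```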

Thus, training a neural network can be seen as a repeated application of the chain rule and partial derivatives.

Reference and image source – neural networks and backpropagation explained in a simple way

Big Picture

To recap the big picture: the starting point for the errors is the loss function. This figure shows the process of backpropagating errors following this schema:
Input -> Forward calls -> Loss function -> derivative -> backpropagation of errors. At each stage we get the deltas on the weights of that stage. For the figure below, w1 = 3 and w2 = 1

Reference and image source – neural networks and backpropagation explained in a simple way

Conclusion

If you can work with a few simple mathematical concepts, such as partial derivatives and the chain rule, you can gain a deeper understanding of the workings of deep learning networks.