The Mathematics of Neural Networks – A complete example



Neural networks are a method of artificial intelligence in which computers are taught to process data in a way similar to the human brain. They learn by being fed multiple instances of data as input, predicting an output, measuring the error between the actual answer and the machine’s answer, and then fine-tuning their weights to reduce this error.

Whilst a neural network might seem very complex, it is actually a clever utilisation of linear algebra and multivariate calculus. This article aims to go through a full iteration of the maths that underpins a neural network.

Assumptions & Pre-Knowledge

Neural networks require a solid understanding of college-level calculus and linear algebra. Great refreshers can be found on the Khan Academy website (linked in the previous sentence). An algorithm that is essential for this example is gradient descent, which is explained well in this video.

For a course that is more focused on neural networks, this video by Adam Dhalla teaches only the areas of calculus and linear algebra that are needed for this example.

Neural Network Basics

The example we will be using is:

Typically, the input layer (green) holds the input variables from a dataset, and the output layer (red) holds the neural network’s predicted values. Within the hidden and output layers, a weighted sum (denoted by s) is taken for each node, followed by the application of an activation function (denoted by a), which maps the sum into a fixed range.

The process of feeding data through a network from input to output is called forward propagation. The process of observing the error rate of the forward propagation and feeding the error backward to fine-tune the weights of the neural network is called back propagation. We forward propagate before we back propagate.

Forward Propagation

Note: I am using the sigmoid function as the activation function for this example (activation functions are used as mappings for inputs to be within certain ranges — in the case of sigmoid, the range is (0, 1)).
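As a minimal sketch of what a single node does during forward propagation, the snippet below computes a weighted sum s and then applies the sigmoid to obtain the activation a. The inputs and weights are hypothetical placeholders rather than the numbers from this example, and no bias term is assumed.

```python
import math

def sigmoid(s):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def node_output(inputs, weights):
    """Weighted sum followed by sigmoid (no bias term assumed)."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return s, sigmoid(s)

# Hypothetical inputs and weights, for illustration only.
s, a = node_output([1.0, 2.0], [0.5, -0.25])
print(s, a)  # 0.0 0.5
```

The same function can be reused for every node in the hidden and output layers, with the previous layer's activations passed in as the inputs.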

Hidden Layer

Hidden Layer 1:

Hidden Layer 2:

Hidden Layer 3:

Output Layer

Output Layer 1:

Output Layer 2:

Mean Squared Error (MSE) Calculation

The mean squared error is a measure of the difference between the expected and actual outputs. We are looking for a low MSE score, which indicates a better fit of the model to the data. We will be using gradient descent as a way to decrease this value.
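A small sketch of the calculation, assuming the squared differences are averaged over the output nodes. The expected and predicted values are hypothetical, and the exact scaling used in the original figure is not reproduced here, so treat the division by n as an assumption.

```python
def mean_squared_error(expected, actual):
    """Average of the squared differences between expected and actual outputs."""
    n = len(expected)
    return sum((e - a) ** 2 for e, a in zip(expected, actual)) / n

# Hypothetical expected and predicted outputs for a two-node output layer.
expected = [1.0, 0.0]
predicted = [0.75, 0.25]
mse = mean_squared_error(expected, predicted)
print(mse)  # 0.0625
```

With two output nodes, dividing by n gives a factor of ½, so the derivative of this error with respect to one output is simply (actual − expected).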

Back Propagation

Now that the predicted value is calculated, the neural network needs to adjust its weights based on the prediction error. This is done through back propagation.

For this example, consider a learning rate of 0.1.

The general mathematical idea behind back-propagation is to apply the chain rule to find the change in the error function over the change in a weight. Consider weight 7 for this example:

All three partial derivatives can be derived from our work so far.

Firstly,

Secondly,

And lastly,

Hence, putting all three terms together,

This formula can be applied to all weights connecting the hidden layer to the output layer.

Note: authors often write the equation using a delta, δ₀₁ = (a₀₁−expected₁) × a₀₁ × (1−a₀₁), so that the equation becomes ∂E₀₁ / ∂w₇ = δ₀₁ × aₕ₁
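Written out in full, the chain rule for w₇ can be sketched as below. This is a reconstruction from the delta form in the note above, not a copy of the original figures. It assumes the error differentiates to (a − expected) with respect to the output activation, and that s_{o1} is the weighted sum feeding output neuron 1.

```latex
\frac{\partial E}{\partial w_7}
  = \frac{\partial E}{\partial a_{o1}}
    \cdot \frac{\partial a_{o1}}{\partial s_{o1}}
    \cdot \frac{\partial s_{o1}}{\partial w_7}
  = (a_{o1} - \text{expected}_1) \cdot a_{o1}\,(1 - a_{o1}) \cdot a_{h1}
  = \delta_{o1} \cdot a_{h1}
```

The middle factor is the derivative of the sigmoid, and the last factor is a_{h1} because s_{o1} is linear in w₇, with a_{h1} as its coefficient.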

Now we have the gradient of the error function.

We want to apply gradient descent to get a new value of weight w₇. The new w₇ (we can symbolise this w₇’) can be obtained by subtracting the learning rate multiplied by the gradient from w₇.
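The update rule can be sketched in a few lines of Python. Only the learning rate of 0.1 comes from this article. The activations, expected value and starting weight are hypothetical, and the function names are chosen purely for illustration.

```python
def output_delta(actual, expected):
    """delta = (a - expected) * a * (1 - a), for a sigmoid output neuron."""
    return (actual - expected) * actual * (1.0 - actual)

def updated_weight(weight, learning_rate, delta, input_activation):
    """One gradient descent step: w' = w - learning_rate * delta * input_activation."""
    gradient = delta * input_activation
    return weight - learning_rate * gradient

# Hypothetical values; only the learning rate of 0.1 comes from the article.
a_o1, expected_1, a_h1, w7 = 0.5, 1.0, 0.5, 0.4
delta_o1 = output_delta(a_o1, expected_1)  # (0.5 - 1.0) * 0.5 * 0.5 = -0.125
w7_new = updated_weight(w7, 0.1, delta_o1, a_h1)
print(delta_o1, w7_new)
```

Because the prediction here is below the expected value, the delta is negative and the weight increases slightly, pushing the output upward on the next pass.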

So in general, for an output neuron:

Output Layer

Now, applying real numbers from the example to find new values of w₇ through w₁₂

Output Layer 1:

Output Layer 2:

The Hidden Layer (Derivation)

Finding a way to optimise the weights for the hidden layer requires a much longer derivation. None of this section is needed to follow the calculations, so feel free to skip it.

Consider updating w₁. In principle, the update for any weight follows the same style of formula, built from partial derivatives.

However this time around, we are a lot further away from the output neurons — hence, to find the values of the individual components of the RHS of this equation, there is going to be a lot more “chaining”…

For the first derivative:

Where:

Now, since we have calculated δ₀₁ and δ₀₂ previously (see the calculations made in the output layer section of this article), we can substitute the values of these deltas into the equation.

Hence, the derivative of the weighted sum with respect to the previous layer’s neurons is essentially just the corresponding weight.

Now, substituting these values in for the partial error term:

The value of ∂aₕ₁ / ∂sₕ₁ would just be the derivative of the sigmoid function

And the value of ∂sₕ₁ / ∂w₁ is the output of the previous layer neuron (which in this case, is the input layer neuron since there is only one hidden layer)

So putting it all together:
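A reconstruction of the combined result, based on the three factors described above, is sketched below. The symbols w_{h1→o1} and w_{h1→o2} are placeholder names for the two weights leaving hidden neuron 1, i₁ stands for the input neuron attached to w₁, and η is the learning rate. None of these names are taken from the original figures.

```latex
\frac{\partial E}{\partial w_1}
  = \underbrace{\left(\delta_{o1}\, w_{h1 \to o1} + \delta_{o2}\, w_{h1 \to o2}\right)}_{\partial E / \partial a_{h1}}
    \cdot \underbrace{a_{h1}\,(1 - a_{h1})}_{\partial a_{h1} / \partial s_{h1}}
    \cdot \underbrace{i_1}_{\partial s_{h1} / \partial w_1}

\delta_{h1} = \left(\delta_{o1}\, w_{h1 \to o1} + \delta_{o2}\, w_{h1 \to o2}\right) a_{h1}\,(1 - a_{h1}),
\qquad
w_1' = w_1 - \eta\, \delta_{h1}\, i_1
```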

I hope you can see what has occurred in these steps — a similar working process can be done to find the formula of all of the weights (which I won’t show).

But in essence, to find the value of an updated weight, first calculate the delta of the weight’s output neuron, and then subtract from the old weight the learning rate multiplied by that delta, multiplied by the value of the weight’s input neuron.

If that is difficult to understand, then the calculations below may help you see what is occurring numerically.

The Hidden Layer (Calculation)

Previous delta values calculated:
δ₀₁ = -0.0984
δ₀₂ = 0.1479
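To show how these two deltas feed into a hidden-layer update, here is a rough sketch. The deltas and the learning rate are the values quoted in this article. Every other number is a hypothetical stand-in for the figures that are not reproduced here, so the result will not match the article's updated weights.

```python
# Output deltas quoted in the article.
delta_o1 = -0.0984
delta_o2 = 0.1479
learning_rate = 0.1  # from the article

# Hypothetical values standing in for the figures that are not reproduced here.
w_h1_to_o1 = 0.3  # weight from hidden neuron 1 to output neuron 1
w_h1_to_o2 = 0.5  # weight from hidden neuron 1 to output neuron 2
a_h1 = 0.6        # activation of hidden neuron 1
i_1 = 1.0         # input feeding the weight being updated
w1 = 0.2          # current value of that weight

# Error arriving at hidden neuron 1 from both output neurons.
error_at_h1 = delta_o1 * w_h1_to_o1 + delta_o2 * w_h1_to_o2
# Multiply by the sigmoid derivative a * (1 - a) to get the hidden delta.
delta_h1 = error_at_h1 * a_h1 * (1.0 - a_h1)
# Gradient descent step on the hidden-layer weight.
w1_new = w1 - learning_rate * delta_h1 * i_1
print(delta_h1, w1_new)
```

The output weights used here should be the values from before the output-layer update, since the gradient is taken with respect to the network as it was during the forward pass.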

Hidden layer 1:

Hidden layer 2:

Hidden layer 3:

And we’re done!

Neural network with updated weights

Closing Thoughts

The above was a complete example of forward and back propagation for a neural network with 3 layers.

Typically neural networks are trained on multiple instances of data and can also be trained for multiple iterations (we call these epochs). Doing this will gradually increase/decrease the weights depending on the instance, until a neural network is optimised for a set of instances.
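As an illustration of repeating the process, the sketch below runs the forward and backward pass many times on a single instance and records the error each time. It assumes a hypothetical network with 2 inputs, 3 hidden neurons and 2 outputs, sigmoid activations and no bias terms. All weights and data are made up for the example.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(inputs, w_hidden, w_output):
    """Forward propagate through one hidden layer (no bias terms assumed)."""
    hidden = [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in w_hidden]
    output = [sigmoid(sum(h * w for h, w in zip(hidden, row))) for row in w_output]
    return hidden, output

def train_step(inputs, expected, w_hidden, w_output, lr):
    """One forward and backward pass; updates the weights in place.

    Returns the mean squared error measured before the update.
    """
    hidden, output = forward(inputs, w_hidden, w_output)
    # Output deltas: (a - expected) * a * (1 - a)
    d_out = [(a - e) * a * (1 - a) for a, e in zip(output, expected)]
    # Hidden deltas use the output weights as they were before this update.
    d_hid = [
        sum(d_out[k] * w_output[k][j] for k in range(len(d_out)))
        * hidden[j] * (1 - hidden[j])
        for j in range(len(hidden))
    ]
    for k, row in enumerate(w_output):
        for j in range(len(row)):
            row[j] -= lr * d_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= lr * d_hid[j] * inputs[i]
    return sum((e - a) ** 2 for e, a in zip(expected, output)) / len(expected)

# Hypothetical weights and a single hypothetical training instance.
w_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 hidden neurons x 2 inputs
w_output = [[0.7, 0.8, 0.9], [0.1, 0.2, 0.3]]    # 2 output neurons x 3 hidden
inputs, expected = [1.0, 0.5], [1.0, 0.0]

errors = [train_step(inputs, expected, w_hidden, w_output, lr=0.1) for _ in range(200)]
print(errors[0], errors[-1])  # the error after 200 passes is lower than at the start
```

A real training run would loop over many instances per epoch rather than a single one. This version keeps one instance so the falling error is easy to see.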

This process is laborious and maths-heavy, which is why we have computers do all of this work. Libraries like PyTorch abstract away many of the mathematical complexities and should definitely be used for any sort of model training.

Nonetheless, a complete walkthrough of the maths would definitely help reinforce the understanding needed when implementing this model.
