"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Dec. 28, 2020
Author - manisar
We saw in the last chapter how, by evaluating the cost function, we can measure how well (or badly) our network is doing on the training data.
As good as that may sound, we are still far from being able to use the constructs we have built so far for real predictions. The main problem is finding how the cost changes when we nudge each individual weight and bias, i.e. the partial derivatives of the cost w.r.t. every weight and bias in the network. Three tricks make this tractable.
All of these depend upon differentiation; the first two engage the chain rule of differentiation, and they actually have intuitive explanations as well.
First, a little bit on the chain rule itself. If f depends on x only through an intermediate function g, i.e. f = f(g(x)), then

df/dx = (df/dg) · (dg/dx)

This makes sense, doesn't it? For example, for a given value of x, if g changes twice as fast as x, and f changes three times as fast as g, then f changes 3 × 2 = 6 times as fast as x.
Further, if f depends on x through several intermediates g₁, g₂, …, gₙ, each of which depends on x, the contributions simply add up:

df/dx = Σᵢ (∂f/∂gᵢ) · (dgᵢ/dx)
We'll have some interesting intuitive analogies coming later in this chapter.
The first relation above is called the Chain Rule of Differentiation, and the second one, the Multivariate Chain Rule of Differentiation.
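Both relations are easy to check numerically. Below is a small Python sketch (the functions f and g are arbitrary smooth choices made up for illustration, not anything from the text) comparing the chain-rule value of df/dx against a finite-difference estimate:

```python
import numpy as np

# Arbitrary smooth functions, chosen only for illustration.
def g(x):        return x ** 2
def g_prime(x):  return 2 * x
def f(u):        return np.sin(u)
def f_prime(u):  return np.cos(u)

x = 0.7

# Chain rule: df/dx = f'(g(x)) * g'(x)
analytic = f_prime(g(x)) * g_prime(x)

# Central finite difference of the composite f(g(x))
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

assert abs(analytic - numeric) < 1e-6
```

The same check works for the multivariate rule: sum the per-intermediate products and compare against a finite difference of the full composite.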
Now, let's look into the three tricks we were talking about. But before that, we need to acquaint ourselves with the conventional terminology used in neural networks.
We shall use the following vocabulary henceforth; it is almost universally used for writing machine-learning formulae.
Our goal is to find the rate of change of the cost w.r.t. each weight and bias in the network, so that we can adjust them in the direction that decreases the cost.
There are both theoretical and practical difficulties. First, let's see how we can tackle the former using differentiation rules. Later, we'll see how things like stochastic gradient descent can help us overcome the practical difficulties.
Consider this zoomed-in portion of a neural network.
In our zoomed-in picture, we are looking at just one layer and, within it, a single weight w feeding into one of its nodes.
Applying the chain rule of differentiation, the sensitivity of the cost to the weight w factors into a product: how the cost responds to the output of the node that w feeds into, times how that output responds to w itself.
This is our first Mentos moment! The sensitivity of the cost to a weight buried deep inside the network is just a product of simpler, local sensitivities.
This is actually something quite intuitive, and we apply it in our everyday lives. Here is an analogy.
Consider this scenario. You are trying to come up with a delicious smoothie recipe that has various ingredients, one of which is sugar, though it is not used directly. You concentrate the sugar, caramelize it, add some flavors and herbs, and then add this intermediate product to your recipe.
The perfectionist and keen observer in you finds (after multiple experiments) that the sweetness of the final drink is simply not proportional to the amount of sugar you start with. Instead, the sweetness of the drink is very sensitive to the amount of sugar when there is little or no sugar, but the sensitivity gradually decreases. You happily use this finding to determine the correct amount of sugar.
But the next time you go to buy sugar, you find that your favorite brand has launched a new type of sugar, claimed to be five times sweeter than the original, and the older type is not available. So you get this new type of sugar. Now the question is: do you have to repeat those experiments to re-establish the relationship? The answer is no. The right thing to do, as is natural to think as well, is to just multiply your sensitivity results by 5. That's it. The rate at which the sweetness of your smoothie changes w.r.t. the amount of sugar will now simply be 5 times what it used to be.
It's the same thing that is happening above. The only way the weight can influence the cost is through the intermediate quantities it feeds into, so its overall sensitivity is just the product of the sensitivities along that path.
Let's continue. In the equation above, the sensitivity of the cost to the weight w splits into three factors: how the cost responds to the network's final output, how the final output responds to the output a of the node that w feeds into, and how a responds to w itself:

∂C/∂w = (∂C/∂a_out) · (∂a_out/∂a) · (∂a/∂w)

Now, the node computes a = σ(z), where z is its weighted input, so ∂a/∂w = σ′(z) · a_in, with a_in being the activation flowing in through w; this needs nothing more than a forward pass. Also, ∂C/∂a_out only requires comparing the network's output with the desired output from the training data.
Thus, the first and the last term of the three factors above are easily found out when we have the training data. Let's see what we can do for the middle term:
Let's look at the zoomed-in picture again with slightly more detail. Once again, the smaller circles represent the activation function σ applied inside each node.
The output from the out node depends on the outputs of all the nodes of the previous layer, each entering through its own weight. So, by the multivariate chain rule, the sensitivity of the cost to the output a_j of a node in one layer is a weighted sum over the nodes k of the next layer:

∂C/∂a_j = Σ_k w_kj · σ′(z_k) · (∂C/∂a_k)

The RHS of this equation is bubbling with insights. If you look at the definition of the error of a node, i.e. the sensitivity of the cost to that node's weighted input, the sum on the right is exactly the errors of the next layer flowing backwards through the very weights that carried the activations forward.
Substituting this value of ∂C/∂a_j into our three-factor formula expresses everything in terms of the layer ahead.
This is our second Mentos moment! This gives us a way to find the errors of every node in a layer once we know the errors of the nodes in the next layer.
This is aptly called backpropagation of errors, and it is quite symmetrical to the forward propagation that normally happens in a neural network. In the latter, we calculate the output of the nodes in a layer by virtue of knowing the output of the nodes in the previous layer. In the former (backpropagation), we find the errors in the nodes of a layer by virtue of knowing the errors of the nodes of the next layer. Note that in both, we look at only the weights connected to the node in question.
Using matrix notation↕, we can drop the index j and write this relation for a whole layer at once: the error vector of a layer is the transpose of the next layer's weight matrix multiplied with the next layer's error vector, taken Hadamard (element-wise) product with σ′ of the layer's weighted inputs.
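In code, this backward step is essentially a one-liner. Here is a minimal NumPy sketch, assuming a sigmoid activation and made-up layer sizes (none of these numbers come from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Backpropagate errors one layer:
#   delta_l = (W_next^T @ delta_next) * sigmoid'(z_l)
# Shapes are illustrative: 4 nodes in layer l, 3 nodes in layer l+1.
rng = np.random.default_rng(0)
W_next = rng.standard_normal((3, 4))      # weights from layer l into layer l+1
z_l = rng.standard_normal((4, 1))         # weighted inputs of layer l
delta_next = rng.standard_normal((3, 1))  # known errors of layer l+1

sigma_prime = sigmoid(z_l) * (1 - sigmoid(z_l))   # derivative of sigmoid
delta_l = (W_next.T @ delta_next) * sigma_prime   # * is the Hadamard product

assert delta_l.shape == (4, 1)
```

Note how the transpose reuses the same weight matrix that carried activations forward, just read column-wise instead of row-wise.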
Again, we can give a somewhat intuitive explanation for the above. Continuing with the earlier analogy, suppose that it was not just you: two of your friends were also secretly working on their smoothie recipes in a similar way. And guess what, they both come to your place and show their achievements to your dad just when you were doing the same. Your dad comes up with a brilliant idea of his own: he proposes to mix all three drinks together to get a super smoothie.
Now the question arises: if we want to find out how the amount of sugar in the super smoothie changes its sweetness, one way would be to get rid of the sugar in the three individual smoothies, add it directly to the super smoothie, experiment a few times, and read off the results. But there is a simpler way. We can find out how the sweetness of the super smoothie varies w.r.t. the amount of each individual smoothie, multiply each of these terms by the sensitivity of sweetness we already established for that drink w.r.t. its sugar, and add the three products. Slightly convoluted, but we can somewhat see that it will work.
It's the same thing we are doing in the formula above.
We are now close to being able to find the rate of change of the cost w.r.t. every weight and bias in the network.
The formulae we have derived so far are completely general: they do not depend upon particular choices of specific functions (except for general restrictions regarding smoothness, differentiability, etc.). Now, in order to refine our formulae further, for the first time we need to bring in a specific cost function. Our ideas still maintain their general nature, but now we are stepping into the arena of providing practical solutions. Ok, so the most common choice of cost function is the quadratic one, given (for a single training example) by C = ½ Σⱼ (yⱼ − aⱼ)², where yⱼ is the desired output of output node j and aⱼ its actual output.
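The quadratic cost and its derivative w.r.t. each output activation can be checked against finite differences. A quick sketch (the vectors a and y below are made up for illustration):

```python
import numpy as np

# Quadratic cost C = 0.5 * sum((a - y)^2) and its gradient w.r.t. the output a.
def quadratic_cost(a, y):
    return 0.5 * np.sum((a - y) ** 2)

def quadratic_cost_grad(a, y):
    return a - y            # dC/da_j = a_j - y_j

a = np.array([0.8, 0.2, 0.1])   # actual network output (made up)
y = np.array([1.0, 0.0, 0.0])   # desired output (made up)

# Check the gradient numerically, component by component.
h = 1e-6
for j in range(len(a)):
    e = np.zeros_like(a)
    e[j] = h
    num = (quadratic_cost(a + e, y) - quadratic_cost(a - e, y)) / (2 * h)
    assert abs(num - quadratic_cost_grad(a, y)[j]) < 1e-6
```

The gradient being simply (a − y) is what makes the quadratic cost such a convenient starting point for the output-layer errors.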
This is the third Mentos moment! Now we have everything to kick-start our calculations. We start with finding the errors of the output layer directly: for the quadratic cost, the sensitivity of the cost to an output activation is simply (aⱼ − yⱼ), the difference between the actual and the desired output. From there, backpropagation hands us the errors of every earlier layer.
This completes backpropagation. We then use the values of the partial derivatives so obtained in gradient descent, nudging every weight and bias a small step against its gradient, and run forward propagation again with the updated values, repeating the cycle until the cost stops improving.
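One such weight update can be sketched as follows. The learning rate, weights, errors, and activations are all made-up numbers, and the sketch assumes the common convention that the layer's error vector delta already includes the σ′ factor, so that dC/dW = delta · a_prevᵀ:

```python
import numpy as np

# One gradient-descent update for a single weight matrix:
#   W <- W - eta * dC/dW,  with  dC/dW = delta @ a_prev^T
eta = 0.5                                # learning rate (made up)
W = np.array([[0.2, -0.4],
              [0.7,  0.1]])              # current weights (made up)
delta = np.array([[ 0.3],
                  [-0.1]])               # errors of the current layer
a_prev = np.array([[1.0],
                   [0.5]])               # activations of the previous layer

grad_W = delta @ a_prev.T                # outer product: one entry per weight
W_new = W - eta * grad_W

assert np.allclose(W_new, [[0.05, -0.475], [0.75, 0.125]])
```

The same step, with delta replaced by the bias gradient, updates the bias vector.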
We have learnt the mathematical notation and laid the theoretical foundation for materializing a neural network. In particular, we have found a way of applying gradient descent in a multilayered neural network and adjusting the weights, so that we can now think about predicting.
But not so fast! We need to look at some practical problems and their solutions such as stochastic gradient descent, and vanishing or exploding gradients. This we will do in the next chapter.
This is an ongoing effort which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
Supplement to section C2.3 Machine Learning Formal Notations↕.
A matrix (plural matrices) is a rectangular table or array of numbers, symbols, or expressions, arranged in rows and columns. For example, the dimension of the matrix below is 2 × 3 (read "two by three"), because there are two rows and three columns:
The individual items in a matrix are called its elements or entries.
It's usual to write aᵢⱼ for the entry in the i-th row and j-th column of a matrix A.
A matrix with only one row is called a row vector, and one with only one column a column vector; the latter is quite commonly used for representing a mathematical vector.
The multiplication of two matrices A and B is defined only when the number of columns of A equals the number of rows of B; if A is m × n and B is n × p, the product AB is m × p.
That is, for finding the value of the entry in row i and column j of AB, we take the dot product of the i-th row of A with the j-th column of B: (AB)ᵢⱼ = Σₖ aᵢₖ bₖⱼ.
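A tiny NumPy example of this rule, with arbitrary matrices:

```python
import numpy as np

# (AB)[i][j] is the dot product of row i of A with column j of B.
A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 x 3
B = np.array([[ 7,  8],
              [ 9, 10],
              [11, 12]])         # 3 x 2

C = A @ B                        # result is 2 x 2

# Element (0, 1) by hand: 1*8 + 2*10 + 3*12 = 64
assert C[0, 1] == 64
assert np.array_equal(C, np.array([[58, 64], [139, 154]]))
```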
With the above in mind, we can denote all the weights feeding into the nodes of a layer by a single weight matrix, with one row per node of that layer: the j-th row holds the weights connected to the j-th node.
Taking the transpose of a matrix means interchanging its rows and columns. E.g., in the weight matrix shown above, currently the first row represents all the weights connected to the first node of the layer; in the transpose, it is the first column that does so.
On the same lines, we define a bias vector for each layer, whose j-th entry is the bias of the j-th node, and an activation vector holding the outputs of the layer's nodes.
Finally, applying a function to a whole matrix is simply defined as producing a matrix with the given function applied to each of its elements individually. That is, the (i, j) entry of σ(M) is σ(mᵢⱼ).
The above allows us to get the activation equation for (each node of) a layer in one compact line: the activation vector of a layer is σ applied to (the weight matrix times the previous layer's activation vector, plus the bias vector).
Also, there is another kind of product, defined for two matrices of the same dimensions, called the Hadamard product. This is nothing but the element-wise product of the two matrices.
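Putting the appendix together, here is a minimal NumPy sketch of the layer activation equation and the Hadamard product (all shapes and numbers are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Activation of a layer in matrix form: a_next = sigma(W @ a + b),
# with sigma applied element-wise to the resulting vector.
W = np.array([[ 0.1,  0.4],
              [-0.3,  0.2],
              [ 0.5, -0.1]])          # 3 x 2: three nodes, two inputs each
a = np.array([[1.0], [0.5]])          # previous layer's activations (2 nodes)
b = np.array([[0.0], [0.1], [-0.2]])  # biases of the three nodes

a_next = sigmoid(W @ a + b)
assert a_next.shape == (3, 1)

# Hadamard (element-wise) product: plain * on same-shaped arrays.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
assert np.array_equal(x * y, np.array([4.0, 10.0, 18.0]))
```

In NumPy, `@` is matrix multiplication while `*` on same-shaped arrays is exactly the Hadamard product, which is why the backpropagation step earlier reads so cleanly in code.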