"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Dec. 28, 2020
Author - manisar
We saw in the last chapter how, by evaluating $\partial{C}/\partial{w_i}$ and using these derivatives for gradient descent, we may be able to find the values of the $w_i$'s for which $C$ is minimum. This will materialize our function, and we can then use it to make predictions.
As good as it may sound, we are still far from being able to use the constructs we have built so far for real predictions. The main problem is finding $\partial{C}/\partial{w_i}$. Though easier than solving $\partial{C}/\partial{w_i} = 0$, it is still overwhelmingly difficult with our model, which has a large number of coefficients woven into multi-layered composite functions, where each function is now not even linear anymore. There are three interesting phenomena, or call them tricks, that will get our job done.
All of these depend upon differentiation, the first two engage the chain rule of differentiation, and they actually have intuitive explanations as well.
First, a little bit on the chain rule itself. If there exists a function $g(x)$ such that $x$ appears in $g$ only in a particularly modified way, say as specified by $f(x)$, we write $g = g(f(x))$. In such cases, it can be easily shown that the rate of change of $g$ w.r.t. $x$ = rate of change of $g$ w.r.t. $f$ $\times$ rate of change of $f$ w.r.t. $x$. That is to say, ${dg\over{dx}} = {dg\over{df}}.{df\over{dx}}$.
This makes sense, doesn't it? For example, for a given value of $x$, $f$ might be changing very rapidly w.r.t. $x$, but if $g$ is changing very slowly w.r.t. $f$, we expect the rate of change of $g$ w.r.t. $x$ to equal the former suppressed by the latter.
Further if in $g$, it's not just $f$ that uses $x$, but a bunch of $f_1$, $f_2$, $f_3$ etc., then ${dg\over{dx}} = {\partial{g}\over{\partial{f_1}}}.{\partial{f_1}\over{\partial{x}}} + {\partial{g}\over{\partial{f_2}}}.{\partial{f_2}\over{\partial{x}}} +$...
We'll have some interesting intuitional analogies coming later in this chapter.
The first relation above is called the Chain Rule of Differentiation, and the second one, the Multivariate Chain Rule of Differentiation.
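For a quick feel of both rules, here is a small worked example of my own (separate from the network derivation that follows). Take $g(f) = f^2$ with $f(x) = \sin x$:
$$ \frac{dg}{dx} = \frac{dg}{df}\cdot\frac{df}{dx} = 2f\cdot\cos x = 2\sin x\cos x, $$
which is exactly what we get by differentiating $g = \sin^2 x$ directly. And if $g = f_1 f_2$ with $f_1 = x^2$ and $f_2 = \sin x$, the multivariate rule gives
$$ \frac{dg}{dx} = \frac{\partial{g}}{\partial{f_1}}\cdot\frac{df_1}{dx} + \frac{\partial{g}}{\partial{f_2}}\cdot\frac{df_2}{dx} = f_2\cdot 2x + f_1\cdot\cos x = 2x\sin x + x^2\cos x, $$
which is just the familiar product rule.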
Now, let's look into the three tricks we were talking about. But before that, we need to acquaint ourselves with the conventional terminology used in neural networks.
We shall use the following vocabulary henceforth, which is almost universally used for depicting machine learning formulae.
Our goal is to find the $w_i$'s for which $C$ is minimum. And for that, we have to find $\partial{C}/\partial{w_i}$ iteratively. First, we will find the values of these partial derivatives at random values of the $w_i$'s, and then, using the values of the derivatives, we will modify the values of the $w_i$'s slightly as per the formula: $$ w_i \rightarrow w^{new}_i = w_i - \eta \frac{\partial{C}}{\partial{w_i}} $$
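As a rough sketch of what this iterative update looks like in code (a minimal illustration; the function name `grad_C` and the toy cost below are assumptions of mine, not something defined in the text):

```python
import numpy as np

def gradient_descent_step(w, grad_C, eta=0.1):
    """One update of the weights: w -> w - eta * dC/dw."""
    return w - eta * grad_C(w)

# Toy usage: minimize C(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w = np.array([0.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda w_: 2 * (w_ - 3), eta=0.1)
print(w)  # approaches [3.]
```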
There are both theoretical and practical difficulties. First, let's see how we can tackle the former using differentiation rules. Later, we'll see how things like stochastic gradient descent can help us overcome the practical difficulties.
Consider this zoomed-in portion of a neural network. $a$ is the output of node $M$; it gets multiplied by the weight $w$ and goes into a function $z$. Here we have shown the node $N$ using two circles in the middle, one containing $z$ and one containing $\sigma$. The last circle, containing $C$, is not a node of the neural network. Remember:
$$ z_i^l=\sum_{j}w_{ij}^la_j^{l-1} $$
In our zoomed-in picture, we are looking at just one layer and at $z$ at only one node, so we can ignore the superscript $l$ and the subscript $i$ for the time being and simply write $ z=\sum_{j}w_{j}a_j$, where $a_j$ is understood to be the $j$th activation of the previous layer. One of these $w_j$'s, together with the corresponding $a_j$, is the pair $w$ and $a$ shown above. It is important to note that this $z$ shown above is the only function that uses this $w$. Then, this $z$ is fed into another function $\sigma(z)$, which is the output of node $N$. This output is eventually used by the $C$ function (there may be many layers between $N$ and $C$).
Applying the chain rule of differentiation:$$\begin{align}
\frac{\partial{C}}{\partial{w}} & = \frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{w}} \\
& = \frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}}.\frac{\partial{z}}{\partial{w}}
\end{align}$$
$\frac{\partial{z}}{\partial{w}}$ is simply $a$ (look at the definition of $z$ above). Hence:
$$\begin{align}
\frac{\partial{C}}{\partial{w}} & = \frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}}.a \label{1} \tag{1}\end{align}$$
This is our first Mentos moment! The sensitivity of $C$ w.r.t. any $w$ is directly proportional to the output of the node ($a$) this $w$ is coming from. Although $C$ uses $w$ through so many intermediary functions and combinations, this basic relationship holds true.
This is actually something quite intuitive to us, and we apply it in our everyday lives. Here is an analogy.
Consider this scenario. You are trying to come up with a delicious smoothie recipe that has various ingredients, one of which is sugar, but not used directly: you concentrate the sugar, caramelize it, add some flavors and herbs to it, and then add this intermediary product to your recipe.
The perfectionist and keen observer in you finds (after multiple experiments) that the sweetness of the final drink is simply not directly proportional to the amount of sugar you start with. Instead, the sweetness of the drink is very sensitive to the amount of sugar when there is no or very little sugar, but gradually the sensitivity decreases. And you are happily using this finding of yours in determining the correct amount of sugar.
But the next time you go to buy sugar, you find that your favorite brand has launched a new type of sugar that it claims to be 5 times sweeter than the original type, and the older type is not available. So you get this new type of sugar. Now the question is: do you have to repeat those experiments in order to re-establish the relationship? The answer is no. The right thing to do, as is natural to think as well, is to just multiply your sensitivity results by 5. That's it. The rate at which the sweetness of your smoothie changes w.r.t. the amount of sugar will now simply be 5 times what it used to be.
It's the same thing that is happening above. The only way $w$ is getting itself into the system is through the multiplicative factor of $a$. Therefore, it's no surprise that the sensitivity of $C$ w.r.t. $w$ should have a factor of $a$ multiplied into it.
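If you want to see eqn. $(1)$ in action numerically, here is a tiny check of my own, for a single weight feeding a sigmoid node whose output goes straight into a quadratic cost (all names and values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, w, y = 0.8, 0.5, 1.0                  # incoming activation, weight, target
z = w * a
out = sigmoid(z)
cost = lambda w_: 0.5 * (y - sigmoid(w_ * a)) ** 2

# Chain-rule value: dC/dsigma * dsigma/dz * a
dC_dsigma = out - y                      # derivative of 0.5*(y - out)^2 w.r.t. out
dsigma_dz = out * (1 - out)              # sigmoid'(z)
analytic = dC_dsigma * dsigma_dz * a

# Finite-difference estimate of dC/dw
eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(analytic, numeric)                 # the two values agree closely
```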
Let's continue. In the equation $\frac{\partial{C}}{\partial{w}} = a.\frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}}$, how do we look at the last two factors on the right hand side?
$\frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}}$ is actually $\frac{\partial{C}}{\partial{z}}$. So, we can say that this is basically how $C$ changes w.r.t. the total input $z$ at the node $w$ is connecting to. So, summarily, if we name the two nodes on the left and right of $w$ as in and out nodes respectively:
$$ \begin{align}& \text{Rate of change of } C \text{ w.r.t. any } w \\ & = \text{Output of the }\textit{in}\text{ node} \\
& \times \text{Rate of change of }C\text{ w.r.t. total input at the }\textit{out}\text{ node}
\end{align}$$
Let's look at the second factor again: $\frac{\partial{C}}{\partial{z}} = \frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}}$. This is actually used heavily in further analysis, and has been given its own identity: the error at the out node (the node whose output is $a_{out}$), denoted as $\delta$.
Now, since $\sigma$ is simply the output of the out node $N$ above, $\partial{C}/\partial{\sigma}$ is nothing but $\frac{\partial{C}}{\partial{a_{out}}}$.
Also, $\frac{\partial{\sigma}}{\partial{z}}$ can be written as $\sigma^\prime(z)$, i.e. the derivative of $\sigma$ evaluated at $z_{out}$.
Thus, we have:$$\begin{align}
\delta & = \text{Rate of change of }C\text{ w.r.t. total input at the }\textit{out}\text{ node} \\
& = \frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}}\\
& = \frac{\partial{C}}{\partial{a_{out}}}.\sigma^\prime(z_{out})
\end{align}$$
Also,$$ \begin{align}
\frac{\partial{C}}{\partial{w}} = & \text{Rate of change of } C \text{ w.r.t. any } w\\
=\, & a_{in}.\frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}} \\[6pt]
=\, & \boxed{a_{in}.\delta_{out}} \label{1a} \tag{1a} \\[6pt]
=\, & a_{in}.\frac{\partial{C}}{\partial{a_{out}}}.\sigma^\prime(z_{out})
\end{align} $$
Thus,
$$ \boxed{\begin{align}& \text{Rate of change of } C \text{ w.r.t. any } w \\ & = \text{Output of the }\textit{in}\text{ node} \\
& \times \text{Rate of change of }C\text{ w.r.t. output of the }\textit{out}\text{ node} \\
& \times \text{how rapidly }\sigma\text{ is changing at the }\textit{out}\text{ node}
\end{align}}$$
The first and the last of the three factors above are easily found when we have the training data. Let's see what we can do for the middle term: $$
\frac{\partial{C}}{\partial{a_{out}}} = \text{Rate of change of }C\text{ w.r.t. output of the }\textit{out}\text{ node} $$
Let's look at the zoomed-in picture again, with slightly more detail. Once again, the smaller circles containing the $z$'s are actually part of the nodes shown just after them.
The output from the out node ($N$), namely $a_{N}$ or $a_{out}$, shown as $o$ in the figure for the sake of clarity, is fed to multiple nodes and is actually used by the activation formulae of all the nodes in layer $l+1$. Thus, $a_{out}$ is seen by $C$ exclusively through all the $\sigma_i$'s (in layer $l+1$), as shown in the figure. This realization allows us to summon the multivariate chain rule of differentiation:$$\begin{align}
\frac{\partial{C}}{\partial{a_{out}}} & = \frac{\partial{C}}{\partial{\sigma_1^{l+1}}} \frac{\partial{\sigma_1^{l+1}}}{\partial{a_{out}}} + \frac{\partial{C}}{\partial{\sigma_2^{l+1}}} \frac{\partial{\sigma_2^{l+1}}}{\partial{a_{out}}} + \cdots \\[4pt]
& = \frac{\partial{C}}{\partial{\sigma_1^{l+1}}} \frac{\partial{\sigma_1^{l+1}}}{\partial{z_1}} \frac{\partial{z_1}}{\partial{a_{out}}} + \frac{\partial{C}}{\partial{\sigma_2^{l+1}}} \frac{\partial{\sigma_2^{l+1}}}{\partial{z_2}} \frac{\partial{z_2}}{\partial{a_{out}}} + \cdots \\[4pt]
& = \left(\frac{\partial{C}}{\partial{a_1^{l+1}}} \sigma_1^{\prime}(z_1)\right) \frac{\partial{z_1}}{\partial{a_{out}}} + \left(\frac{\partial{C}}{\partial{a_2^{l+1}}} \sigma_2^{\prime}(z_2)\right) \frac{\partial{z_2}}{\partial{a_{out}}} + \cdots \ \text{(just changed symbols)} \\
\end{align}$$
The RHS of this equation is bubbling with insights. If you look at the definition of $\delta$ above, the terms in the parentheses are nothing else but the $\delta$'s at the corresponding nodes. And terms like $\frac{\partial{z_1}}{\partial{a_{out}}}$ are nothing else but $w^{l+1}_1$, etc. Using these two substitutions, we get:
$$\frac{\partial{C}}{\partial{a_{out}}} =\delta_1^{l+1}.w^{l+1}_1 + \delta_2^{l+1}.w^{l+1}_2 + \cdots $$
Substituting this value of $\frac{\partial{C}}{\partial{a_{out}}}$ into the definition of $\delta$: $$\begin{align}
\delta_{out} & = \frac{\partial{C}}{\partial{a_{out}}}.\sigma^\prime(z_{out}) \\[4pt]
& = \left(\delta_1^{l+1}.w^{l+1}_1 + \delta_2^{l+1}.w^{l+1}_2 + \cdots\right).\sigma^\prime(z_{out}) \\[6pt]
\text{Or, }\ \delta^l_i &= \left(\delta_1^{l+1}.w^{l+1}_{1i} + \delta_2^{l+1}.w^{l+1}_{2i} + \cdots\right).\sigma^\prime(z^l_i)\
\llap{\mathrel{\boxed{\phantom{\delta^l_i = \left(\delta_1^{l+1}.w^{l+1}_{1i} + \delta_2^{l+1}.w^{l+1}_{2i} + \cdots\right).\sigma^\prime(z^l_i)}}}} \label{2} \tag{2}
\end{align}$$
This is our second Mentos moment! It gives us a way to find $\delta^l_i$ if we know all the $\delta^{l+1}_j$'s: the error ($\delta$) at any node is the weighted sum of the errors at the nodes in the next layer, multiplied by how fast $\sigma$ is changing at the given node. Thus, if we can somehow find the values of the $\delta$'s in the last layer, we can start moving back and keep on finding the $\delta$'s for the inner nodes, and thus $\frac{\partial{C}}{\partial{w}} = a_{in}.\delta_{out}$ for any $w$.
This is aptly called backpropagation of errors, and it is quite symmetrical to the forward propagation that normally happens in a neural network. In the latter, we calculate the output of the nodes in a layer by virtue of knowing the output of the nodes in the previous layer. In the former (backpropagation), we find the errors at the nodes of a layer by virtue of knowing the errors at the nodes of the next layer. Note that in both, we look at only the weights connected to the node in question.
Using matrix notation, we can drop the index $i$, and can write Eqn. $(2)$ succinctly for all the nodes in layer $l$: $$
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma^{\prime}(z^l) \label{2a} \tag{2a}
$$
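In code, one step of this backward pass might look like the following sketch (a minimal illustration assuming a sigmoid activation; the names `W_next`, `delta_next`, and `z_l` are my own):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def backprop_step(W_next, delta_next, z_l):
    """Eqn (2a): delta^l = ((w^{l+1})^T delta^{l+1}) Hadamard sigma'(z^l)."""
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)
```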
Again, we can give eqn. $(2a)$ a somewhat intuitive explanation. Continuing the analogy presented above, now suppose that it was not just you: two of your friends were also secretly working on their smoothie recipes in a similar way. And guess what, they both come to your place and show their achievements to your dad just when you were doing the same. Your dad comes up with a brilliant idea of his own: he proposes to mix all three drinks together in order to get a super smoothie.
Now the question arises: if we were to find out how the amount of sugar in the super smoothie changes its sweetness, one way would be to get rid of the sugar from all three individual smoothies, add it directly to the super smoothie, experiment a few times, and get to the results. But there is a simpler way. We can just find out how the sweetness of the super smoothie varies w.r.t. the amount of each of the individual smoothies, multiply each of these terms with the sensitivity of sweetness we got for the corresponding drink w.r.t. the sugar in it, and add the three terms together. Slightly convoluted, but we can somewhat see that it will work.
It's the same thing we are doing in the formula above.
We are close to being able to find $\partial{C}/\partial{w}$. The last thing remaining is to find the starting point of the backpropagation of errors. For this, we will have to go to the last layer, the output layer. From the definition of $\delta$, we have, for any node, $\delta = \frac{\partial{C}}{\partial{a}}.\sigma^\prime(z)$, which, for any node $j$ in the last layer $L$, becomes $\delta^L_j = \frac{\partial{C}}{\partial{a^L_j}}.\sigma^\prime(z^L_j)$.
The formulae we have derived so far are completely general; they do not depend upon particular choices of specific functions (except for general restrictions such as smoothness, differentiability, etc.). Now, in order to refine our formulae further, we need for the first time to bring in a specific cost function. Our ideas still maintain their general nature, but now we are stepping into the arena of providing practical solutions. Ok, so the most common choice for the cost function is the quadratic one, given by $C = \frac{1}{2}\sum_j({y_j - \hat{y}_j})^2 = \frac{1}{2}\sum_j({y_j - a^L_j})^2$. This means that $\partial{C}/\partial{a^L_j} = (a^L_j - y_j)$. With $\partial{C}/\partial{a^L_j}$ known, we can find $\delta^L_j$, and hence, for all the $\delta$'s in the last layer, we have:
$$\delta^L = \nabla_{a}C \odot \sigma^{\prime}(z^L) \label{3} \tag{3} $$
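With the quadratic cost and a sigmoid output layer, these starting errors might be computed as below (a sketch of mine; here $\nabla_{a}C = a^L - y$):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def output_delta(a_L, y, z_L):
    """Eqn (3) with the quadratic cost: delta^L = (a^L - y) Hadamard sigma'(z^L)."""
    return (a_L - y) * sigmoid_prime(z_L)
```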
Eqn. $(3)$ is our third Mentos moment! Now we have everything needed to kick-start our calculations. We start by finding $\delta^L$, i.e. the errors at the last-layer nodes. For that we need $a^L$, which we calculate using random values of the $w$'s. We also need to decide on the activation functions, which can be different for different layers (or even nodes). Then we move backward, finding the $\delta$'s using eqn. $(2)$. And finally, we use these $\delta$'s to find $\partial{C}/\partial{w}$ using eqn. $(1a)$ (calculating the other $a^l$'s along the way).
This completes backpropagation. We then perform gradient descent and start forward propagation again: we use the values of $\partial{C}/\partial{w}$ to find new values of the $w$'s using the eqn. in the section C2.4 What are We Trying to Do? With the new values of the $w$'s, we eventually find new values of the $a^L_j$'s, which equips us to start backpropagation again!
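Putting the pieces together, one full iteration (forward pass, backpropagation, gradient-descent update) might look like the sketch below. It is only a minimal arrangement of eqns $(1a)$, $(2a)$ and $(3)$ under my own naming, with sigmoid activations throughout; `weights` and `biases` are lists holding $w^l$ and $b^l$ for each layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def train_step(weights, biases, x, y, eta=0.5):
    # Forward pass: store every z^l and a^l.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Backward pass: eqn (3) at the output, then eqn (2a) layer by layer.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    deltas = [delta]
    for W_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = (W_next.T @ delta) * sigmoid_prime(z)
        deltas.insert(0, delta)

    # Gradient-descent update: dC/dw = a_in * delta_out (eqn (1a)).
    for l in range(len(weights)):
        weights[l] = weights[l] - eta * np.outer(deltas[l], activations[l])
        biases[l] = biases[l] - eta * deltas[l]
    return weights, biases
```

Calling `train_step` repeatedly on the training examples is then exactly the iterative loop described above.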
We have learnt the mathematical notation and laid the theoretical foundation for materializing a neural network. In particular, we have found a way of applying gradient descent in a multi-layered neural network and adjusting the weights, so we can now think about predicting.
But not so fast! We need to look at some practical problems and their solutions such as stochastic gradient descent, and vanishing or exploding gradients. This we will do in the next chapter.
This is an ongoing effort which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
Supplement to section C2.3 Machine Learning Formal Notations.
A matrix (plural matrices) is a rectangular table or array of numbers, symbols, or expressions, arranged in rows and columns. For example, the dimension of the matrix below is 2 × 3 (read "two by three"), because there are two rows and three columns:
$$ \left[\begin{matrix}1 & 9 & -13\\
20 & 5 & -6
\end{matrix}\right]$$
The individual items in an $m\times n$ matrix $A$, often denoted by $a_{i,j}$, where $i$ and $j$ usually vary from 1 to $m$ and $n$, respectively, are called its elements or entries. See Wikipedia... for a detailed introduction.
It's usual to write $a_{i,j}$ as $a_{ij}$. There are well-defined rules for doing all sorts of mathematical operations with matrices.
A matrix with only one row is called a row vector, and one column is called a column vector, the latter being quite commonly used for representing a mathematical vector.
The multiplication of two matrices $A$ and $B$ is defined in the following way. $$[AB]_{ij} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{in}b_{nj} = \sum^n_{r=1}a_{ir}b_{rj}$$
That is, for finding the value of the $ij$th element of the product, we take the $i$th row of the first matrix and the $j$th column of the second matrix, multiply the first elements with each other, the second elements with each other, and so on up to the last elements, and add all the results together. This means that multiplication is only possible if the number of columns in the first (left) matrix is equal to the number of rows of the second (right) matrix, and it also means that matrix multiplication is not commutative.
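For example, multiplying the 2 × 3 matrix shown earlier by a 3 × 2 matrix (a quick illustration using numpy, chosen purely for convenience):

```python
import numpy as np

A = np.array([[1, 9, -13],
              [20, 5, -6]])   # 2 x 3
B = np.array([[1, 0],
              [0, 1],
              [2, 3]])        # 3 x 2

print(A @ B)
# [[-25 -30]
#  [  8 -13]]   e.g. top-left = 1*1 + 9*0 + (-13)*2 = -25
# B @ A, on the other hand, is a 3 x 3 matrix, so AB != BA here.
```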
With the above in mind, we can denote all the weights in layer $l$ of the neural network by using just one matrix:
$$ w^l = \left[\begin{matrix}
w^l_{11} & w^l_{12} & \cdots & w^l_{1k} & \cdots\\
w^l_{21} & w^l_{22} & \cdots & w^l_{2k} & \cdots\\
\cdots & \cdots & \cdots & \cdots & \cdots\\
w^l_{j1} & w^l_{j2} & \cdots & w^l_{jk} & \cdots\\
\cdots & \cdots & \cdots & \cdots & \cdots
\end{matrix}\right]$$
Taking the transpose of a matrix means interchanging its rows and columns. E.g. in the weight matrix shown above, the first row currently represents all the weights connected to the first node of layer $l$, and so on. Its transpose will have its first row representing all the weights coming from the first node of layer $l-1$, and so on.
On the same lines, we define a bias vector $b^l$ for the layer $l$.
Finally, applying a function to a whole matrix is simply defined as the matrix obtained by applying the given function to each of its elements individually. That is,
$$f\left(
\left[\begin{matrix}
a & b \\
c & d
\end{matrix}\right]
\right) = \left[\begin{matrix}
f(a) & f(b) \\
f(c) & f(d)
\end{matrix}\right]$$
The above allows us to get the activation equation for (each node of) the layer $l$ as:
$$\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l)
\end{eqnarray}$$
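As a small sketch of how this equation turns into a forward pass in code (sigmoid chosen just for illustration; the names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Apply a^l = sigma(w^l a^{l-1} + b^l) layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a
```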
Also, there is another kind of product of two matrices of the same dimensions, called the Hadamard product. This is nothing but the element-wise product of the two matrices.
$$ (A \odot B)_{ij} = (A)_{ij}(B)_{ij} $$
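In numpy, for instance, this is just the element-wise `*` between two arrays of the same shape, which is also how factors like $\sigma^{\prime}(z^l)$ in eqns $(2a)$ and $(3)$ get applied element by element:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A * B)   # Hadamard product: [[ 5 12]
               #                    [21 32]]
```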