"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Jan. 5, 2021
Author - manisar
Straightforward as it may seem, the idea developed and refined in the previous chapters has serious practical difficulties. We shall look into them in this chapter.
The equation that opened the gate of Sesame (Simsim) for us has all the potential to lead us astray and leave us fruitless. As innocent and self-explanatory as this equation may look, it has a lot of mischief up its sleeve. Let's look at it again.
$$ \boxed{\begin{align}& \text{Rate of change of } C \text{ w.r.t. any } w \\ & = \text{Output of the }\textit{in}\text{ node} \\
& \times \text{Rate of change of }C\text{ w.r.t. output of the }\textit{out}\text{ node} \\
& \times \text{how rapidly }\sigma\text{ is changing at the }\textit{out}\text{ node}
\end{align}} $$
$$\begin{align}
\text{Or, } \frac{\partial{C}}{\partial{w}} & = a.\frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}} = a.\frac{\partial{C}}{\partial{\sigma}}.\sigma^{\prime}(z) \label{1} \tag{1}\end{align}$$
Before we even look at the potential issues this equation is hinting at, there is something that needs to be clarified.
In the first chapter, we saw how something like a mean square error or a quadratic distance function is an acceptable choice for the cost function. These do the job nicely in some cases, but there are better candidates available. E.g. the quadratic cost function becomes non-convex when used with the sigmoid activation function, which means there are multiple minima - a real issue for our gradient descent procedure. Also, it does not punish misclassifications as much as we would want it to. The nail in the coffin comes with the amazing discovery that by replacing it with something carefully chosen, we can get rid of another, unrelated unwanted effect - the slow learning because of \( \sigma^{\prime}(z^L) \). See below. And after that, if you want to look at a formal reasoning for settling on the Cross Entropy cost function, check the supplemental section The Cross Entropy Cost Function.
In summary, we get very slow learning for weights connected to such nodes, i.e. for nodes whose output is close to 0 or 1 and where \( \sigma^{\prime}(z) \) is therefore nearly zero. And, since in classification problems the correct outputs are, in fact, close to 0 or 1, it means that we will have very slow learning for sure.
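To make the saturation concrete, here is a minimal sketch (my own illustrative code, not from this book) that evaluates \( \sigma^{\prime}(z) \) at a few values of \(z\), using the standard identity \( \sigma^{\prime}(z) = \sigma(z)(1-\sigma(z)) \) that is derived later in this section. Near the extremes the derivative is vanishingly small, which is exactly what throttles \( \partial{C}/\partial{w} \) in eqn. \((1)\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # standard identity: sigma'(z) = sigma(z) * (1 - sigma(z)), derived later in this section
    s = sigmoid(z)
    return s * (1.0 - s)

# from the "linear" middle of the sigmoid out to its saturated tail
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigma(z) = {sigmoid(z):.5f}   sigma'(z) = {sigmoid_prime(z):.6f}")

# sigma'(z) collapses towards 0 as sigma(z) approaches 1, so any gradient that
# carries a sigma'(z) factor becomes tiny -> slow learning.
```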
This problem is not solved simply by swapping $\sigma$ for some other activation function - many other choices cause the same effect. Instead, we go about changing the $C$ function from quadratic to something else! Yes, counter-intuitive as it may seem, changing $C$ is not that big a deal (especially when we have other reasons as well, see C3.2 Reconsidering Our Choice of the Cost Function), and there exist beautiful functions out there which, when used for $C$, simply cancel out the $\sigma^{\prime}(z)$ factor in eqn. \((3)\)! So, by changing $C$, we may be able to kill two birds with one stone. Let's see how.
If you want to be able to make sense of the new cost function I'm going to present below, you need to be familiar with information theory and the corresponding entropy equation. In any case, I highly recommend reading this when you have time - Information Theory - Rationale Behind Using Logarithm for Entropy, and Other Explanations.
For the sake of avoiding clutter, I've placed the reasoning for settling on the function shown below in the supplemental section The Cross Entropy Cost Function.
$$C(y,a) = - \frac{1}{n} \sum_x \left[ { y\ln(a) + (1-y)\ln(1-a) } \right] $$
Ok, a little peek into this function shows that it is convex (has one minimum), and that it is good at punishing misclassifications, but what about the slow learning caused by the very small values of \(\sigma^{\prime}(z)\)?
$$ \begin{alignat}{4}
\delta^L & = \nabla_{a}C \odot \sigma^{\prime}(z^L) \\[6pt]
& = \frac{\partial{}}{\partial{a}} \left( - \frac{1}{n}\sum_x \left[{ y\ln(a) + (1-y)\ln(1-a) } \right] \right) . \sigma^{\prime}(z) && \quad \dots\text{omitting the L superscript} \\[6pt]
& = - \frac{1}{n} \frac{\partial{}}{\partial{\sigma}} \sum_x \left[ { y\ln(\sigma) + (1-y)\ln(1-\sigma) } \right] . \sigma^{\prime}(z) && \quad \dots \text{using } a=\sigma=\sigma(z) \\[6pt]
& = - \frac{1}{n} \sum_x \left[ { \frac{y}{\sigma} - \frac{(1-y)}{(1-\sigma) } } \right] . \sigma^{\prime}(z) && \\[6pt]
& = \frac{1}{n} \sum_x{ \frac{(\sigma - y) }{\sigma(1-\sigma)} . \sigma^{\prime}(z) } \tag{4}\label{4} && \\[6pt]
\text{Now, }\sigma^{\prime}(z) & = \frac{\partial}{\partial{z}} \left( \frac{1}{1+e^{-z}} \right) && \\[6pt]
& = \frac{e^{-z}}{(1+e^{-z})^2} = \sigma.\frac{e^{-z}}{(1+e^{-z})} = \sigma.\frac{1+e^{-z}-1}{(1+e^{-z})}&& \\[6pt]
& = \sigma. (1-\sigma)&& \\[6pt]
&\text{Substituting this value of } \sigma^{\prime}(z) \text{ in eqn. } (4) \text{ above:} \\[6pt]
\delta^L & = \frac{1}{n} \sum_x{ \frac{(\sigma - y) }{\sigma(1-\sigma)} . \sigma^{\prime}(z) } && \\[6pt]
& = \frac{1}{n} \sum_x{ \frac{(\sigma - y) }{\cancel{\sigma(1-\sigma)}} . \cancel{\sigma(1-\sigma) }} && \\[6pt]
&= \frac{1}{n} \sum_x{ (\sigma(z) - y) }
\end{alignat}$$
This is magic! The error term (\(\delta\)) now depends only on \(\sigma(z) - y = a - y\). \(a - y\) is nothing else but the error in the output, and so \(\delta\), and hence \(\frac{\partial{C}}{\partial{w}}\), is free of \( \sigma^{\prime}(z) \), and of the curse of it being small for extreme values of \(\sigma\). Our learning doesn't need to slow down.
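If you want to see the cancellation numerically, here is a minimal sketch (my own illustrative code; a single output node and a single training example, so the \( \frac{1}{n}\sum_x \) disappears). The quadratic cost's error term keeps the \( \sigma^{\prime}(z) \) factor and nearly vanishes at a saturated but wrong output, while the cross-entropy error term is simply \(a - y\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 5.0                 # weighted input of the output node (badly saturated)
a = sigmoid(z)          # network says ~0.993 ...
y = 0.0                 # ... but the correct answer is 0

sigma_prime = a * (1.0 - a)               # sigma'(z) = sigma(z) * (1 - sigma(z))

delta_quadratic = (a - y) * sigma_prime   # delta^L for the quadratic cost (a - y)^2 / 2
delta_cross_ent = a - y                   # delta^L for the cross-entropy cost

print(f"a = {a:.4f},  sigma'(z) = {sigma_prime:.6f}")
print(f"quadratic     delta^L = {delta_quadratic:.6f}   (tiny -> slow learning)")
print(f"cross-entropy delta^L = {delta_cross_ent:.6f}   (stays large -> fast learning)")
```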
So, we have solved at least one problem with a suitable choice of a cost function.
So much for the choice of cost function. But a significant issue related to \(C\) still remains - the calculation of \(\nabla C\), which is needed for calculating \(\delta^L\). More specifically, it is the \(\sum_x\) (where the range of \(x\) is the complete training set) that appears in \(\nabla C\). To understand this, it will help to revisit the difference between features and training inputs.
A feature is a characteristic - just one characteristic - of the input or output. E.g. for an input having $m$ independent variables ($x$'s) and an output having $n$ dependent variables ($y$'s), we say our input has $m$ features and our output has $n$. Generally, each input feature has its impact on all the output features. That means that for calculating each output feature, we need to feed all the input features to the system.
The training data is a set of elements where each element is itself a 2-element tuple. The first element of the tuple is an input entity having $m$ features, and the second element is the corresponding output having $n$ features. $n$ is generally the same as the number of nodes in the last layer. Let's assume we have $R$ examples at our disposal (i.e. the length of the training set is $R$).
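As a concrete (made-up) illustration of this layout: below, `X` holds $R$ training inputs with $m$ features each, and `Y` holds the corresponding outputs with $n$ features each. The numbers are arbitrary, chosen only to make the shapes visible.

```python
import numpy as np

R, m, n = 1000, 4, 2        # R examples, m input features, n output features (arbitrary)

X = np.random.rand(R, m)    # each row is one training input with m features
Y = np.random.rand(R, n)    # each row is the corresponding output with n features

# The training set is the R pairs (X[j], Y[j]); here is one such 2-element tuple:
example_0 = (X[0], Y[0])
print(X.shape, Y.shape)                          # (1000, 4) (1000, 2)
print(example_0[0].shape, example_0[1].shape)    # (4,) (2,)
```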
The cost function $C$ spans across both the features and the inputs of the training data.
$C^p$ - for an output feature $y^p$ - is calculated using the predicted and the provided values of that output feature for all the training examples. E.g. it can be a sum or average of $(y^p_j - \hat{y}^p_j)^2$ where $j$ runs from $1$ to $R$. Let's, for simplicity, assume our output has only one feature. Now, $C$ becomes the average (say) of $(y_j - \hat{y}_j)^2$ for $1 \le j \le R$. We actually don't need $C$ as such, we need $\partial{C_j}/\partial{a^L_j}$ (where $j$ signifies one example at a time), but the average (or summation) part is still there.
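Sticking with the single-output-feature case, here is a minimal sketch (my own illustrative code) of the computation just described: the quadratic cost averaged over all $R$ examples, plus the per-example derivative that back-propagation actually consumes.

```python
import numpy as np

R = 1000
y     = np.random.rand(R)        # provided (correct) outputs y_j for all R training examples
y_hat = np.random.rand(R)        # predicted outputs for the same R examples

C = np.mean((y - y_hat) ** 2)    # C = average of (y_j - y_hat_j)^2 over j = 1..R

# What back-propagation actually needs is the per-example derivative w.r.t. the prediction;
# the averaging over j is still there, baked into the 1/R factor.
dC_dyhat = 2.0 * (y_hat - y) / R
print(C, dC_dyhat.shape)         # a scalar cost, and one derivative per training example
```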
Now, for getting better and better results, we want the training data as big as possible, but that immediately has a detrimental effect on the calculation of $C$. In every pass (iteration), we need to feed all the features present in the input of all the training examples, and use the predicted results along with the given output in all the training examples in order to calculate $C$. This can and does cause serious slowness.
To combat this issue, we bring in stochastic gradient descent. In this, instead of running $j$ from $1$ to $R$, we run it from $1$ to $s$, where $s$ is the size of a randomly chosen subset of the $R$ training examples - hoping that this sample will do some justice in representing the complete training set. The sample cannot, of course, represent the complete set exactly. We make up for this by running more iterations of the complete learning process. Thus, we have, roughly speaking, turned the spatial complexity of our formulae into a temporal one - in each iteration there are fewer inputs to be considered for calculating $\nabla C$ and other things, but there are more iterations. We keep on repeating the process until the samples collectively exhaust the training set - and then we say that an epoch of the training is complete.
Once again, in normal gradient descent, we would feed all the inputs to our neural network at once, calculate $\nabla C$, use that to improve our $w$'s, and so on. In stochastic gradient descent, we feed inputs from a small randomly chosen subset of the training set to the network, learn the $w$'s, and move on to the next subset until we exhaust the training set.
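Schematically, the two strategies differ only in how much of the training set each gradient step sees. Here is a minimal sketch of one stochastic (mini-batch) epoch; `compute_grad` and `update_weights` are placeholders standing in for whatever the network actually does, not functions defined anywhere in this book.

```python
import numpy as np

def sgd_epoch(X, Y, weights, batch_size, learning_rate, compute_grad, update_weights):
    """One epoch of stochastic gradient descent over the training set (X, Y)."""
    R = X.shape[0]
    order = np.random.permutation(R)               # shuffle the R examples
    for start in range(0, R, batch_size):          # walk through disjoint mini-batches...
        batch = order[start:start + batch_size]    # ...until the whole training set is exhausted
        grad = compute_grad(X[batch], Y[batch], weights)        # nabla C from s examples only
        weights = update_weights(weights, grad, learning_rate)  # one small learning step
    return weights                                 # the epoch is complete
```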
Coming soon...
This is an ongoing effort which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
See information theory in a new light. Understand intuitively how the information of an event naturally relates to its probability and encoding. And, it's nice to know that understanding information theory helps with some aspects of machine learning as well.
Supplement to sections C3.2 Reconsidering Our Choice of the Cost Function and C3.3 Slow Learning Because of $\sigma^{\prime}(z^L)$, or $\partial{\sigma}/\partial{z^L}$.
Let's work with bits (binary numbers) for the time being as those are easier to picture in our mind. This means that we will be using \(\log_2\). In the end, we will revert to using \(\ln\).
We start with the entropy of a probability distribution (p.d.) that is given by:
$$H(p) = - \sum_i{p_i \log_2({p_i})}$$
This is actually the average number of bits we need to encode an outcome of this distribution, using an optimal encoding. If a theoretical p.d. has an entropy given by the above formula, and our observational predictions for the same system are given by the \(q_i\)'s, the average number of bits needed to encode an outcome when the encoding is based on the \(q_i\)'s is:
$$H(p,q) = - \sum_i{p_i \log_2({q_i})}$$
This is called the cross-entropy between the two p.d.'s.
If our observations were perfect - i.e. if they exactly overlapped with the ones given by theory - this number of bits would be the same as \(H(p)\). So, it is the difference in the average number of bits needed for encoding in the two scenarios that can tell us how different these are.
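A quick numerical look at the two quantities (the distributions below are made up for illustration) shows that \(H(p,q) \ge H(p)\), with equality exactly when \(q\) matches \(p\); the gap is the quantity named just below.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])     # "theoretical" (correct) distribution
q = np.array([0.4, 0.4, 0.2])       # "observed" (predicted) distribution

H_p  = -np.sum(p * np.log2(p))      # entropy of p, in bits
H_pq = -np.sum(p * np.log2(q))      # cross-entropy between p and q, in bits

print(f"H(p)   = {H_p:.4f} bits")            # 1.5000
print(f"H(p,q) = {H_pq:.4f} bits")           # ~1.5719
print(f"H(p,q) - H(p) = {H_pq - H_p:.4f}")   # >= 0; this difference is defined next
```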
And, this average difference in number of bits (known as K-L Divergence or relative entropy) is given by:
$$\begin{align}
D_{KL}(P||Q) & = H(p,q) - H(p)\\[6pt]
& = - \sum_i{p_i \log_2({q_i})} + \sum_i{p_i \log_2({p_i})}\\[6pt]
& = - \sum_i{p_i \log_2\left(\frac{q_i}{p_i}\right)}
\end{align}$$
The above is a very good measure of how different two p.d.'s are. In practice, one of the p.d.'s has the probabilities as given by theory (say correct ones), and the other has probabilities as found by observation (erroneous). Can we use it as our cost function \(C\)?
We do have a similar picture here. For every iteration, we get a predicted output which we can compare against the given output (from the training data). And, for classification problems, we have only two possible outcomes for each output node in each iteration - one that the node is on, i.e. its output is \(1\), and the other that it is off. The former has probability equal to \(\hat{y}\), and the other has the probability \(1-\hat{y}\). For these outcomes, our actual (correct) probabilities are given by \(y\) and \(1-y\) respectively. This means that for one iteration the cross entropy would be:
$$H(y,\hat{y}) = -y \log_2(\hat{y}) - (1-y)\log_2(1-\hat{y}) $$
But what about the KL Divergence? If you look at it once again:
$$\begin{align}
D_{KL}(P||Q) & = H(p,q) - H(p)\\[6pt]
\implies H(p,q) & = D_{KL}(P||Q) + H(p)\\[6pt]
\end{align}$$
In our case, \(H(p)\) is the entropy of the training data. And this doesn't change when we change our weights or outputs. That means that if we take the gradient of \(D_{KL}(P||Q)\) w.r.t. \(w\)'s or \(a\)'s, it will be equal to the gradient of \(H(p,q)\). And, the only way the cost function appears in our equations is through its gradient. This means using \(H(p,q)\) as our new cost function is equivalent to using \(D_{KL}(P||Q)\). The former is much simpler, so let's go with that.
$$C_x(y,\hat{y}) = -y \log_2(\hat{y}) - (1-y)\log_2(1-\hat{y}) $$
The \(\hat{y}\)'s are simply the outputs of the last layer. Replacing them by \(a\), we get, for one iteration (i.e. for a specific \(x\) and \(y\)):
$$C_x(y,a) = -y \log_2(a) - (1-y)\log_2(1-a) $$
We can now revert to using \(\ln\) instead of \(\log_2\) (this only scales the cost by a constant factor of \(\ln 2\), which doesn't matter for optimization), and for finalizing \(C\) over all the iterations it will be sensible to take the average:
$$C(y,a) = - \frac{1}{n} \sum_x \left[ { y\ln(a) + (1-y)\ln(1-a) } \right] $$
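Putting the final formula into code (a minimal sketch of my own; `a` and `y` are arrays of predicted and correct outputs over the training examples, and the small `eps` only guards against \(\ln(0)\) in floating point, which the formula itself doesn't need):

```python
import numpy as np

def cross_entropy_cost(y, a, eps=1e-12):
    """C(y, a) = -(1/n) * sum_x [ y*ln(a) + (1-y)*ln(1-a) ]."""
    a = np.clip(a, eps, 1.0 - eps)          # avoid ln(0) in floating point
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Example: confident-and-right vs confident-and-wrong predictions
y = np.array([1.0, 0.0, 1.0, 0.0])
print(cross_entropy_cost(y, np.array([0.99, 0.01, 0.98, 0.02])))   # small cost
print(cross_entropy_cost(y, np.array([0.10, 0.90, 0.20, 0.80])))   # large cost
```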
Return to Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning