"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Jan. 5, 2021
Author - manisar
Straightforward as it may seem, the idea developed and refined in the previous chapters has serious practical difficulties. We shall look into them in this chapter.
The equation that opened the gate of Sesame (Simsim) for us has all the potential to lead us astray and leave us fruitless. As innocent and self-explanatory as this equation may look, it has a lot of mischief up its sleeve. Let's look at it again.
$$ \boxed{\begin{align}& \text{Rate of change of } C \text{ w.r.t. any } w \\ & = \text{Output of the }\textit{in}\text{ node} \\
& \times \text{Rate of change of }C\text{ w.r.t. output of the }\textit{out}\text{ node} \\
& \times \text{how rapidly }\sigma\text{ is changing at the }\textit{out}\text{ node}
\end{align}} $$
$$\begin{align}
\text{Or, } \frac{\partial{C}}{\partial{w}} & = a.\frac{\partial{C}}{\partial{\sigma}}.\frac{\partial{\sigma}}{\partial{z}} = a.\frac{\partial{C}}{\partial{\sigma}}.\sigma^{\prime}(z) \label{1} \tag{1}\end{align}$$
Before we even look at the potential issues this equation is hinting at, there is something that needs to be clarified.
In the first chapter, we saw how something like a mean square error or a quadratic distance function is an acceptable choice for the cost function. These do the job nicely in some cases, but there are better candidates available. E.g. the quadratic cost function becomes non-convex when used with the sigmoid activation function, which means there are multiple minima - a real issue for our gradient descent procedure. Also, it does not punish misclassifications as much as we would want it to. The nail in the coffin comes with the amazing discovery that by replacing it with something carefully chosen, we can get rid of another, unrelated unwanted effect - the slow learning because of \( \sigma^{\prime}(z^L) \). See below. And after that, if you want to look at a formal reasoning for settling on the Cross Entropy cost function, check the supplemental section The Cross Entropy Cost Function.
In summary, we get very slow learning for weights connected to such nodes, i.e. for nodes whose output is close to 0 or 1 and where \( \sigma^{\prime}(z) \) is therefore nearly zero. And, since in classification problems the correct outputs are, in fact, close to 0 or 1, it means that we will have very slow learning for sure.
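To make the saturation concrete, here is a minimal sketch (my own illustrative code, not from this book) that evaluates \( \sigma^{\prime}(z) \) at a few values of \(z\), using the standard identity \( \sigma^{\prime}(z) = \sigma(z)(1-\sigma(z)) \) that is derived later in this section. Near the extremes the derivative is vanishingly small, which is exactly what throttles \( \partial{C}/\partial{w} \) in eqn. \((1)\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # standard identity: sigma'(z) = sigma(z) * (1 - sigma(z)), derived later in this section
    s = sigmoid(z)
    return s * (1.0 - s)

# from the "linear" middle of the sigmoid out to its saturated tail
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigma(z) = {sigmoid(z):.5f}   sigma'(z) = {sigmoid_prime(z):.6f}")

# sigma'(z) collapses towards 0 as sigma(z) approaches 1, so any gradient that
# carries a sigma'(z) factor becomes tiny -> slow learning.
```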
This problem is not solved simply by swapping $\sigma$ for some other activation function - many other choices cause the same effect. Instead, we go about changing the $C$ function from quadratic to something else! Yes, counter-intuitive as it may seem, changing $C$ is not that big a deal (especially when we have other reasons as well, see C3.2 Reconsidering Our Choice of the Cost Function), and there exist beautiful functions out there which, when used for $C$, simply cancel out the $\sigma^{\prime}(z)$ factor in eqn. \((3)\)! So, by changing $C$, we may be able to kill two birds with one stone. Let's see how.
If you want to be able to make sense of the new cost function I'm going to present below, you need to be familiar with information theory and the corresponding entropy equation. In any case, I highly recommend reading this when you have time - Information Theory - Rationale Behind Using Logarithm for Entropy, and Other Explanations.
For the sake of avoiding clutter, I've placed the reasoning for settling on the function shown below in the supplemental section The Cross Entropy Cost Function.
$$C(y,a) = - \frac{1}{n} \sum_x \left[ { y\ln(a) + (1-y)\ln(1-a) } \right] $$
Ok, a little peek into this function shows that it is convex (has one minimum), and that it is good at punishing misclassifications, but what about the slow learning caused by the very small values of \(\sigma^{\prime}(z)\)?
$$ \begin{alignat}{4}
\delta^L & = \nabla_{a}C \odot \sigma^{\prime}(z^L) \\[6pt]
& = \frac{\partial{}}{\partial{a}} \left( - \frac{1}{n}\sum_x \left[{ y\ln(a) + (1-y)\ln(1-a) } \right] \right) . \sigma^{\prime}(z) && \quad \dots\text{omitting the L superscript} \\[6pt]
& = - \frac{1}{n} \frac{\partial{}}{\partial{\sigma}} \sum_x \left[ { y\ln(\sigma) + (1-y)\ln(1-\sigma) } \right] . \sigma^{\prime}(z) && \quad \dots \text{using } a=\sigma=\sigma(z) \\[6pt]
& = - \frac{1}{n} \sum_x \left[ { \frac{y}{\sigma} - \frac{(1-y)}{(1-\sigma) } } \right] . \sigma^{\prime}(z) && \\[6pt]
& = \frac{1}{n} \sum_x{ \frac{(\sigma - y) }{\sigma(1-\sigma)} . \sigma^{\prime}(z) } \tag{4}\label{4} && \\[6pt]
\text{Now, }\sigma^{\prime}(z) & = \frac{\partial}{\partial{z}} \left( \frac{1}{1+e^{-z}} \right) && \\[6pt]
& = \frac{e^{-z}}{(1+e^{-z})^2} = \sigma.\frac{e^{-z}}{(1+e^{-z})} = \sigma.\frac{1+e^{-z}-1}{(1+e^{-z})}&& \\[6pt]
& = \sigma. (1-\sigma)&& \\[6pt]
&\text{Substituting this value of } \sigma^{\prime}(z) \text{ in eqn. } (4) \text{ above:} \\[6pt]
\delta^L & = \frac{1}{n} \sum_x{ \frac{(\sigma - y) }{\sigma(1-\sigma)} . \sigma^{\prime}(z) } && \\[6pt]
& = \frac{1}{n} \sum_x{ \frac{(\sigma - y) }{\cancel{\sigma(1-\sigma)}} . \cancel{\sigma(1-\sigma) }} && \\[6pt]
&= \frac{1}{n} \sum_x{ (\sigma(z) - y) }
\end{alignat}$$
This is magic! The error term (\(\delta\)) now depends only on \(\sigma(z) - y = a - y\). \(a - y\) is nothing else but the error in the output, and so \(\delta\), and hence \(\frac{\partial{C}}{\partial{w}}\), is free of \( \sigma^{\prime}(z) \), and of the curse of it being small for extreme values of \(\sigma\). Our learning doesn't need to slow down.
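If you want to see the cancellation numerically, here is a minimal sketch (my own illustrative code; a single output node and a single training example, so the \( \frac{1}{n}\sum_x \) disappears). The quadratic cost's error term keeps the \( \sigma^{\prime}(z) \) factor and nearly vanishes at a saturated but wrong output, while the cross-entropy error term is simply \(a - y\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 5.0                 # weighted input of the output node (badly saturated)
a = sigmoid(z)          # network says ~0.993 ...
y = 0.0                 # ... but the correct answer is 0

sigma_prime = a * (1.0 - a)               # sigma'(z) = sigma(z) * (1 - sigma(z))

delta_quadratic = (a - y) * sigma_prime   # delta^L for the quadratic cost (a - y)^2 / 2
delta_cross_ent = a - y                   # delta^L for the cross-entropy cost

print(f"a = {a:.4f},  sigma'(z) = {sigma_prime:.6f}")
print(f"quadratic     delta^L = {delta_quadratic:.6f}   (tiny -> slow learning)")
print(f"cross-entropy delta^L = {delta_cross_ent:.6f}   (stays large -> fast learning)")
```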
So, we have solved at least one problem with a suitable choice of a cost function.
So much for the choice of cost function. But a significant issue related to \(C\) still remains - the calculation of \(\nabla C\), which is needed for calculating \(\delta^L\). More specifically, it is the \(\sum_x\) (where the range of \(x\) is the complete training set) that appears in \(\nabla C\). To understand this, it will help to revisit the difference between features and training inputs.
A feature is a characteristic - just one characteristic - of the input or output. E.g. for an input having $m$ independent variables ($x$'s) and an output having $n$ dependent variables ($y$'s), we say our input has $m$ features and our output has $n$. Generally, each input feature has its impact on all the output features. That means that for calculating each output feature, we need to feed all the input features to the system.
The training data is a set of elements where each element is itself a 2-element tuple. The first element of the tuple is an input entity having $m$ features, and the second element is the corresponding output having $n$ features. $n$ is generally the same as the number of nodes in the last layer. Let's assume we have $R$ examples at our disposal (i.e. the length of the training set is $R$).
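As a concrete (made-up) illustration of this layout: below, `X` holds $R$ training inputs with $m$ features each, and `Y` holds the corresponding outputs with $n$ features each. The numbers are arbitrary, chosen only to make the shapes visible.

```python
import numpy as np

R, m, n = 1000, 4, 2        # R examples, m input features, n output features (arbitrary)

X = np.random.rand(R, m)    # each row is one training input with m features
Y = np.random.rand(R, n)    # each row is the corresponding output with n features

# The training set is the R pairs (X[j], Y[j]); here is one such 2-element tuple:
example_0 = (X[0], Y[0])
print(X.shape, Y.shape)                          # (1000, 4) (1000, 2)
print(example_0[0].shape, example_0[1].shape)    # (4,) (2,)
```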
The cost function $C$ spans across both the features and the inputs of the training data.
$C^p$ - for an output feature $y^p$ - is calculated using the predicted and the provided values of that output feature for all the training examples. E.g. it can be a sum or average of $(y^p_j - \hat{y}^p_j)^2$ where $j$ runs from $1$ to $R$. Let's, for simplicity, assume our output has only one feature. Now, $C$ becomes the average (say) of $(y_j - \hat{y}_j)^2$ for $1 \le j \le R$. We actually don't need $C$ as such, we need $\partial{C_j}/\partial{a^L_j}$ (where $j$ signifies one example at a time), but the average (or summation) part is still there.
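Sticking with the single-output-feature case, here is a minimal sketch (my own illustrative code) of the computation just described: the quadratic cost averaged over all $R$ examples, plus the per-example derivative that back-propagation actually consumes.

```python
import numpy as np

R = 1000
y     = np.random.rand(R)        # provided (correct) outputs y_j for all R training examples
y_hat = np.random.rand(R)        # predicted outputs for the same R examples

C = np.mean((y - y_hat) ** 2)    # C = average of (y_j - y_hat_j)^2 over j = 1..R

# What back-propagation actually needs is the per-example derivative w.r.t. the prediction;
# the averaging over j is still there, baked into the 1/R factor.
dC_dyhat = 2.0 * (y_hat - y) / R
print(C, dC_dyhat.shape)         # a scalar cost, and one derivative per training example
```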
Now, for getting better and better results, we want the training data as big as possible, but that immediately has a detrimental effect on the calculation of $C$. In every pass (iteration), we need to feed all the features present in the input of all the training examples, and use the predicted results along with the given output in all the training examples in order to calculate $C$. This can and does cause serious slowness.
To combat this issue, we bring in stochastic gradient descent. In this, instead of running $j$ from $1$ to $R$, we run it from $1$ to $s$, where $s$ is the size of a randomly chosen subset of the $R$ training examples - hoping that this sample will do some justice in representing the complete training set. The sample cannot, of course, represent the complete set exactly. We make up for this by running more iterations of the complete learning process. Thus, we have, roughly speaking, turned the spatial complexity of our formulae into a temporal one - in each iteration there are fewer inputs to be considered for calculating $\nabla C$ and other things, but there are more iterations. We keep on repeating the process until the samples collectively exhaust the training set - and then we say that an epoch of the training is complete.
Once again, in normal gradient descent, we would feed all the inputs to our neural network at once, calculate $\nabla C$, use that to improve our $w$'s, and so on. In stochastic gradient descent, we feed inputs from a small randomly chosen subset of the training set to the network, learn the $w$'s, and move on to the next subset until we exhaust the training set.
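Schematically, the two strategies differ only in how much of the training set each gradient step sees. Here is a minimal sketch of one stochastic (mini-batch) epoch; `compute_grad` and `update_weights` are placeholders standing in for whatever the network actually does, not functions defined anywhere in this book.

```python
import numpy as np

def sgd_epoch(X, Y, weights, batch_size, learning_rate, compute_grad, update_weights):
    """One epoch of stochastic gradient descent over the training set (X, Y)."""
    R = X.shape[0]
    order = np.random.permutation(R)               # shuffle the R examples
    for start in range(0, R, batch_size):          # walk through disjoint mini-batches...
        batch = order[start:start + batch_size]    # ...until the whole training set is exhausted
        grad = compute_grad(X[batch], Y[batch], weights)        # nabla C from s examples only
        weights = update_weights(weights, grad, learning_rate)  # one small learning step
    return weights                                 # the epoch is complete
```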
Coming soon...
This is an ongoing effort which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
See information theory in a new light. Understand intuitively how the information of an event naturally relates to its probability and encoding. And, it's nice to know that understanding information theory helps with some aspects of machine learning as well.
Supplement to sections C3.2 Reconsidering Our Choice of the Cost Function and C3.3 Slow Learning Because of $\sigma^{\prime}(z^L)$, or $\partial{\sigma}/\partial{z^L}$.
Let's work with bits (binary numbers) for the time being as those are easier to picture in our mind. This means that we will be using \(\log_2\). In the end, we will revert to using \(\ln\).
We start with the entropy of a probability distribution (p.d.) that is given by:
$$H(p) = - \sum_i{p_i \log_2({p_i})}$$
This is actually the average number of bits we need to encode an outcome of this distribution, using an optimal encoding. If a theoretical p.d. has an entropy given by the above formula, and our observational predictions for the same system are given by the \(q_i\)'s, the average number of bits needed to encode an outcome when the encoding is based on the \(q_i\)'s is:
$$H(p,q) = - \sum_i{p_i \log_2({q_i})}$$
This is called the cross-entropy between the two p.d.'s.
If our observations were perfect - i.e. if they exactly overlapped with the ones given by theory - this number of bits would be the same as \(H(p)\). So, it is the difference in the average number of bits needed for encoding in the two scenarios that can tell us how different these are.
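A quick numerical look at the two quantities (the distributions below are made up for illustration) shows that \(H(p,q) \ge H(p)\), with equality exactly when \(q\) matches \(p\); the gap is the quantity named just below.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])     # "theoretical" (correct) distribution
q = np.array([0.4, 0.4, 0.2])       # "observed" (predicted) distribution

H_p  = -np.sum(p * np.log2(p))      # entropy of p, in bits
H_pq = -np.sum(p * np.log2(q))      # cross-entropy between p and q, in bits

print(f"H(p)   = {H_p:.4f} bits")            # 1.5000
print(f"H(p,q) = {H_pq:.4f} bits")           # ~1.5719
print(f"H(p,q) - H(p) = {H_pq - H_p:.4f}")   # >= 0; this difference is defined next
```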
And, this average difference in number of bits (known as K-L Divergence or relative entropy) is given by:
$$\begin{align}
D_{KL}(P||Q) & = H(p,q) - H(p)\\[6pt]
& = - \sum_i{p_i \log_2({q_i})} + \sum_i{p_i \log_2({p_i})}\\[6pt]
& = - \sum_i{p_i \log_2\left(\frac{q_i}{p_i}\right)}
\end{align}$$
The above is a very good measure of how different two p.d.'s are. In practice, one of the p.d.'s has the probabilities as given by theory (say correct ones), and the other has probabilities as found by observation (erroneous). Can we use it as our cost function \(C\)?
We do have a similar picture here. For every iteration, we get a predicted output which we can compare against the given output (from the training data). And, for classification problems, we have only two possible outcomes for each output node in each iteration - one that the node is on, i.e. its output is \(1\), and the other that it is off. The former has probability equal to \(\hat{y}\), and the other has the probability \(1-\hat{y}\). For these outcomes, our actual (correct) probabilities are given by \(y\) and \(1-y\) respectively. This means that for one iteration the cross entropy would be:
$$H(y,\hat{y}) = -y \log_2(\hat{y}) - (1-y)\log_2(1-\hat{y}) $$
But what about the KL Divergence? If you look at it once again:
$$\begin{align}
D_{KL}(P||Q) & = H(p,q) - H(p)\\[6pt]
\implies H(p,q) & = D_{KL}(P||Q) + H(p)\\[6pt]
\end{align}$$
In our case, \(H(p)\) is the entropy of the training data. And this doesn't change when we change our weights or outputs. That means that if we take the gradient of \(D_{KL}(P||Q)\) w.r.t. \(w\)'s or \(a\)'s, it will be equal to the gradient of \(H(p,q)\). And, the only way the cost function appears in our equations is through its gradient. This means using \(H(p,q)\) as our new cost function is equivalent to using \(D_{KL}(P||Q)\). The former is much simpler, so let's go with that.
$$C_x(y,\hat{y}) = -y \log_2(\hat{y}) - (1-y)\log_2(1-\hat{y}) $$
The \(\hat{y}\)'s are simply the outputs of the last layer. Replacing them by \(a\), we get, for one iteration (i.e. for a specific \(x\) and \(y\)):
$$C_x(y,a) = -y \log_2(a) - (1-y)\log_2(1-a) $$
We can now revert to using \(\ln\) instead of \(\log_2\) (this only scales the cost by a constant factor of \(\ln 2\), which doesn't matter for optimization), and for finalizing \(C\) over all the iterations it will be sensible to take the average:
$$C(y,a) = - \frac{1}{n} \sum_x \left[ { y\ln(a) + (1-y)\ln(1-a) } \right] $$
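Putting the final formula into code (a minimal sketch of my own; `a` and `y` are arrays of predicted and correct outputs over the training examples, and the small `eps` only guards against \(\ln(0)\) in floating point, which the formula itself doesn't need):

```python
import numpy as np

def cross_entropy_cost(y, a, eps=1e-12):
    """C(y, a) = -(1/n) * sum_x [ y*ln(a) + (1-y)*ln(1-a) ]."""
    a = np.clip(a, eps, 1.0 - eps)          # avoid ln(0) in floating point
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Example: confident-and-right vs confident-and-wrong predictions
y = np.array([1.0, 0.0, 1.0, 0.0])
print(cross_entropy_cost(y, np.array([0.99, 0.01, 0.98, 0.02])))   # small cost
print(cross_entropy_cost(y, np.array([0.10, 0.90, 0.20, 0.80])))   # large cost
```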
Return to Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning