Jan. 5, 2021
Author - manisar
Straightforward as it may seem, the idea developed and refined in the previous chapters has serious practical difficulties. We shall look into them in this chapter.
The equation that opened the gate of Sesame (Simsim) for us has all the potential to lead us astray and leave us fruitless. As innocent and self-explanatory as this equation may look, it has a lot of mischief up its sleeve. Let's look at it again.
Before we even look at the potential issues this equation is hinting at, there is something that needs to be clarified.
In the first chapter, we saw how something like the mean square error, or a quadratic distance function, is an acceptable choice for the cost function. These do the job nicely in some cases, but considerably better candidates are available. E.g. the quadratic cost function becomes non-convex when used with the sigmoid activation function, which means there are multiple minima - a real issue for our gradient descent. Also, it does not punish misclassifications as much as we would want it to. The nail in the coffin comes with the discovery that by replacing it with something carefully chosen, we can get rid of another, unrelated unwanted effect - the slow learning caused by the very small values of the sigmoid's derivative, σ′(z), at saturated nodes.
In summary, we get very slow learning for the weights connected to such nodes, i.e. nodes whose outputs are close to 0 or 1. And since, in classification problems, the correct outputs are in fact 0 or 1, we are guaranteed very slow learning.
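To make the slowdown concrete, here is a minimal numerical sketch (the weight and input values are arbitrary illustrations, not from the text). For a single sigmoid neuron with the quadratic cost, the gradient with respect to a weight carries a factor of σ′(z), which is tiny when the output saturates near 0 or 1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_grad_w(x, y, w, b):
    """Gradient of C = (a - y)^2 / 2 w.r.t. w, for a = sigmoid(w*x + b)."""
    z = w * x + b
    a = sigmoid(z)
    sigma_prime = a * (1.0 - a)      # derivative of the sigmoid
    return (a - y) * sigma_prime * x

# A badly wrong, saturated neuron: output near 1, target 0.
g_saturated = quadratic_grad_w(x=1.0, y=0.0, w=6.0, b=0.0)
# A mildly wrong neuron near the middle of the sigmoid.
g_midrange = quadratic_grad_w(x=1.0, y=0.0, w=0.5, b=0.0)

print(g_saturated)   # tiny, despite the large error
print(g_midrange)    # much larger
```

The saturated neuron's error is far bigger, yet its gradient is far smaller - exactly the slow learning described above.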
This problem is not simply solved by changing
If you want to be able to make sense of the new cost function I'm going to present below, you need to be familiar with information theory and the corresponding entropy equation. In any case, I highly recommend reading this when you have time - Information Theory - Rationale Behind Using Logarithm for Entropy, and Other Explanations.
For the sake of avoiding clutter, I've placed the reasoning for settling on the function shown below in the supplemental section The Cross Entropy Cost Function↕.
Ok, a little peek into this function shows that it is convex (has one minimum), and that it is good at punishing misclassifications - but what about the slowness of learning caused by the very small values of σ′(z)?
This is magic! The error term (a − y) survives on its own - the troublesome σ′(z) factor cancels out entirely, so the neuron learns fast precisely when it is badly wrong.
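A quick numerical check of this cancellation (a one-neuron sketch, with an arbitrarily chosen z): differentiating the cross-entropy cost numerically with respect to z gives back exactly a − y, with no σ′(z) factor in sight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(a, y):
    """Cross-entropy cost for one node: -(y ln a + (1-y) ln(1-a))."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Numerically differentiate C w.r.t. z and compare with (a - y).
z, y, eps = 6.0, 0.0, 1e-6
dC_dz = (cross_entropy(sigmoid(z + eps), y)
         - cross_entropy(sigmoid(z - eps), y)) / (2 * eps)

print(dC_dz, sigmoid(z) - y)   # the two agree: sigma'(z) has cancelled out
```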
So, we have solved at least one problem with a suitable choice of a cost function.
So much for the choice of cost function. But a significant issue still remains that is related to the sheer size of the training data.
A feature is a characteristic - just one characteristic - of the input or output. E.g. for an input having
The training data is a set of elements where each element is itself a 2-element tuple. The first element of the tuple is an input entity (described by its features), and the second is the corresponding correct output.
The cost function
Now, for getting better and better results, we want the training data to be as big as possible, but that immediately has a detrimental effect on the calculation of the gradient of the cost function, which has to be averaged over every single training example.
To combat this issue, we bring in stochastic gradient descent. In this, instead of computing the gradient over the entire training set at every step, we estimate it from a small, randomly chosen mini-batch of inputs.
Once again, in normal gradient descent, we would feed all the inputs to our neural network at once, calculate the cost and its gradient averaged over all of them, and only then take a single descent step.
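As a sketch of the contrast, here is mini-batch stochastic gradient descent on a toy one-weight model (the data, learning rate, and batch size are made up for illustration - not from the text):

```python
import random

# Toy data generated from y = 3x; we recover w with mini-batch SGD.
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(1000)]]

w, lr, batch_size = 0.0, 0.1, 32
for step in range(500):
    batch = random.sample(data, batch_size)          # random mini-batch
    # Gradient of the quadratic cost (w*x - y)^2 / 2, averaged over the batch.
    grad = sum((w * x - y) * x for x, y in batch) / batch_size
    w -= lr * grad                                   # one descent step

print(w)   # close to 3.0
```

Each step touches only 32 of the 1000 examples, yet the noisy gradient estimates still steer w to the right value - that is the whole bargain of stochastic gradient descent.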
Coming soon...
This is an ongoing effort which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
Learn about LSTMs, and see why they work the way they do by interacting with one!
It's astonishing to see that by using a very simple mechanism, we can, to a fair extent, generate the pattern that long- and short-term memory are supposed to follow.
If you already know about LSTMs, you may want to jump to this interactive graph that shows how the cell state and the hidden state change depending upon the previous states and current input.
RNNs introduced the idea of hidden state which can be seen as the memory of a cell. As revolutionary as it may seem at the outset, this suffers from a major shortcoming.
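To see the idea (and the shortcoming) in miniature, here is a hypothetical one-number RNN cell with arbitrarily chosen weights - the hidden state carries a memory of past inputs, but that memory fades quickly:

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.9, b=0.0):
    """One step of a scalar RNN cell: the new hidden state mixes the
    current input with the previous hidden state (the cell's 'memory')."""
    return math.tanh(w_x * x + w_h * h_prev + b)

h = 0.0
for x in [1.0, 0.0, 0.0, 0.0]:    # a single impulse, then silence
    h = rnn_step(x, h)
    print(round(h, 4))             # the impulse's trace shrinks every step
```

The hidden state keeps an echo of the impulse, but since it is squashed and re-scaled at every step, the echo dies out after a few steps - a toy version of the short memory that LSTMs were designed to fix.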
The RNN cell remembers …
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples.
This is considered to be the Hello World of machine learning programming. That is, any learner's first machine learning model ought to be one that can tell the numeral digits apart by looking at their shape.
Here, I've used a pre-trained model to predict digits from 0 to 9 that you may draw on the canvas below.
Many thanks …
In this chapter, we'll be looking at recurrent neural networks (RNNs).
RNNs are normal neural networks with memory. Let us see what motivates this idea, and how it unfolds.
We have seen how we train a normal dense neural network and then feed it inputs - each consisting of a set of features - in order to get a predicted output. For a given input, its features are the only thing that determine the output we are going to get for it (once the weights …
In this chapter, we will be sharpening our theoretical tools and sneaking our way into the mathematics of neural networks.
From traditional regression to neural networks - it's not as big a leap as you might think. In this book, let's get a peek into this transition while appreciating how the animal kingdom is already using this strategy. We will be taking help from our friend - intuition - time and again.
We saw in the last chapter how by evaluating
As good as it may sound, we are still far from being able to use the constructs we have built so far for real predictions. The main problem is …
In this chapter, we will be looking at the basics - the idea of prediction, using traditional regression and moving towards learning based methods.
Machine learning is about prediction.
In a sense, all of computer programming is that - in pretty much all the code that we write, we want to get to a value (or a set of values) by feeding input to our code. But, since we use deterministic formulae to calculate the exact results, we generally don't call it prediction. In such cases, there exists a perfect, known relationship between the input and the output, and …
See the Information Theory in a new light. Understand intuitively how the information of an event naturally relates to its probability and encoding
And, it's nice to know that understanding information theory helps in grasping some aspects of machine learning as well.
If you work in, or have interest in Information Technology or Mathematics, then Information Theory is something you must be acquainted with. If you haven't heard of it, check the section Information Theory - Quick Introduction on this page. What I am providing below is an intuitive build-up which is generally not present in most explanations of the Information Theory - including the one that came from its founder.
If you are in a hurry, you may …
Supplement to sections C3.2 Reconsidering Our Choice of the Cost Function↕ and C3.3 Slow Learning Because of σ′(z)↕
Let's work with bits (binary numbers) for the time being as those are easier to picture in our mind. This means that we will be using logarithms to the base 2.
We start with the entropy of a probability distribution (p.d.) that is given by:

H(p) = − Σᵢ pᵢ log₂ pᵢ
This is actually the average number of bits we need to encode the states of this distribution. If a theoretical p.d. p has an entropy given by the above formula, and our observational predictions for the same system are given by another p.d. q, then the average number of bits we end up using becomes:

H(p, q) = − Σᵢ pᵢ log₂ qᵢ
This is called the cross-entropy between the two p.d.'s.
If our observations were perfect - i.e. if they were exactly overlapping with the ones given by theory - this number of bits would be the same as the entropy itself. Otherwise, we end up spending some extra bits on average.
And, this average difference in number of bits (known as the K-L Divergence, or relative entropy) is given by:

KL(p ‖ q) = H(p, q) − H(p) = Σᵢ pᵢ log₂ (pᵢ / qᵢ)
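In code (working in bits, i.e. base-2 logarithms, with two made-up distributions for illustration):

```python
import math

def entropy(p):
    """Average bits needed under the true p.d. p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits used when encoding with the (possibly wrong) p.d. q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """The extra bits: zero exactly when p and q coincide."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]   # "theoretical" distribution
q = [0.4, 0.4, 0.2]     # "observed" distribution

print(entropy(p))
print(cross_entropy(p, q))
print(kl_divergence(p, q))
```

Note that kl_divergence(p, q) comes out positive for any mismatched q, and collapses to 0 when q equals p - which is what makes it a usable measure of how far apart two p.d.'s are.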
The above is a very good measure of how different two p.d.'s are. In practice, one of the p.d.'s has the probabilities given by theory (say, the correct ones), and the other has the probabilities found by observation (the erroneous ones). Can we use it as our cost function?
We do have a similar picture here. For every iteration, we get a predicted output which we can compare against the given output (from the training data). And, for classification problems, we have only two possible outcomes for each output node in each iteration - one that the node is on, i.e. its output is 1, and the other that it is off, i.e. its output is 0.
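For a single output node, this two-outcome picture reduces the cross-entropy to its binary form. A small sketch with made-up activations shows how confidently wrong predictions get punished:

```python
import math

def node_cost(a, y):
    """Binary cross-entropy for one output node, with target y (0 or 1)
    and predicted activation a strictly between 0 and 1."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

print(node_cost(0.9, 1))     # nearly right: small cost
print(node_cost(0.1, 1))     # badly wrong: large cost
print(node_cost(0.001, 1))   # confidently wrong: punished very heavily
```

The cost grows without bound as the prediction gets confidently wrong - exactly the strong punishment of misclassifications the quadratic cost was missing.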
But what about the KL Divergence? If you look at it once again:
In our case,
The
We can now revert to using