"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Sept. 6, 2021
Author - manisar
If you already know about LSTMs, you may want to jump to this interactive graph that shows how the cell state and the hidden state change depending upon the previous states and current input.
RNNs introduced the idea of a hidden state, which can be seen as the memory of a cell. As revolutionary as this may seem at the outset, it suffers from a major shortcoming.
The RNN cell remembers only its last state, which could be drastically different from its previous states depending upon the last input.
So, for example, let's say we want to generate IMDb-like scores by reading movie reviews, and we get something like:
The acting was very bad, direction was Ok, but I loved the movie.
An RNN would probably rate it 5/5 because of the strong word loved appearing in the latter half of the sentence.
We know that the score, though positive, should be less than 5, perhaps 4.5 or 4, because there is a very strong dislike sentiment in the beginning - the acting was very bad.
RNNs face problems with long-range dependencies, i.e. cases where the gap between the relevant information (context) and the place where it's needed is very large. For example, a score predicted from the review The acting was very bad, direction was Ok, but I loved the movie would not be very accurate if done by an RNN.
It's beyond any doubt that for any serious memory-based predictions or classifications, we need some kind of longer-term memory.
That's what LSTM brings to the table. An LSTM is a modified RNN that has a long-term memory in addition to the short-term memory.
Before we go into the specifics, let's see what we can deduce from theory itself.
For a normal NN (neural network) cell, we know that its output is governed by only one thing (for a given set of weights and biases):
- the current input, $x_t$
Then, we bring in RNNs. An RNN cell's output is its short-term memory (STM) - or its hidden state $h_t$. And this STM is modified by two things:
- its own previous value, $h_{t-1}$
- the current input, $x_t$
Now, we have introduced another factor - the LTM, or long-term memory. The LSTM cell has two outputs - STM and LTM, the latter also known as the cell state $c_t$. Naturally, the LTM should be influenced by:
- its own previous value, $c_{t-1}$
- the STM (which, in turn, carries the influence of the current input)
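To make this contrast concrete, here is a minimal sketch of what each kind of cell consumes and produces. The function and parameter names (`W`, `b`, and so on) are illustrative assumptions, not notation from this book:

```python
import numpy as np

def nn_cell(x, W, b):
    # Plain NN cell: the output depends only on the current input
    # (for a given set of weights and biases).
    return np.tanh(W @ x + b)

def rnn_cell(h_prev, x, W_h, W_x, b):
    # RNN cell: the new STM (hidden state) depends on the previous
    # STM and the current input.
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def lstm_cell(c_prev, h_prev, x, params):
    # LSTM cell: carries two states forward. The current input reaches
    # the LTM only through the STM pathway.
    # (Filled in further below, once the gates are introduced.)
    raise NotImplementedError
```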
If you think about memory in human beings, isn't this how we remember things?
Current inputs affect our short-term memory, and the short-term memory keeps on affecting our long-term memory (the current input does not directly affect the LTM).
We can start with something along the same lines for LTM.
The line at the top represents the LTM, or long-term memory, of the cell ($c_t$). The bottom line represents the STM, or short-term memory, of the cell ($h_t$).
At this point we just know that we have to make STM affect LTM - but in what way, and using what techniques, is still unknown, and hence is denoted by a black box.
Ok, what can we do in the black box?
Based on STM, there are two things we can do to the LTM:
- make it forget some of what it currently holds
- make it learn something new
Something like this:
Now, if we look deeper, unlike forgetting, learning for LTM actually comprises two steps.
In forgetting, we would just ask LTM to forget some of its current state.
In learning, we would ask LTM to learn some part of something new. Not only do we tell it how much to learn, we give LTM the new object of learning as well. We also use the LTM to adjust the final state of STM before outputting it. Our cell should now look like this:
For deciding the degree of both forgetting and learning, we use a sigmoid ($sig$) layer - this layer squishes STM, or $h_t$, to between 0 and 1. For telling LTM what to learn, we use a $\tanh$ layer, which squishes STM to between -1 and 1.
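As a quick numeric illustration (a minimal sketch; none of this is specific to the book's diagrams), here is how the two squashing functions behave:

```python
import numpy as np

def sigmoid(z):
    # Squishes any real number into (0, 1): 0 = fully forget / learn nothing,
    # 1 = fully keep / learn everything offered.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # ~[0.0067 0.2689 0.5    0.7311 0.9933]
print(np.tanh(z))   # ~[-0.9999 -0.7616 0.    0.7616 0.9999]
```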
Lastly, we do some adjustment of STM based on our revised LTM, and this is our final picture:
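Putting all the pieces together in code: the following is a minimal sketch of one LSTM step using the standard equations. The parameter names (`Wf`, `bf`, ...) and the choice to concatenate $h_{t-1}$ with $x_t$ are common conventions assumed here, not prescriptions from this book:

```python
import numpy as np

def sigmoid(z):
    # Squishes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x, p):
    # One LSTM step. p holds the weight matrices and biases.
    z = np.concatenate([h_prev, x])     # previous STM + current input
    f = sigmoid(p["Wf"] @ z + p["bf"])  # forget gate: how much LTM to keep
    i = sigmoid(p["Wi"] @ z + p["bi"])  # input gate: how much to learn
    g = np.tanh(p["Wg"] @ z + p["bg"])  # what to learn (candidate content)
    c = f * c_prev + i * g              # forget, then learn (component-wise)
    o = sigmoid(p["Wo"] @ z + p["bo"])  # output gate: how much of LTM shows in STM
    h = o * np.tanh(c)                  # new STM, adjusted by the revised LTM
    return c, h
```

With hidden size $n$ and input size $m$, each `W` here would be an $n \times (n + m)$ matrix and each `b` a length-$n$ vector.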
The three dashed rectangles shown in the diagram are aptly called the "forget gate", "input gate", and "output gate" respectively.
Component-wise operation means adding or multiplying the respective components of the inputs.
This means that a component-wise multiplication (used in the forget gate) with a vector having most of its components as zero will result in an almost-zero vector. Thus, the forget gate helps in deciding how much to forget.
The input gate uses component-wise addition (with LTM), and thus it adds or subtracts something from the current value of LTM.
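A tiny worked example (with made-up numbers) of the two component-wise operations:

```python
import numpy as np

ltm = np.array([0.9, -0.4, 0.7])            # current LTM (made-up values)

forget = np.array([1.0, 0.1, 0.0])          # forget-gate output per component
print(forget * ltm)                         # [ 0.9  -0.04  0.  ] - 3rd component wiped out

learn_amount = np.array([0.0, 0.8, 0.9])    # input-gate output
new_content  = np.array([0.5, -1.0, 0.3])   # tanh candidate ("object of learning")
print(forget * ltm + learn_amount * new_content)
# [ 0.9  -0.84  0.27] - learning adds to (or subtracts from) what is left
```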
This ingenious way of using layers and operations results in a behavior that is very natural and logical to think about. Look at the table below to see if this makes sense to you.
This table shows the values we get if we use these functions and gates to predict the two states (cell and hidden) from the given states and inputs.
You can see how these values change according to the STM, input and LTM in the graph given here.
Note that:
- FG and IG denote the outputs of the forget gate and the input gate, respectively.
- +VE/-VE denote strongly positive/negative STM values; +ve/-ve denote mildly positive/negative ones.
- $\approx +0$ and $\approx -0$ denote values close to zero, approached from the positive and negative side respectively.
STM | FG | IG | Forget | Learn | Comment |
---|---|---|---|---|---|
+VE | 1 | 1 | Nothing | Much | Current LTM needs a big boost - does not forget, and learns a lot |
+ve | $\approx$ 1 | $\approx$ +0 | Little | Little | Current LTM still gets a boost - forgets a little, learns a little (common pathway) |
-ve | $\approx$ +0 | $\approx$ -0 | Little more than +ve | Little | Current LTM needs to be slightly tweaked - forgets a little more, learns a little (common pathway) |
-VE | 0 | 0 | Almost all | Little | Current LTM needs to be modified a lot - forgets almost everything, and starts learning afresh |
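As a sanity check of the two extreme rows (made-up numbers, with the gates pinned exactly to their limits rather than the table's approximate values):

```python
import numpy as np

c_prev = np.array([0.6, 0.6])   # current LTM (made-up values)
g      = np.array([0.9, 0.9])   # tanh candidate ("object of learning")

# First row of the table: FG = 1, IG = 1 -> forget nothing, learn a lot.
print(1.0 * c_prev + 1.0 * g)   # [1.5 1.5] - the LTM gets a big boost

# Last row: FG = 0, IG = 0 -> forget (almost) everything; the LTM is
# reset and can start learning afresh on subsequent steps.
print(0.0 * c_prev + 0.0 * g)   # [0. 0.]
```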
This is an ongoing effort, which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
Here you can observe the functioning of an LSTM by interacting with the graph given below.
Drag the green point Prev. STM + input - which is the current STM, or hidden state $h_t$ - and the purple point Prev. LTM - which is the previous cell state $c_{t-1}$ - vertically to see how they affect the different parameters of an LSTM. You can modify the gate outputs as well.
The way these values change (along with the outputs from the gates) is quite in line with what common sense would suggest, as described in the table above.