"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Sept. 6, 2021
Author - manisar
If you already know about LSTMs, you may want to jump to this interactive graph that shows how the cell state and the hidden state change depending upon the previous states and current input.
RNNs introduced the idea of a hidden state, which can be seen as the memory of a cell. As revolutionary as this may seem at the outset, it suffers from a major shortcoming.
The RNN cell remembers only its last state, which could be drastically different from its previous states depending upon the last input.
So, for example, let's say we want to generate IMDb-like scores by reading movie reviews, and we get something like:
The acting was very bad, direction was Ok, but I loved the movie.
An RNN would probably rate it 5/5 because of the strong word loved appearing in the latter half of the sentence.
We know that the score, though positive, should be less than 5, perhaps 4.5 or 4, because there is a very strong dislike sentiment in the beginning - the acting was very bad.
RNNs face problems with long-range dependencies, i.e. cases where the gap between the relevant information (context) and the place where it's needed is very large. For example, a score predicted from the review The acting was very bad, direction was Ok, but I loved the movie would not be very accurate if done by an RNN.
It's beyond any doubt that for any serious memory-based predictions or classifications, we need some kind of longer-term memory.
That's what LSTM brings to the table. An LSTM is a modified RNN that has a long-term memory in addition to the short-term memory.
Before we go into the specifics, let's see what we can deduce from theory itself.
For a normal NN (neural network) cell, we know that its output is governed by only one thing (for a given set of weights and biases):
- the current input, $x_t$
Then, we bring in RNNs. An RNN cell's output is its short-term memory (STM) - or its hidden state $h_t$. And this STM is modified by two things:
- its own previous value, $h_{t-1}$
- the current input, $x_t$
Now, we have introduced another factor - the LTM, or long-term memory. The LSTM cell has two outputs - STM and LTM, the latter also known as the cell state $c_t$. Naturally, the LTM should be influenced by:
- its own previous value, $c_{t-1}$
- the STM (which, in turn, carries the influence of the current input)
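To make this contrast concrete, here is a minimal sketch of what each kind of cell consumes and produces. The function and parameter names (`W`, `b`, and so on) are illustrative assumptions, not notation from this book:

```python
import numpy as np

def nn_cell(x, W, b):
    # Plain NN cell: the output depends only on the current input
    # (for a given set of weights and biases).
    return np.tanh(W @ x + b)

def rnn_cell(h_prev, x, W_h, W_x, b):
    # RNN cell: the new STM (hidden state) depends on the previous
    # STM and the current input.
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def lstm_cell(c_prev, h_prev, x, params):
    # LSTM cell: carries two states forward. The current input reaches
    # the LTM only through the STM pathway.
    # (Filled in further below, once the gates are introduced.)
    raise NotImplementedError
```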
If you think about memory in human beings, isn't this how we remember things?
Current inputs affect our short-term memory, and the short-term memory keeps on affecting our long-term memory (the current input does not directly affect the LTM).
We can start with something along the same lines for LTM.
The line at the top represents the LTM, or long-term memory, of the cell ($c_t$). The bottom line represents the STM, or short-term memory, of the cell ($h_t$).
At this point we just know that we have to make STM affect LTM - but in what way, and using what techniques, is still unknown, and hence is denoted by a black box.
Ok, what can we do in the black box?
Based on STM, there are two things we can do to the LTM:
- make it forget some of what it currently holds
- make it learn something new
Something like this:
Now, if we look deeper, unlike forgetting, learning for LTM actually comprises two steps.
In forgetting, we would just ask LTM to forget some of its current state.
In learning, we would ask LTM to learn some part of something new. Not only do we tell it how much to learn, we give LTM the new object of learning as well. We also use the LTM to adjust the final state of STM before outputting it. Our cell should now look like this:
For deciding the degree of both forgetting and learning, we use a sigmoid ($sig$) layer - this layer squishes STM, or $h_t$, to between 0 and 1. For telling LTM what to learn, we use a $\tanh$ layer, which squishes STM to between -1 and 1.
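As a quick numeric illustration (a minimal sketch; none of this is specific to the book's diagrams), here is how the two squashing functions behave:

```python
import numpy as np

def sigmoid(z):
    # Squishes any real number into (0, 1): 0 = fully forget / learn nothing,
    # 1 = fully keep / learn everything offered.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # ~[0.0067 0.2689 0.5    0.7311 0.9933]
print(np.tanh(z))   # ~[-0.9999 -0.7616 0.    0.7616 0.9999]
```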
Lastly, we do some adjustment of STM based on our revised LTM, and this is our final picture:
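Putting all the pieces together in code: the following is a minimal sketch of one LSTM step using the standard equations. The parameter names (`Wf`, `bf`, ...) and the choice to concatenate $h_{t-1}$ with $x_t$ are common conventions assumed here, not prescriptions from this book:

```python
import numpy as np

def sigmoid(z):
    # Squishes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x, p):
    # One LSTM step. p holds the weight matrices and biases.
    z = np.concatenate([h_prev, x])     # previous STM + current input
    f = sigmoid(p["Wf"] @ z + p["bf"])  # forget gate: how much LTM to keep
    i = sigmoid(p["Wi"] @ z + p["bi"])  # input gate: how much to learn
    g = np.tanh(p["Wg"] @ z + p["bg"])  # what to learn (candidate content)
    c = f * c_prev + i * g              # forget, then learn (component-wise)
    o = sigmoid(p["Wo"] @ z + p["bo"])  # output gate: how much of LTM shows in STM
    h = o * np.tanh(c)                  # new STM, adjusted by the revised LTM
    return c, h
```

With hidden size $n$ and input size $m$, each `W` here would be an $n \times (n + m)$ matrix and each `b` a length-$n$ vector.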
The three dashed rectangles shown in the diagram are aptly called the "forget gate", "input gate", and "output gate" respectively.
Component-wise operation means adding or multiplying the respective components of the inputs.
This means that a component-wise multiplication (used in the forget gate) with a vector having most of its components as zero will result in an almost-zero vector. Thus, the forget gate helps in deciding how much to forget.
The input gate uses component-wise addition (with LTM), and thus it adds or subtracts something from the current value of LTM.
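A tiny worked example (with made-up numbers) of the two component-wise operations:

```python
import numpy as np

ltm = np.array([0.9, -0.4, 0.7])            # current LTM (made-up values)

forget = np.array([1.0, 0.1, 0.0])          # forget-gate output per component
print(forget * ltm)                         # [ 0.9  -0.04  0.  ] - 3rd component wiped out

learn_amount = np.array([0.0, 0.8, 0.9])    # input-gate output
new_content  = np.array([0.5, -1.0, 0.3])   # tanh candidate ("object of learning")
print(forget * ltm + learn_amount * new_content)
# [ 0.9  -0.84  0.27] - learning adds to (or subtracts from) what is left
```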
This ingenious way of using layers and operations results in a behavior that is very natural and logical to think about. Look at the table below to see if this makes sense to you.
This table shows the values we get if we use these functions and gates to predict the two states (cell and hidden) from the given states and inputs.
You can see how these values change according to the STM, input and LTM in the graph given here.
Note that:
- FG and IG denote the outputs of the forget gate and the input gate, respectively.
- +VE/-VE denote strongly positive/negative STM values; +ve/-ve denote mildly positive/negative ones.
- $\approx +0$ and $\approx -0$ denote values close to zero, approached from the positive and negative side respectively.
STM | FG | IG | Forget | Learn | Comment |
---|---|---|---|---|---|
+VE | 1 | 1 | Nothing | Much | Current LTM needs a big boost - does not forget, and learns a lot |
+ve | $\approx$ 1 | $\approx$ +0 | Little | Little | Current LTM still gets a boost - forgets a little, learns a little (common pathway) |
-ve | $\approx$ +0 | $\approx$ -0 | Little more than +ve | Little | Current LTM needs to be slightly tweaked - forgets a little more, learns a little (common pathway) |
-VE | 0 | 0 | Almost all | Little | Current LTM needs to be modified a lot - forgets almost everything, and starts learning afresh |
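As a sanity check of the two extreme rows (made-up numbers, with the gates pinned exactly to their limits rather than the table's approximate values):

```python
import numpy as np

c_prev = np.array([0.6, 0.6])   # current LTM (made-up values)
g      = np.array([0.9, 0.9])   # tanh candidate ("object of learning")

# First row of the table: FG = 1, IG = 1 -> forget nothing, learn a lot.
print(1.0 * c_prev + 1.0 * g)   # [1.5 1.5] - the LTM gets a big boost

# Last row: FG = 0, IG = 0 -> forget (almost) everything; the LTM is
# reset and can start learning afresh on subsequent steps.
print(0.0 * c_prev + 0.0 * g)   # [0. 0.]
```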
This is an ongoing effort, which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
Here you can observe the functioning of an LSTM by interacting with the graph given below.
Drag the green point Prev. STM + input - which is the current STM, or hidden state $h_t$ - and the purple point Prev. LTM - which is the previous cell state $c_{t-1}$ - vertically to see how they affect the different parameters of an LSTM. You can modify the gate outputs as well.
The way these values change (along with the outputs from the gates) is quite in line with what common sense would suggest, as described in the table above.