"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
July 1, 2021
Author - manisar
RNNs are normal neural networks with memory. Let us see what motivates this idea, and how it unfolds.
We have seen how we train a normal dense neural network and then feed it inputs - each consisting of a set of features - in order to get a predicted output. For a given input, its features are the only thing that determines the output we are going to get for it (once the weights and biases are learnt).
But there are situations in which it's not only the current input's features, but also the features of some or all of the previous inputs that determine the output.
For example, after training a neural network on dozens of novels, we can ask it what is supposed to come after 'the', i.e. what the output would be when the input is the word 'the'.
As silly as it may sound, our question is not completely open-ended, e.g. we wouldn't expect 'the Roberts', or 'the Elizabeth'. But the sheer number of valid words that can come after 'the' means that our network is going to give a huge number of possible outputs with very similar probabilities.
This is different from so many other cases where a single input results in only a single output. E.g. given the shape, size, color and number of petals of a flower, we get a very specific name for the flower, no matter where in our chain of inputs this particular input appears.
That is not the case with our question. The input 'the' can have many different valid outputs. But now, if we change our input to 'jump the', suddenly the number of possible outputs is reduced to less than a dozen. Or, if we further change the input to 'Take your time, do not jump the', the required output is almost certain to be 'gun'.
But a normal dense neural network accepts only one input at a time (yes, in batches, but each input is handled independently), and it doesn't care about the previous inputs, either in training or in prediction. Such a network will be able to neither train nor predict differently for the single input 'the' and the longer input 'Take your time, do not jump the'!
This is where the need for RNNs arises.
In an RNN, both during training and prediction, we make the previous inputs count.
Leaving aside the implementation details, the idea is very simple at the outset. But once we start thinking about it, some immediate difficulties appear. One is: when talking about previous inputs, how many previous inputs should we be using? To tackle this, we actually use the immediately previous output (technically, the state vector) and not the previous inputs. This way, while generating the previous output, the RNN layer (conveniently called a cell) would have taken into account both the previous-to-previous output and input, and so, effectively, we are bringing into the current step of learning the inputs from more than one previous step of learning.
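To make this concrete, here is a minimal sketch in plain NumPy (not any particular library's API) of what such a cell computes at each time step. The sizes, the tanh activation, and the names `Wx`, `Wh`, `b` are purely illustrative assumptions, not something prescribed by this chapter.

```python
import numpy as np

# Illustrative sizes only.
n_features, n_units = 3, 5
Wx = np.random.randn(n_features, n_units)   # weights applied to the current input
Wh = np.random.randn(n_units, n_units)      # weights applied to the previous output (state)
b = np.zeros(n_units)

def rnn_step(x_t, h_prev):
    """One time step: the new state depends on the current input AND the previous state."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

h = np.zeros(n_units)                        # initial state: nothing remembered yet
for x_t in np.random.randn(7, n_features):   # 7 inputs arriving one after another
    h = rnn_step(x_t, h)                     # the state carries over to the next step
```

Notice that the only thing carried forward is the cell's own previous output `h`, exactly as described above.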
The pictures shown below show a neural network with one RNN and one dense layer. What we are seeing are snapshots of this network at different points of time. At any given point of time, the single RNN cell takes the current input along with the output it generated for the last input, sends its output to the dense layer, and also stores that output for its own use when the next input arrives.
The phenomenon of a cell sending its output to its future self is described as the cell remembering its state. This nomenclature helps us differentiate two types of RNNs as we'll soon see.
There are two important things to note:
In order to accommodate the two factors above, and force our network to learn sequences instead of individual inputs, we do this:
One thing this new mechanism (of windows) changes immediately is the dimensionality of the input. In a normal neural network, the input fed to the network at a time is two dimensional - a batch of, say, $n$ inputs, each consisting of $x$ features. Now, each of our inputs is two-fold. Our inputs now are windows of inputs, with each input within a window having its own features. So, an RNN cell accepts a three dimensional input - [batch][time-steps-per-window][features-per-input] - or, more verbally: a batch of windows, each window containing a number of time steps (one input per time step), and each of those inputs carrying its own features.
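As a small, hedged illustration of this three dimensional shape, here is one way a plain univariate series might be sliced into overlapping windows; the series, the window length, and the overlap are all made up for the example.

```python
import numpy as np

series = np.arange(100, dtype=float)   # a toy series with a single feature per time step
window = 10                            # time steps (inputs) per window

# Overlapping windows: window i contains series[i : i + window].
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
X = X[..., np.newaxis]                 # add the features axis

print(X.shape)                         # (90, 10, 1) -> [windows in batch][time steps][features]
```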
A little reflection shows that this must come with a performance hit. Instead of just refining the learning using the current input's features, the cell is now also required to use the output from its previous incarnation. So yes, RNNs are slower. For convenience, we can call and visualize the training that uses the current input's features as spatial training, and the one that uses the previous inputs as temporal training. While CNNs take spatial training to the next level, RNNs do their part by bringing in temporal training.
Have a look at this quick video - RNN Basic Plumbing on this page.
What we have seen above is the starting argument for RNNs. If you think about it, we haven't done anything conceptually different yet from a normal neural network. Yes, the implementation is slightly different, but under the hood it's the same thing. Let me explain how.
In a simple dense neural network, we shuffle and then divide our inputs into batches and feed them to the network. Each input has a set of features, which is the sole determinant of its output. Now, we have collected our individual inputs into (possibly overlapping) windows and are passing those to the network. While handling a window, the RNN cell is using its memory - the outputs it produced while working on the earlier inputs of the same window. But across windows, the network is working exactly like a normal dense network: between windows, the outputs are not shared, and the RNN cell is said to have forgotten or discarded its state.
In place of the RNN cell, we might have simply devised a mini-network with its own dense sub-layers, equal in number to the window size, where each sub-layer would have handled a specific input from the window passed to this cell and passed its output to the waiting next sub-layer. After processing the last input in the window, our mini-network would simply discard the output (generated by its last sub-layer) after passing it to the next dense layer. It would have worked exactly like an RNN cell.
Effectively, instead of training our network on individual inputs, we have now trained it for windows of inputs having fixed length.
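Here is a rough sketch of that equivalence, again in plain NumPy with made-up sizes: one window is walked through step by step by the same cell, a fresh zero state is used for every window, and only the final state reaches the dense layer.

```python
import numpy as np

n_features, n_units, n_outputs = 1, 5, 1
Wx = np.random.randn(n_features, n_units)
Wh = np.random.randn(n_units, n_units)
W_dense = np.random.randn(n_units, n_outputs)

def run_window(window):                  # window: (time_steps, n_features)
    h = np.zeros(n_units)                # a fresh (zero) state for every window
    for x_t in window:                   # one "sub-layer" per time step
        h = np.tanh(x_t @ Wx + h @ Wh)
    return h @ W_dense                   # only the final state is passed to the dense layer

y_hat = run_window(np.random.randn(10, n_features))
```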
This approach has a few visible drawbacks:
Since the RNN cells do not retain or remember their states between the windows of inputs, such RNNs are called stateless RNNs. What we get out of a network having such RNNs are outputs similar to what we get from a normal dense network. Also note that each of these outputs is a vector, and hence such RNNs are also called sequence-to-vector RNNs. We can make stateless RNNs return sequences, and then they become sequence-to-sequence RNNs.
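The chapter itself doesn't prescribe a library, but if you happen to use Keras (an assumption on my part), the sequence-to-vector versus sequence-to-sequence distinction roughly maps to the return_sequences flag; the layer sizes below are arbitrary.

```python
import tensorflow as tf

# Sequence-to-vector: the RNN layer hands out only its final output - one vector per window.
seq_to_vec = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(5, input_shape=(10, 1)),   # 10 time steps, 1 feature each
    tf.keras.layers.Dense(1),
])

# Sequence-to-sequence: the RNN layer emits an output at every time step of the window.
seq_to_seq = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(5, return_sequences=True, input_shape=(10, 1)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])
```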
There is another approach we can use, which is this:
What we are doing now is actually training the network on a sequence of inputs (of any length), instead of training it on multiple mini-series of fixed length. Since the RNN cells do not forget their state, we now call them stateful RNNs. Note that now each window is used only once for training (in one epoch), which makes the training very slow. (Remember that in stateless RNNs, by having overlapping windows, each individual input was used multiple times per epoch.)
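For reference, a minimal sketch of the stateful variant, again assuming Keras purely as an illustration: the state is kept across consecutive batches, so the windows must arrive in order, the batch size must be fixed, and the memory is cleared by hand once the whole sequence has been seen.

```python
import tensorflow as tf

# Stateful variant (illustrative): the cell keeps its state across consecutive batches.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(5, stateful=True, batch_input_shape=(1, 10, 1)),
    tf.keras.layers.Dense(1),
])

# After each full pass over the sequence (one epoch), the memory is reset explicitly:
model.reset_states()
```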
This is an ongoing effort which means more chapters are being added and the current content is being enhanced wherever needed.
For attribution or use in other works, please cite this book as: Manisar, "Neural Networks - What's Happening? An Intuitive Introduction to Machine Learning", SKKN Press, 2020.
The working of a recurrent neural network, RNN for short, is easier to understand if we see it as a simple extension of a normal neural network that includes the time dimension.
Let's see how.
Let's start with a simple neural network.
The one shown in the video has an input and an output layer, and two hidden layers with 5 neurons each that do all the computations.
We'll later be converting one of these hidden layers into an RNN layer.
Also, in this neural network, each input has 3 features or variables, and each output has 4.
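If we were to write this network down in code (Keras shown here purely as an example, and the activations are my assumption since the video doesn't specify them), it might look like this:

```python
import tensorflow as tf

# 3 input features -> two hidden layers of 5 neurons each -> 4 output variables.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(4),
])
```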
RNNs have to deal with time which means we'll be needing arrows in our diagram, so let's bring in arrows.
And we'll need the horizontal axis for showing both the time and the inputs, so let's rotate the diagram now.
We can reduce some of the visual complexity by hiding the neurons in the hidden layers, and seeing each of these layers like a black box.
Further, we can combine the input and output features, i.e. let's represent all the $x$'s as a single $x$ and all the $\hat{y}$'s as a single $\hat{y}$.
Once again, what we are seeing in the video currently is a single input, a set of three features, going through the two hidden layers successively, which spit out a single output consisting of four variables.
Note that each of our hidden layers receives its input only from one place - from the layer preceding it.
The current input is the only thing (apart from the layer's internal weights and biases) that shapes the output.
This is true in many situations, e.g. if a neural network predicts a house's price based on its carpet area, then, once the weights and biases are learnt, the carpet areas of the houses around the house in question don't affect this prediction.
But we know there are situations where previous inputs along with the current input shape the output.
The most obvious examples are sentence completion and time-series forecasting.
In RNNs, we simply allow this to happen - we make one or more layers of our network accept not only the current input but also be affected by the previous inputs.
Let's make the first hidden layer of our network an RNN layer, conveniently called a cell.
What we get is this.
The subscripts on $x$ show the order of the inputs w.r.t. time.
If you look at any RNN cell, while receiving the current input, it is also getting the output it generated for the previous input.
So, what we are looking at is not seven RNN cells, but just one at different points of time.
The minimum number of previous inputs needed for generating the first output can be configured, and it is 6 in our case.
In our case, $\hat{y}^6$ is the first output; I've used the superscript 6 only to match it with its input.
Once we have the pipeline working, the outputs keep coming out normally.
So, this is how RNN works.
Once again, if we look at the network at any given point of time, we see that our whole network comprises just the highlighted part shown.
It is like a normal neural network, except that the RNN layer, in addition to accepting the current input, also accepts the output from its previous incarnation.
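To round this off, here is the same sketch with the first hidden layer swapped for an RNN cell (again an illustrative Keras version, not something the video itself shows as code):

```python
import tensorflow as tf

# The first hidden layer is now an RNN cell, so inputs arrive as windows of time steps,
# each time step carrying the 3 features; the rest of the network is unchanged.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(5, input_shape=(None, 3)),  # [any number of time steps][3 features]
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(4),
])
```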
Return to Types of Neural Networks