Everything about Recurrent Neural Networks (RNN)

Harsh
8 min read · Nov 9, 2020


Why do we use RNNs? An introduction

RNNs are widely used today for handling sequential data because standard neural networks do not give anywhere near satisfactory predictions on it. The reason is that a vanilla feedforward network cannot remember what it has already seen: it starts fresh with every training iteration, so while processing the current data it has no memory of what it saw in the previous iteration. When it comes to detecting associations and trends in data, this is a major drawback.

This is where RNNs come into the picture. RNNs have a unique architecture with hidden-state (memory) units that allow information to persist, which lets them model short-term dependencies. For this reason RNNs are widely used in time-series forecasting to recognize correlations and patterns in the data. To elaborate a little: RNNs share parameters across the sequence’s different positions / time indices / time steps, which allows them to generalize well to examples of different sequence lengths. An RNN is usually a better alternative to position-independent classifiers and to sequential models that treat each position differently.

How does an RNN share parameters? Each member of the output is produced using the same update rule applied to the previous outputs, and that update rule is typically the same neural-network layer applied at every time step.
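To make parameter sharing concrete, here is a minimal NumPy sketch (all names and sizes are purely illustrative): the same three parameters are reused at every time step, regardless of how long the sequence is.

```python
import numpy as np

# Toy dimensions: 3 input features, 4 hidden units (illustrative values).
n_features, n_hidden = 3, 4

# The *same* parameters are reused at every time step.
W_x = np.random.randn(n_features, n_hidden) * 0.1  # input-to-hidden weights
W_h = np.random.randn(n_hidden, n_hidden) * 0.1    # hidden-to-hidden weights
b   = np.zeros(n_hidden)                           # hidden bias

def step(h_prev, x_t):
    """One application of the shared update rule."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# A sequence of any length can be processed with the same three parameters.
sequence = np.random.randn(5, n_features)  # 5 time steps
h = np.zeros(n_hidden)                     # initial hidden state
for x_t in sequence:
    h = step(h, x_t)
```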

How will I practically use an RNN?

If you have been reading my articles, you would know that I once used an LSTM model to predict the stock market, so you might already know that an LSTM is a type of RNN. I used an LSTM for exactly this reason: stock-market training data is sequential, and networks that do not model sequences will not produce viable results on it. (Look at the article below for reference.)

If you have blindly built an LSTM model like this before and would like to understand how RNNs really work, you might want to stick around until the end of this article.

Understanding the architecture of RNN

Searching the internet about RNNs can be a bit overwhelming, so I will try to explain it with a small example here.

First things first: the most basic and important thing you need to know about RNNs is that their input needs to have 3 dimensions, namely batch size, number of steps, and number of features. Batch size is a machine-learning term that refers to the number of training examples utilized in one iteration. The number of steps is the number of time steps/segments you feed in one line of input of the data batch that goes into the RNN.

In TensorFlow, the RNN unit is called an “RNN cell”. The name itself has created a great deal of confusion: there are several questions on Stack Overflow asking whether an “RNN cell” refers to a single cell or to the entire layer. It is closer to the entire layer. The reason is that the connections in RNNs are recurrent, following a “feeding to itself” strategy: the RNN layer essentially consists of a single rolled RNN cell that unrolls according to the “number of steps” value (number of time steps/segments) that you provide.
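To see the cell-versus-layer distinction in code, here is a small TensorFlow sketch (all shapes are illustrative): the cell handles one time step, while wrapping it in `tf.keras.layers.RNN` unrolls it over the whole step dimension.

```python
import tensorflow as tf

# A "cell" computes a single time step: it maps (input at step t, previous
# hidden state) to (output at step t, new hidden state).
cell = tf.keras.layers.SimpleRNNCell(units=7)
x_t    = tf.random.normal((6, 3))   # one time step for a batch of 6
h_prev = [tf.zeros((6, 7))]         # previous hidden state
out_t, h_t = cell(x_t, h_prev)

# Wrapping the cell in tf.keras.layers.RNN gives the full layer, which
# unrolls the same cell over the "number of steps" dimension.
layer = tf.keras.layers.RNN(cell, return_sequences=True)
out_seq = layer(tf.random.normal((6, 5, 3)))  # unrolled over 5 steps
```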

As I described earlier, the ability to model short-term dependencies is the primary specialty of RNNs. This is because of the hidden state inside the RNN, which keeps information flowing from one unrolled RNN unit to the next. Each unrolled RNN unit has a hidden state; the hidden state of the current time step is computed from the hidden state of the previous time step and the current input. This way the network preserves information about what it saw at the previous time step while processing the current one. Also, keep in mind that every connection in an RNN has weights and biases; in some architectures the biases may be optional. This process will be clarified further in later sections of the article.

Since you now have the basic idea, let’s break down the execution process with an example. Say your batch size is 6, the size of the RNN (its number of units) is 7, the number of time steps/segments you include in one line of input is 5, and the number of features in one time step is 3. In that case, the shape of your input tensor (matrix) for one batch will look something like this:

Tensor shape of one batch = (6,5,3)

This is called the sliding windows approach and is used in time series analysis.
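As a minimal sketch of how such a batch could be built with a sliding window (the series and the helper function here are purely illustrative):

```python
import numpy as np

# Illustrative series: 10 time steps, 3 features per step.
series = np.random.randn(10, 3)

def sliding_windows(data, n_steps):
    """Slice overlapping windows of length n_steps from a (time, features) array."""
    return np.stack([data[i:i + n_steps] for i in range(len(data) - n_steps + 1)])

windows = sliding_windows(series, n_steps=5)
print(windows.shape)  # (6, 5, 3): a batch of 6 windows, 5 steps each, 3 features
```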

Note — I will not be discussing data preprocessing here

When the data is first fed into the RNN, it has a rolled architecture, as shown below:

But when the RNN starts processing the data, it unrolls and produces outputs as shown below:

Processing a batch

When you feed a batch of data into the RNN cell, it starts processing from the 1st line of input. The RNN cell then processes all the input lines of the batch sequentially and gives one output at the end that contains the outputs for all the input lines.

Processing a single line of input:

In order to process a line of input, the RNN cell unrolls “number of steps” times, as you can see in the figure above. Since I defined “number of steps” as 5, the RNN cell is unrolled 5 times.

Stay with me here; it goes something like this (a code sketch follows the list):

  • First, the initial hidden state (S), which is usually a vector of zeros, is multiplied by the hidden state weight (h), and the hidden state bias is added to the result. Meanwhile, the input at time step t ([1,2,3]) is multiplied by the input weight (I), and the input bias is added to that result. We obtain the hidden state at time step t by passing the sum of these two results through an activation function, usually tanh (f).
  • Then, the hidden state (S) at time step t is multiplied by the output weight (O) to obtain the output at time step t, and the output bias is added to the result.
  • When calculating the hidden state at time step t+1, the hidden state (S) at time step t is multiplied by the hidden state weight (h), and the hidden state bias is added to the result. As before, the input at time step t+1 ([4,5,6]) is multiplied by the input weight (I), and the input bias is added to that result. The sum of these two results is then passed through the activation function, usually tanh (f).
  • Then, the hidden state (S) at time step t+1 is multiplied by the output weight (O) to obtain the output at time step t+1, and the output bias is added to the result. As you can see, when producing the output at time step t+1, the network uses not only the input at time step t+1 but also, through the hidden state, the information from time step t.
  • Repeat this for all the remaining time steps.
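Here is a minimal NumPy sketch of the unrolling just described, using the example dimensions (5 time steps, 3 features, 7 units). The weight names mirror the letters used above (I, h, O), but the actual values are random and purely illustrative:

```python
import numpy as np

n_steps, n_features, n_units = 5, 3, 7

# Shared parameters, reused at every time step.
W_in  = np.random.randn(n_features, n_units) * 0.1  # input weight (I)
W_hid = np.random.randn(n_units, n_units) * 0.1     # hidden state weight (h)
W_out = np.random.randn(n_units, n_units) * 0.1     # output weight (O)
b_hid = np.zeros(n_units)                           # hidden state bias
b_out = np.zeros(n_units)                           # output bias

x = np.random.randn(n_steps, n_features)  # one line of input, e.g. [1,2,3], [4,5,6], ...
h = np.zeros(n_units)                     # initial hidden state (vector of zeros)

outputs = []
for t in range(n_steps):
    # Hidden state at step t: previous hidden state and current input, through tanh.
    h = np.tanh(h @ W_hid + x[t] @ W_in + b_hid)
    # Output at step t: hidden state times output weight, plus output bias.
    outputs.append(h @ W_out + b_out)

outputs = np.stack(outputs)   # 5 outputs of 7 values each
print(outputs.shape)          # (5, 7) -> (1, 5, 7) for one line of input
```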

After processing all the time steps of one line of input in the batch, we have 5 outputs of shape (1,7). Concatenating these outputs together gives the shape (1,5,7). Once all 6 input lines of the batch have been processed, we get 6 outputs of shape (1,5,7). Therefore, the final output for the whole batch has shape (6,5,7).
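If you want to check these shapes in practice, TensorFlow’s built-in `SimpleRNN` layer behaves the same way (the input values here are random and purely illustrative):

```python
import tensorflow as tf

# A batch matching the example: 6 lines of input, 5 time steps, 3 features.
batch = tf.random.normal((6, 5, 3))

# An RNN layer with 7 units that returns the output of every time step.
rnn = tf.keras.layers.SimpleRNN(units=7, return_sequences=True)

outputs = rnn(batch)
print(outputs.shape)  # (6, 5, 7): one (1, 5, 7) block per line of input
```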

Note: the hidden state weights, output weights and input weights have the same values across all the connections in an RNN.

Let’s dive deeper, shall we?

Now that we know the basics of RNNs, let’s have a look at the different variations of RNNs.

  • Variation 1 of RNN (basic form): hidden2hidden connections, sequence output.

The basic equations that define the above RNN are:
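In the standard notation (with U, W and V as the input-to-hidden, hidden-to-hidden and hidden-to-output weight matrices, and b, c as the bias vectors), they can be written as:

$$
\begin{aligned}
a^{(t)} &= b + W\,h^{(t-1)} + U\,x^{(t)} \\
h^{(t)} &= \tanh\big(a^{(t)}\big) \\
o^{(t)} &= c + V\,h^{(t)} \\
\hat{y}^{(t)} &= \operatorname{softmax}\big(o^{(t)}\big)
\end{aligned}
$$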

The total loss for a given sequence of x values paired with a sequence of y values is then just the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), …, x(t), you get the loss for the whole sequence by summing these up, as shown below.
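Written out, for a sequence of length τ:

$$
L\big(\{x^{(1)},\dots,x^{(\tau)}\},\{y^{(1)},\dots,y^{(\tau)}\}\big)
= \sum_{t} L^{(t)}
= -\sum_{t} \log p_{\text{model}}\big(y^{(t)} \mid x^{(1)},\dots,x^{(t)}\big)
$$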

  • Variation 2 of RNN: output2hidden connections, sequence output. As shown in Fig 10.4, it produces an output at each time step and has recurrent connections only from the output at one time step to the hidden units at the next time step.

Teacher forcing can be used to train RNNs that have only output2hidden connections, i.e. no hidden2hidden connections.
With teacher forcing, the model is trained to maximize the conditional probability of the current output y(t) given both the x sequence so far and the previous output y(t-1); that is, the gold-standard output of the previous time step is fed in during training.
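For a toy sequence of just two time steps, for instance, the log-likelihood being maximized factorizes as

$$
\log p\big(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}\big)
= \log p\big(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}\big)
+ \log p\big(y^{(1)} \mid x^{(1)}, x^{(2)}\big),
$$

so when predicting y(2) at training time, the ground-truth y(1) is fed into the model rather than the model’s own previous output.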

  • Variation 3 of RNN: hidden2hidden connections, single output. As in Fig 10.5, there are recurrent connections between hidden units; the network reads an entire sequence and then produces a single output. (A rough TensorFlow analogue follows below.)
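As a rough TensorFlow analogue of this variant (shapes are illustrative), leaving `return_sequences` at its default of `False` makes the layer read the whole sequence and emit only one output per sequence:

```python
import tensorflow as tf

# With return_sequences=False (the default), the layer reads the whole
# sequence and emits only the final output, one per sequence.
rnn = tf.keras.layers.SimpleRNN(units=7, return_sequences=False)

batch = tf.random.normal((6, 5, 3))   # (batch, steps, features)
print(rnn(batch).shape)               # (6, 7): a single output per sequence
```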

We’ve come to the end of the article now. In it, I discussed how data is represented and manipulated inside an RNN in TensorFlow, the different variants of RNNs, and more. With everything presented here, I hope you now have a clear understanding of how RNNs function.
