5.2 LSTM and GRU#
U5.02 - Long Short Term Memory RNN#
The main drawback of conventional RNNs is their inability to learn long-term dependencies, let alone capture long- and short-term dependencies at the same time.
Remember that in an RNN:
Therefore, during the training phase on one time series, the matrix \(\bf{V}\), which contains the weights of the feedback loop, is multiplied by itself \((\tau-1)\) times. Thus, if its values are close to zero, the resulting product vanishes. Conversely, if the weights of \(\bf{V}\) are too large, the product diverges (unless some regularization method is included). This makes conventional RNNs very unstable.
They are also very sensitive to the vanishing gradient phenomenon, although this can be mitigated by using ReLU or LeakyReLU activation functions.
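The effect of the repeated multiplication by \(\bf{V}\) can be illustrated numerically. The following is only a sketch under stated assumptions: the hidden size, the sequence length and the matrices themselves are hypothetical, and each matrix is a scaled random orthogonal matrix so that all of its singular values equal the chosen scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 8, 50  # hypothetical hidden size and sequence length

def norm_after_repeated_product(scale):
    # Random orthogonal matrix scaled so that every singular value equals `scale`
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    V = scale * Q
    # V multiplied by itself (tau - 1) times
    P = np.linalg.matrix_power(V, tau - 1)
    # Spectral norm (largest singular value) of the product
    return np.linalg.norm(P, 2)

small = norm_after_repeated_product(0.9)  # singular values slightly below 1
large = norm_after_repeated_product(1.1)  # singular values slightly above 1
print("scale 0.9 ->", small)
print("scale 1.1 ->", large)
```

With a scale slightly below 1 the product shrinks toward zero, and with a scale slightly above 1 it grows by orders of magnitude, which is the instability described above.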
LSTMs are a type of RNN proposed to tackle these problems. They were introduced in 1997 and are based on a different type of basic unit called a cell.
The cells use a cumulative-average principle called the Exponentially Weighted Moving Average (EWMA), originally proposed for a type of unit called leaky units. EWMA takes into account more or less information from the past depending on a parameter \(\beta\). The update rule is: \(\mu^{(t)} \leftarrow \beta \mu^{(t-1)} + (1 - \beta)\upsilon^{(t)}\).
import numpy as np
import matplotlib.pyplot as plt
# make a hat function, and add noise
x = np.linspace(0,1,100)
x = np.hstack((x,x[::-1]))
x += np.random.normal( loc=0, scale=0.1, size=200 )
plt.plot( x, 'k', alpha=0.5, label='Raw' )
Beta1 = 0.8
Beta2 = 0.5
x1 = np.zeros(200)
x2 = np.copy(x1)
for i in range(1,200):
    x1[i] = Beta1*x1[i-1] + (1-Beta1)*x[i]
    x2[i] = Beta2*x2[i-1] + (1-Beta2)*x[i]
# EWMA with Beta = 0.8: smoother, but lags further behind the trend
plt.plot( x1, 'b', label='EWMA, Beta = 0.8' )
# EWMA with Beta = 0.5: follows the raw signal more closely
plt.plot( x2, 'r', label='EWMA, Beta = 0.5' )
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#savefig( 'ewma_correction.png', fmt='png', dpi=100 )

The LSTM network uses the same principle to control the level of memory or time dependence, but instead of a single controlling parameter, it defines gates whose weights are adjusted during the training phase.
Every LSTM cell contains three gates (the three \(\sigma\)'s in the previous figure):
The first step in the LSTM is to decide what information is going to be thrown away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between 0 and 1 for each number in the cell state \(C_{t-1}\). A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”
The next step is to decide what new information is going to be stored in the cell state. This has two parts. First, a sigmoid layer called the input gate layer decides which values will be updated. Next, a tanh layer creates a vector of new candidate values, \(\tilde{C}_t\), that could be added to the state.
Finally, the cell decides what it is going to output. This output is based on the cell state, but is a filtered version of it. First, an output gate layer (another sigmoid layer) decides which parts of the cell state will be output. Then, the cell state is passed through a tanh function (to push the values between -1 and 1) and multiplied by the output of the gate.
Based on these gates, the state of the cell and output of the cell can be calculated as:
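In the standard LSTM formulation (assumed here, with the notation of the reading listed at the end of this section), the gates are \(f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\), \(i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\) and \(o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\), the candidate values are \(\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)\), the cell state is \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\), and the output is \(h_t = o_t \odot \tanh(C_t)\). A minimal NumPy sketch of one such step follows; the function name, the stacked weight matrix W and the random weights are hypothetical and only serve to illustrate the computation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four stacked pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * n:1 * n])      # forget gate
    i_t = sigmoid(z[1 * n:2 * n])      # input gate
    C_tilde = np.tanh(z[2 * n:3 * n])  # candidate values
    o_t = sigmoid(z[3 * n:4 * n])      # output gate
    C_t = f_t * C_prev + i_t * C_tilde  # new cell state
    h_t = o_t * np.tanh(C_t)            # new output (hidden state)
    return h_t, C_t

# Toy run with random (untrained) weights: 4 hidden units, 1 input feature
rng = np.random.default_rng(0)
n_hidden, n_input = 4, 1
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
h = np.zeros(n_hidden)
C = np.zeros(n_hidden)
for x in [0.1, 0.5, -0.3]:
    h, C = lstm_step(np.array([x]), h, C, W, b)
print("h:", h)
print("C:", C)
```

Because the output is a sigmoid gate times a tanh, every component of h stays strictly between -1 and 1, whereas the cell state C is not bounded in this way.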
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, GRU, SimpleRNN
import math
from sklearn.metrics import mean_squared_error
# This cell works around GPU compatibility problems introduced by the latest update
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
# First, we get the data
dataset = pd.read_csv('local/data/KO_2006-01-01_to_2018-01-01.csv', index_col='Date', parse_dates=['Date'])
Date | Open | High | Low | Close | Volume | Name
2006-01-03 | 20.40 | 20.50 | 20.18 | 20.45 | 13640800 | KO
2006-01-04 | 20.50 | 20.54 | 20.33 | 20.41 | 19993200 | KO
2006-01-05 | 20.36 | 20.56 | 20.29 | 20.51 | 16613400 | KO
2006-01-06 | 20.53 | 20.78 | 20.43 | 20.70 | 17122800 | KO
2006-01-09 | 20.74 | 20.84 | 20.62 | 20.80 | 13819800 | KO
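# Splitting the 'High' column into training (through 2015) and test (2016 onward) sets
training_set = dataset[:'2015'].iloc[:,1:2].values
test_set = dataset['2016':].iloc[:,1:2].values
# Checking for missing values: NaNs in the test set are replaced by the maximum 'High' price
test_set[np.isnan(test_set)] = dataset['High'].max()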
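# We have chosen 'High' attribute for prices. Let's see what it looks like
# (the two plot calls are not shown in the source; they are assumed from the legend labels)
dataset['High'][:'2015'].plot(legend=True)
dataset['High']['2016':].plot(legend=True)
plt.legend(['Training set (Before 2016)','Test set (2016 and beyond)'])
plt.title('KO stock price')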

# Scaling the training set
sc = MinMaxScaler(feature_range=(0,1))
training_set_scaled = sc.fit_transform(training_set)
from local.lib.DataPreparationRNN import create_dataset
look_back = 10
X_train, y_train = create_dataset(training_set_scaled, look_back)
(2507, 10)
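The helper create_dataset comes from the course's local library and its code is not shown here. The following is a hypothetical stand-in, assuming it builds sliding windows of look_back consecutive values and uses the value that follows each window as the target; the name create_dataset_sketch and the toy series are illustrative only.

```python
import numpy as np

def create_dataset_sketch(series, look_back=1):
    """Hypothetical stand-in for create_dataset: sliding windows and next-step targets."""
    series = np.asarray(series)
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back, 0])  # window of look_back past values
        y.append(series[i + look_back, 0])    # value right after the window
    return np.array(X), np.array(y)

# Toy series 0, 1, ..., 14 stored as a column, like the scaled training set
toy = np.arange(15, dtype=float).reshape(-1, 1)
X_toy, y_toy = create_dataset_sketch(toy, look_back=10)
print(X_toy.shape, y_toy.shape)
print(X_toy[0], "->", y_toy[0])
```

Under this convention a series with N rows yields N - look_back samples, each with look_back columns. Before being passed to a recurrent layer, such an array still needs a trailing feature axis, that is, a shape of (samples, look_back, 1).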
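# The RNN architecture
model = Sequential()
# First RNN layer with Dropout regularisation (the rate 0.2 is an assumption; it is not shown in the source)
model.add(SimpleRNN(units=50, input_shape=(X_train.shape[1],1)))
model.add(Dropout(0.2))
# The output layer
model.add(Dense(units=1))
model.summary()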
Model: "sequential"
Layer (type) Output Shape Param #
simple_rnn (SimpleRNN) (None, 50) 2600
dropout (Dropout) (None, 50) 0
dense (Dense) (None, 1) 51
Total params: 2,651
Trainable params: 2,651
Non-trainable params: 0
Let’s recall what an RNN can do:
# Compiling the RNN
# Fitting to the training set
<tensorflow.python.keras.callbacks.History at 0x7fedb4135d90>
dataset_total = pd.concat((dataset["High"][:'2016'],dataset["High"]['2017':]),axis=0)
inputs = dataset_total[len(dataset_total)-len(test_set) - look_back:].values
inputs[np.isnan(inputs)] = dataset['High'].max()
inputs = inputs.reshape(-1,1)
inputs = sc.transform(inputs)
(513, 1)
# Preparing X_test and predicting the prices
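X_test = []
for i in range(look_back,inputs.shape[0]):
    # each sample is the window of look_back scaled values preceding step i
    X_test.append(inputs[i-look_back:i,0])
X_test = np.array(X_test)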
X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))
predicted_stock_price = model.predict(X_test)
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
# Visualizing the results
plt.plot(test_set, color='red',label='Real KO Stock Price')
plt.plot(predicted_stock_price, color='blue',label='Predicted KO Stock Price')
plt.title('KO Stock Price Prediction(RNN)')
plt.ylabel('KO Stock Price')
plt.legend()
plt.show()

# Evaluating our model
import math
from sklearn.metrics import mean_squared_error
rmse = math.sqrt(mean_squared_error(test_set, predicted_stock_price))
print("The root mean squared error is {}.".format(rmse))
The root mean squared error is 0.9446786845179417.
Now using an LSTM:
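# The LSTM architecture
regressor = Sequential()
# First LSTM layer with Dropout regularisation (the rate 0.2 is an assumption; it is not shown in the source)
regressor.add(LSTM(units=50, input_shape=(X_train.shape[1],1)))
regressor.add(Dropout(0.2))
# The output layer
regressor.add(Dense(units=1))
regressor.summary()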
Model: "sequential_1"
Layer (type) Output Shape Param #
lstm (LSTM) (None, 50) 10400
dropout_1 (Dropout) (None, 50) 0
dense_1 (Dense) (None, 1) 51
Total params: 10,451
Trainable params: 10,451
Non-trainable params: 0
# Compiling the RNN
# Fitting to the training set
<tensorflow.python.keras.callbacks.History at 0x7fed80592250>
predicted_stock_price = regressor.predict(X_test)
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
# Visualizing the results
plt.plot(test_set, color='red',label='Real KO Stock Price')
plt.plot(predicted_stock_price, color='blue',label='Predicted KO Stock Price')
plt.title('KO Stock Price Prediction(LSTM)')
plt.ylabel('KO Stock Price')
plt.legend()
plt.show()

# Evaluating our model
rmse = math.sqrt(mean_squared_error(test_set, predicted_stock_price))
print("The root mean squared error is {}.".format(rmse))
The root mean squared error is 0.7668002979683394.
Gated Recurrent Units#
Unlike the LSTM unit, the GRU does not use a separate memory cell to control the flow of information: it exposes its whole hidden state directly, without an output gate. GRUs have fewer parameters and thus may train a bit faster or need less data to generalize. With large amounts of data, however, the more expressive LSTMs may lead to better results.
Here \(r\) is a reset gate, and \(z\) is an update gate. Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around. If the reset gate is set to all 1’s and the update gate to all 0’s, the unit reduces to the vanilla RNN model.
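The role of the two gates can be checked with a small NumPy sketch. It assumes the common GRU formulation in which the new state is \(h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t\), with \(\tilde{h}_t = \tanh(W_h x_t + U_h (r \odot h_{t-1}) + b_h)\); the parameter names and the random weights are hypothetical. Forcing the reset gate to 1 and the update gate to 0 (through large biases) should reproduce a plain RNN step.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict of hypothetical weight matrices and biases."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # z controls how much of the previous memory is kept
    return z * h_prev + (1.0 - z) * h_tilde

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 1
p = {}
for g in "rzh":
    p["W_" + g] = rng.normal(scale=0.5, size=(n_hidden, n_input))
    p["U_" + g] = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    p["b_" + g] = np.zeros(n_hidden)

# Copy of the parameters with the gates forced to r ~ 1 and z ~ 0
p_rnn = dict(p)
for g, bias in (("r", 50.0), ("z", -50.0)):
    p_rnn["W_" + g] = np.zeros((n_hidden, n_input))
    p_rnn["U_" + g] = np.zeros((n_hidden, n_hidden))
    p_rnn["b_" + g] = np.full(n_hidden, bias)

x_t = np.array([0.3])
h_prev = rng.normal(size=n_hidden)
h_gru = gru_step(x_t, h_prev, p_rnn)
# Plain RNN step with the same candidate weights
h_rnn = np.tanh(p["W_h"] @ x_t + p["U_h"] @ h_prev + p["b_h"])
print(np.allclose(h_gru, h_rnn))
```

Note that some references swap the roles of \(z\) and \(1 - z\); with that convention the vanilla RNN is recovered with the update gate at all 1’s instead.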
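# The GRU architecture
regressor2 = Sequential()
# First GRU layer with Dropout regularisation (the rate 0.2 is an assumption; it is not shown in the source)
regressor2.add(GRU(units=50, input_shape=(X_train.shape[1],1)))
regressor2.add(Dropout(0.2))
# The output layer
regressor2.add(Dense(units=1))
regressor2.summary()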
Model: "sequential_3"
Layer (type) Output Shape Param #
gru (GRU) (None, 50) 7950
dropout_3 (Dropout) (None, 50) 0
dense_3 (Dense) (None, 1) 51
Total params: 8,001
Trainable params: 8,001
Non-trainable params: 0
# Compiling the RNN
# Fitting to the training set
<tensorflow.python.keras.callbacks.History at 0x7fed302f34d0>
Note that every epoch runs a little faster than in the LSTM model.
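The shorter epochs are consistent with the parameter counts reported in the three model summaries. As a quick check, the counts can be reproduced by hand, assuming 50 units, one input feature and, for the GRU, the Keras default reset_after=True (which keeps separate input and recurrent biases); the helper names below are hypothetical.

```python
def simple_rnn_params(units, input_dim):
    # one block: input weights + recurrent weights + bias
    return units * (units + input_dim) + units

def lstm_params(units, input_dim):
    # four blocks: forget gate, input gate, candidate values, output gate
    return 4 * (units * (units + input_dim) + units)

def gru_params(units, input_dim):
    # three blocks: update gate, reset gate, candidate state;
    # assumes reset_after=True, i.e. two bias vectors per block
    return 3 * (units * (units + input_dim) + 2 * units)

units, input_dim = 50, 1
counts = {
    "SimpleRNN": simple_rnn_params(units, input_dim),
    "LSTM": lstm_params(units, input_dim),
    "GRU": gru_params(units, input_dim),
}
for name, value in counts.items():
    print(name, value)
```

These values match the recurrent-layer rows of the summaries above (2600, 10400 and 7950): the GRU has three weight blocks where the LSTM has four.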
# Predicting with the GRU model (regressor2)
predicted_stock_price = regressor2.predict(X_test)
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
# Visualizing the results
plt.plot(test_set, color='red',label='Real KO Stock Price')
plt.plot(predicted_stock_price, color='blue',label='Predicted KO Stock Price')
plt.title('KO Stock Price Prediction(GRU)')
plt.ylabel('KO Stock Price')
plt.legend()
plt.show()

# Evaluating our model
rmse = math.sqrt(mean_squared_error(test_set, predicted_stock_price))
print("The root mean squared error is {}.".format(rmse))
The value originally printed here, 0.7668002979683394, is identical to the LSTM result because the prediction was computed with regressor (the LSTM model) instead of regressor2. Rerun the cell with the GRU model to obtain its own error.
Interesting readings:
Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/