
I want to predict values that are only weakly predictable (low SNR). I need to predict the whole time series of a year, formed by the weeks of the year (52 values, Figure 1).

Figure 1: Year time series by week

My first idea was to develop a many-to-many LSTM model (Figure 2) using Keras on top of TensorFlow. I'm training the model with 52 inputs (the time series of the previous year) and 52 predicted outputs (the time series of the next year). The shape of train_X is (X_examples, 52, 1); in other words, X_examples to train on, each with 52 timesteps of 1 feature. I understand that Keras will consider the 52 inputs as a time series of the same domain. The shape of train_Y is the same: (y_examples, 52, 1). I added a TimeDistributed layer. My thought was that the algorithm would predict the values as a time series instead of isolated values (am I correct?)

The model's code in Keras is:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

# reshape to (examples, 52 timesteps, 1 feature)
X = X.reshape(X.shape[0], 52, 1)
y = y.reshape(y.shape[0], 52, 1)
# design network
model = Sequential()
model.add(LSTM(n_neurons, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(TimeDistributed(Dense(1)))  # one output value per timestep
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
model.fit(X, y, epochs=n_epochs, batch_size=n_batch, verbose=2)

Figure 2: Many-to-many LSTM architecture

The problem is that the algorithm is not learning from the examples. It predicts values very similar to the input values. Am I modeling the problem correctly?

Second question: Another idea is to train the algorithm with 1 input and 1 output, but then, at test time, how will I predict the whole 2015 time series without looking at the '1 input'? The test data would have a different shape than the training data.

Veltzer Doron
Lucas Brito
  • How many training examples do you have? – Imran Dec 02 '17 at 19:14
  • I have data from 10 years. If my training dataset is values from 4 weeks used to predict the 5th, and I keep shifting, I can have almost 52 × 9 examples to train the model and 52 to predict (the last year) – Lucas Brito Dec 03 '17 at 00:37
  • You do expect to have a large amount of error as you move closer to week 52 in your prediction, correct? If this type of forecasting could be so easily and accurately done with LSTMs, we would never use any other method – DJK Dec 13 '17 at 00:15

3 Answers


I share the concerns about having too little data, but you can do this as follows.

It's a good idea to keep your values between -1 and +1, so I'd normalize them first.
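
For example, a minimal sketch of scaling everything to [-1, +1] with scikit-learn (the array name `data` and its (samples, 52, 1) shape are assumptions; fit the scaler on training data only and keep it so you can invert the predictions later):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
# data is assumed to have shape (samples, 52, 1); scale it as one flat column
data = scaler.fit_transform(data.reshape(-1, 1)).reshape(data.shape)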

For the LSTM model, you must make sure you're using return_sequences=True.
There is nothing "wrong" with your model, but it may need more or fewer layers or units to achieve what you desire. (There is no clear answer to this, though.)

Training the model to predict the next step:

All you need is to pass y as X shifted one step ahead:

entireData = arrayWithShape((samples, 52, 1))  # pseudocode: your full dataset, one year per sample
X = entireData[:, :-1, :]   # steps 1..51
y = entireData[:, 1:, :]    # steps 2..52 (the same series shifted one step ahead)

Train the model using these.
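
For completeness, a minimal sketch of that training step, reusing the architecture from the question (the hyperparameters n_neurons, n_epochs and n_batch are placeholders):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
# 51 timesteps, because X is the 52-step series minus its last step
model.add(LSTM(n_neurons, input_shape=(51, 1), return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=n_epochs, batch_size=n_batch, verbose=2)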

Predicting the future:

Now, for predicting the future, since we need to use predicted elements as input for more predicted elements, we are going to use a loop and make the model stateful=True.

Create a model equal to the previous one, with these changes:

  • All LSTM layers must have stateful=True
  • The batch input shape must be (batch_size,None, 1) - This allows variable lengths
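
A minimal sketch of what that predicting model could look like, assuming the single-LSTM architecture from the question (`n_neurons` and `batch_size` are placeholders; `batch_size` must match the number of sequences you pass to predict()):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

newModel = Sequential()
newModel.add(LSTM(n_neurons, batch_input_shape=(batch_size, None, 1),
                  stateful=True, return_sequences=True))
newModel.add(TimeDistributed(Dense(1)))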

Copy the weights of the previously trained model:

newModel.set_weights(oldModel.get_weights())

Predict only one sample at a time and never forget to call model.reset_states() before starting any sequence.

First, predict with the sequence you already know (this makes sure the model prepares its states properly for predicting the future):

model.reset_states()
predictions = model.predict(entireData)

Because of the way we trained, the last step in predictions will be the first future element:

futureElement = predictions[:,-1:,:]  # shape (samples, 1, 1)

futureElements = []
futureElements.append(futureElement)

Now we make a loop where this element is the input. (Because the model is stateful, it will understand this as a new step of the previous sequence rather than as a new sequence.)

for i in range(howManyPredictions):
    # feed the latest prediction back in as the next single-step input
    futureElement = model.predict(futureElement)
    futureElements.append(futureElement)
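
To assemble the full forecast, the collected one-step predictions can then be joined along the time axis (a small sketch; `futureElements` is the list built above, `future` is just a new name for the result):

import numpy as np

# each element has shape (samples, 1, 1); the result is (samples, howManyPredictions + 1, 1)
future = np.concatenate(futureElements, axis=1)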

This link contains a complete example predicting the future of two features: https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb

Daniel Möller
  • you also need to set batch_size for the Keras recurrent layer to store the state memory, or use batch_input_shape=(x,y,z) (instead of input_shape=(x,y), batch = z) – Veltzer Doron Dec 18 '18 at 16:45
  • @DanielMöller Does `stateful=True` make the output of the previous time step get fed as an input to the next time step? Are there any details that I need to know about this statefulness? – ajaysinghnegi Dec 27 '18 at 02:03
  • 1
    The output of the previous step is **always** an input to the next step. Stateful doesn't change **anything** in how the layer works, except that it allows dividing "one sequence in many batches". You decide manually when the sequence is ended (instead of letting the system consider every batch as containing entire sequences). --- See more: https://stackoverflow.com/questions/38714959/understanding-keras-lstms/50235563#50235563 – Daniel Möller Dec 27 '18 at 02:08
  • @DanielMöller I’m asking this because the answer says that - There are 2 ways specified in the `achieving one to many` section where it’s written that we can use a `stateful=True` to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features). – ajaysinghnegi Dec 27 '18 at 15:11
  • @DanielMöller Also, it reads `Stateful allows us to input "parts" of the sequences in stages`. So, does this statement also refer to the `window` concept that you mentioned in the answer, where a large sequence is divided into different independent samples? If yes, then doesn't this only apply to cases where we have very large sequences and want to eliminate the possibility of learning such large sequences? – ajaysinghnegi Dec 27 '18 at 15:12
  • @DanielMöller Great summary, the only thing missing is that we want to reset the states and set the last states from previous predictions using `reset_states(states=[state_h, state_c])`, where `[predictions[1], predictions[2]]=[state_h, state_c]` and `futureElement = predictions[0][:,-1:,:]`. I haven't tested it, but I think that's how it's supposed to work? Please comment :) – GRS Jan 07 '19 at 14:07
  • @DanielMöller Additionally, it would be great to know how to set states for each layer independently (e.g. when LSTMs are stacked). – GRS Jan 07 '19 at 14:09
  • @DanielMöller, in the link you posted for the full example (great work btw!), you added the following layer: model.add(Lambda(lambda x: x*1.3)). However, aren't you now using knowledge of how the training and test data came to be, and as such, this layer should not be allowed to be included? – Riley Jun 14 '19 at 13:17
  • @DanielMöller, and actually, could your example with sines be extended with a "look_back" option, where you look for patterns in the "look_back" previous time slots? – Riley Jun 14 '19 at 13:49
  • Hi, I still have one doubt regarding your answer. Why would you train a network without stateful=True if you will use its weights in a stateful one afterwards? Is it because the training data is composed of multiple sequences which have nothing to do with each other, and it is simpler than "manual" training with multiple calls to `reset_states`? – linSESH Jul 23 '19 at 12:37
  • Yes, training without `stateful=True` is way easier, with free batch size. – Daniel Möller Jul 23 '19 at 12:56
  • So if I understand correctly, if my training set is a very long sequence, I'm better off training my network without the stateful parameter, because I can use bigger batches instead of updating the weights at each small subsequence and thus performing SGD? I don't know if I'm clear – linSESH Jul 23 '19 at 16:23
  • It's an option. You can use everything at your own convenience. I don't like `stateful=True` because of all the trouble and care, I only use it when strictly necessary. – Daniel Möller Jul 23 '19 at 16:25
  • Thanks for the help and the quick answers – linSESH Jul 23 '19 at 16:31
  • About the performance of using small batches versus big batches, there might be some difference indeed, but I can't say which is better, and that's probably data dependent. – Daniel Möller Jul 23 '19 at 16:37
  • Won't your "greedy" prediction algorithm generate implausible forecasts? Suppose you start at A, and you can select either A, which is likely, or B, which is unlikely. The algorithm selects A and then A again and so on, yielding AAAA..., while a more probable sequence would have been AAABAABAAAAABAAA. – Björn Lindqvist May 28 '20 at 13:35
  • It's a stateful LSTM, it's not blind to the past. – Daniel Möller May 30 '20 at 03:20

I have data from 10 years. If my training dataset is values from 4 weeks used to predict the 5th, and I keep shifting, I can have almost 52 × 9 examples to train the model and 52 to predict (the last year)

This actually means you have only 9 training examples with 52 features each (unless you want to train on highly overlapping input data). Either way, I don't think this is nearly enough to merit training an LSTM.

I would suggest trying a much simpler model. Your input and output data is of fixed size, so you could try sklearn.linear_model.LinearRegression which handles multiple input features (in your case 52) per training example, and multiple targets (also 52).
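
A minimal sketch of that baseline (the names `X_train`, `y_train` and `last_year` are hypothetical; both inputs and targets are shaped (n_examples, 52)):

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)                           # (n_examples, 52) -> (n_examples, 52)
next_year = reg.predict(last_year.reshape(1, 52))   # predict 52 weeks from the previous 52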

Update: If you must use an LSTM then take a look at LSTM Neural Network for Time Series Prediction, a Keras LSTM implementation which supports multiple future predictions all at once or iteratively by feeding each prediction back in as input. Based on your comments this should be exactly what you want.

The architecture of the network in this implementation is:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation

# `layers` is a list of layer sizes used by that implementation
model = Sequential()

model.add(LSTM(
    input_shape=(layers[1], layers[0]),
    output_dim=layers[1],       # output_dim is the old Keras 1 name for `units`
    return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(
    layers[2],
    return_sequences=False))
model.add(Dropout(0.2))

model.add(Dense(
    output_dim=layers[3]))
model.add(Activation("linear"))

However, I would still recommend running a linear regression, or maybe a simple feed-forward network with one hidden layer, and comparing its accuracy with the LSTM. Especially if you are predicting one output at a time and feeding it back in as input, your errors could easily accumulate, giving you very bad predictions further on.
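
For comparison, a sketch of such a one-hidden-layer feed-forward baseline in Keras (the 64 hidden units are an arbitrary placeholder):

from keras.models import Sequential
from keras.layers import Dense

mlp = Sequential()
mlp.add(Dense(64, activation='relu', input_shape=(52,)))  # 52 weekly inputs
mlp.add(Dense(52))                                        # one output per week of the next year
mlp.compile(loss='mean_squared_error', optimizer='adam')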

Imran
  • OK, but what if I still want to use an LSTM? Is there a way to train the model where the output is 1 step forward, but then predict many steps forward? – Lucas Brito Dec 03 '17 at 01:43
  • Sorry I don't understand. The "output" of a model is exactly what you are trying to predict – Imran Dec 03 '17 at 01:58
  • OK, I thought about it some more and I think I understand what you are asking. You can build a network that predicts one step ahead, and then feed the prediction back in with the next input. In this case your input moves over a sliding window and inputs overlap with each other. See [this](https://github.com/jaungiers/LSTM-Neural-Network-for-Time-Series-Prediction) implementation. – Imran Dec 03 '17 at 09:36
  • What is the intuition behind Dropout? – Lucas Brito Dec 13 '17 at 01:17
  • Dropout is a regularization technique that helps your model generalize. You don't necessarily need it unless you are overfitting your training data. – Imran Dec 13 '17 at 01:53

I'd like to add something regarding this part of the question

I added a TimeDistributed layer. My thought was that the algorithm would predict the values as a time series instead of isolated values (am I correct?)

as I myself had quite a hard time understanding the functionality behind the Keras TimeDistributed layer.

I'd argue that your motivation not to isolate the calculations for a time series prediction is right. When predicting the future shape of the series, you specifically do want to capture the characteristics and interdependencies of the whole series taken together.

However, that's exactly the opposite of what the TimeDistributed layer is for: it isolates the calculations on each timestep. Why is this useful, you might ask? For completely different tasks, e.g. sequence labelling, where you have a sequential input (i1, i2, i3, ..., i_n) and aim at outputting the labels (label1, label2, label1, ..., label2) for each timestep separately.
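
As an illustration, a minimal sketch of such a sequence-labelling setup (`n_features`, `n_labels` and the 32 units are placeholders; the same Dense layer is applied to every timestep independently):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

labeller = Sequential()
labeller.add(LSTM(32, input_shape=(None, n_features), return_sequences=True))
labeller.add(TimeDistributed(Dense(n_labels, activation='softmax')))  # one label prediction per timestep
labeller.compile(loss='categorical_crossentropy', optimizer='adam')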

Imho the best explanation can be found in this post and in the Keras Documentation.

For this reason, I'd claim that, against all intuition, adding a TimeDistributed layer is likely never a good idea for time series prediction. I'm open and happy to hear other opinions about that!

Boppity Bop