I am trying to build an LSTM model to predict stock prices. I have split the dataset into a training set and a testing set, and I pass the testing set to model.fit() as the validation_data argument. Then I put the same testing set into model.predict() to generate the predicted trend.
I am wondering: if I pass the testing data as validation data in model.fit(), would overfitting occur when I later use that same data to generate the prediction? Should I split the raw data into 3 sets instead - training, validation, and testing - where the validation data goes into model.fit() and the testing data goes into model.predict()?
Sample Code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model_lstm = Sequential()
model_lstm.add(LSTM(50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
model_lstm.add(LSTM(units=50, return_sequences=True))
model_lstm.add(LSTM(units=50, return_sequences=True))
model_lstm.add(LSTM(units=50))
model_lstm.add(Dense(units=1, activation='relu'))
model_lstm.compile(loss='mse', optimizer='adam')
model_lstm.summary()

# X_test/y_test are currently reused as both validation and test data
history_lstm = model_lstm.fit(X_train,
                              y_train,
                              validation_data=(X_test, y_test),
                              epochs=10,
                              batch_size=32,
                              shuffle=False)
Answer:
Usually, you would split the data into 3 sets:
- train set: used to train the model
- validation set: used for frequent evaluation of the model and for fine-tuning hyper-parameters. It must not be used for training, so that the evaluation stays as unbiased as possible.
- test set: the final set, used only for the final evaluation of the model.
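For time-series data like stock prices, the split should also be chronological (no shuffling), so each set covers a later period than the one before it. A minimal sketch of a three-way split, assuming X and y are time-ordered NumPy arrays (the toy arrays and the 70/15/15 ratio are just illustrative choices, not a recommendation):

```python
import numpy as np

# Toy stand-ins for time-ordered features and targets
X = np.arange(100).reshape(100, 1)
y = np.arange(100)

n = len(X)
train_end = int(n * 0.70)   # first 70% -> training
val_end = int(n * 0.85)     # next 15% -> validation

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]   # last 15% -> held-out test

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

With this split, model.fit() would receive validation_data=(X_val, y_val), and X_test would only ever be touched by model.predict() and the final evaluation.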
As indicated by the name of the argument (validation_data), you are supposed to put the validation set there.
As you suspected, letting the model "validate" its hyper-parameters against the test set can lead to overfitting on that set, so the final evaluation would no longer be trustworthy.
As for the ratio: the more hyper-parameters your model has, the bigger the validation set should be. Also look into cross-validation; it helps when the training set is too small to spare a large chunk for validation without hurting performance.
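For the cross-validation idea, note that ordinary shuffled k-fold would leak future prices into training. A walk-forward scheme avoids this; scikit-learn's TimeSeriesSplit is one such tool - each fold trains on an initial segment and validates on the segment that immediately follows. A hedged sketch (the array shapes and n_splits value are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy time-ordered data
X = np.arange(100).reshape(100, 1)
y = np.arange(100)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Training indices always precede validation indices: no look-ahead leakage
    assert train_idx.max() < val_idx.min()
    print(f"fold {fold}: train up to t={train_idx.max()}, "
          f"validate t={val_idx.min()}..{val_idx.max()}")
```

Inside each fold you would re-fit the model on the training indices and score it on the validation indices, still keeping a final chronological test set untouched.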