What is the difference between train, validation and out of time validation data?

Time:02-15

I have a dataset of 6 lakh (600,000) records, and I am required to divide it 70% / 15% / 15%, where 70% of the data should be train data, 15% should be validation data, and the remaining 15% should be out-of-time validation data.

So far I am familiar only with train and test data: I train the model on the train data and test it on the test data.

How can I split the data into these three parts? And after splitting, how do I test the model's performance, since I will have three datasets?

The splitting should be based on stratified sampling.

CodePudding user response:

You train the model on the training portion of the dataset. That part is easy.

On the validation dataset, you are expected to optimise the hyperparameters of the model, the feature extraction, the preprocessing, etc.

Then, when you are happy with the pipeline, you might refit the model on the combined training and validation datasets and run the final performance test on the last 15% of the data (the test batch). This ensures that the same data is not used both for optimising the pipeline's hyperparameters and for the final test.
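As a rough illustration of the 70/15/15 stratified split asked about in the question, here is a minimal sketch using only the Python standard library. The function name `stratified_three_way_split` and the toy labels are made up for the example; with scikit-learn you could get a similar result by calling `train_test_split` twice with its `stratify` argument. Note that this splits at random within each class, so it assumes there is no time ordering you need to respect.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, validation, test) index lists, keeping the class
    proportions roughly equal in all three parts.

    Hypothetical helper written for this example, not a library function.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        # Whatever is left over goes to the final test part.
        test.extend(indices[n_train + n_val:])
    return train, val, test


# Toy labels (assumed for illustration): 80% class 0, 20% class 1.
labels = [0] * 800 + [1] * 200
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

You would then fit on the rows in `train_idx`, tune on `val_idx`, and report the final score once on `test_idx`.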

If you are more ambitious and there is no strong temporal component in the data, you might want to perform K-fold cross-validation or even nested K-fold. The latter link has a more detailed explanation on the matter of cross-validation.
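To make the K-fold idea concrete, here is a small sketch of stratified K-fold index generation, again in plain Python. The name `stratified_k_fold` and the toy labels are assumptions for the example; scikit-learn ships a `StratifiedKFold` class that covers the same ground. Each fold serves once as the held-out part while the other folds are used for fitting.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (training_indices, held_out_indices) pairs for k folds,
    spreading each class evenly across the folds.

    Hypothetical helper written for this example, not a library function.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin into the folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


# Toy labels (assumed for illustration): 80 rows of class 0, 20 of class 1.
labels = [0] * 80 + [1] * 20
splits = list(stratified_k_fold(labels, k=5))
for training, held_out in splits:
    print(len(training), len(held_out))
```

In practice you would fit the pipeline on `training`, score it on `held_out`, and average the scores over the folds.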
