Best practice for train, validation and test set

Time:09-05

I want to assign a sample class to each instance in a dataframe: 'train', 'validation', or 'test'. If I use sklearn's train_test_split() twice, I can get the indices for train, validation and test sets like this:

X = df.drop(['target'], axis=1)
y = df[['target']]

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, df.index,
    test_size=0.2,
    random_state=10,
    stratify=y,
    shuffle=True)
df_ = df.loc[indices_train]  # .loc, because these are index labels, not positions

X_ = df_.drop(['target'], axis=1)
y_ = df_[['target']]

X_train, X_val, y_train, y_val, indices_train, indices_val = train_test_split(
    X_, y_, df_.index,
    test_size=0.15,
    random_state=10,
    stratify=y_,
    shuffle=True)

df['sample']=['train' if i in indices_train else 'test' if i in indices_test else 'val' for i in df.index]
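One detail worth checking with chained splits like this: the two test_size values multiply, so test_size=0.2 followed by test_size=0.15 gives roughly 68% train, 12% validation and 20% test, not a round 70/15/15. A minimal sketch of the arithmetic (a dummy index of 100 rows is assumed):

```python
from sklearn.model_selection import train_test_split

# Chained splits: 0.8 * 0.85 = 0.68 of the data ends up in train
n = 100
idx = list(range(n))
train_val, test = train_test_split(idx, test_size=0.2, random_state=10)
train, val = train_test_split(train_val, test_size=0.15, random_state=10)
print(len(train), len(val), len(test))  # 68 12 20
```

If you want exact target fractions for all three sets, choose the second test_size as val_fraction / (1 - test_fraction), e.g. 0.15 / 0.8 = 0.1875 to get a 65/15/20 split.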

What is best practice to get train, validation and test sets? Are there any problems with my approach above, and can it be phrased better?

CodePudding user response:

If the dataset is large, a faster option is to do the split with NumPy:

How to split data into 3 sets (train, validation and test)?
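For illustration, a minimal NumPy-based sketch along the lines of that linked question: shuffle the row positions once with numpy.random.permutation, cut the permutation at the desired fractions, and label rows by position. The dummy dataframe and the 70/15/15 fractions here are assumptions, and note this approach does not stratify by the target.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy dataframe standing in for the real one
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=100),
                   "target": rng.integers(0, 2, size=100)})

# Shuffle row positions once, then cut at the 70% and 85% marks -> 70/15/15
perm = rng.permutation(len(df))
cut1, cut2 = int(0.70 * len(df)), int(0.85 * len(df))
train_pos, val_pos, test_pos = np.split(perm, [cut1, cut2])

# Label each row by where its position landed in the permutation
labels = np.empty(len(df), dtype=object)
labels[train_pos] = 'train'
labels[val_pos] = 'val'
labels[test_pos] = 'test'
df['sample'] = labels

print(df['sample'].value_counts())  # train 70, val 15, test 15
```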

Or the simpler way is your own solution, but just feed the X_train and y_train you obtained in the first step into the train/validation split; storing the indices and re-slicing the dataframe feels unnecessary.
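A sketch of that simplification (the dummy dataframe and seeds are assumptions): train_test_split preserves the pandas index, so the second split can take X_train/y_train directly, and the 'sample' column can be filled with .loc lookups instead of a membership test per row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dummy dataframe standing in for the real one
rng = np.random.default_rng(10)
df = pd.DataFrame({"feature": rng.normal(size=100),
                   "target": rng.integers(0, 2, size=100)})

X = df.drop(['target'], axis=1)
y = df['target']

# First split: carve off the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y, shuffle=True)

# Second split: reuse X_train/y_train directly, no rebuilt dataframe needed
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15, random_state=10, stratify=y_train, shuffle=True)

# The splits keep the original index, so labelling is a .loc lookup
df['sample'] = 'train'
df.loc[X_val.index, 'sample'] = 'val'
df.loc[X_test.index, 'sample'] = 'test'
print(df['sample'].value_counts())
```

With test_size=0.2 followed by 0.15 on 100 rows, this yields 68 train, 12 validation and 20 test points.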

CodePudding user response:

So, I made a dummy dataset of 100 points. I separated the data and did the first split:

X = df.drop('target', axis=1)
y = df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

If you have a look, my test size is 0.3, which means 70 data points go to training and 30 are left over for test and validation.

X_train.shape # Output (70, 3)
X_test.shape # Output (30, 3)

Now you need to split again for validation, so you can do it like this:

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)

Notice how I name the groups, and that test_size is now 0.5: I take the 30 held-out points and split them in half between validation and test. So the shapes of validation and test will be:

X_val.shape # Output (15, 3)
X_test.shape # Output (15, 3)

At the end you have 70 points for training, 15 for testing and 15 for validation. Now, think of validation as a "double check" of your training. There are a lot of messy concepts related to it, but essentially it is there to make sure your training generalizes.
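The two calls above can be put together as one runnable sketch (the 100-point, 3-feature dummy dataset and the random_state values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-point, 3-feature dummy dataset
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
df['target'] = rng.integers(0, 2, size=100)

X = df.drop('target', axis=1)
y = df['target']

# 70% train, 30% held out
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Split the held-out 30% in half: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.5, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)  # (70, 3) (15, 3) (15, 3)
```

train_test_split shuffles by default, so the two halves of the 30 held-out points are themselves randomly assigned to validation and test.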
