I want to assign a sample class to each instance in a dataframe: 'train', 'validation' or 'test'. If I call sklearn's train_test_split() twice, I can get the indices for a train, validation and test set like this:
X = df.drop(['target'], axis=1)
y=df[['target']]
X_train, X_test, y_train, y_test, indices_train, indices_test=train_test_split(X, y, df.index,
test_size=0.2,
random_state=10,
stratify=y,
shuffle=True)
df_=df.loc[indices_train]  # indices_train holds index labels, so select with .loc
X_ = df_.drop(['target'], axis=1)
y_=df_[['target']]
X_train, X_val, y_train, y_val, indices_train, indices_val=train_test_split(X_, y_, df_.index,
test_size=0.15,
random_state=10,
stratify=y_,
shuffle=True)
df['sample']=['train' if i in indices_train else 'test' if i in indices_test else 'val' for i in df.index]
What is best practice to get a train, validation and test set? Are there any problems with my approach above, and can it be phrased better?
CodePudding user response:
A faster option, if the dataset is large, would be to use numpy, as described in:
How to split data into 3 sets (train, validation and test)?
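A minimal sketch of what a numpy-based split could look like, assuming a dummy dataframe (the column names, the seed and the 70/15/15 proportions are made up for illustration). Note that, unlike the question's code, this does not stratify on the target:

```python
import numpy as np
import pandas as pd

# Hypothetical dummy data: 100 rows, 3 features and a binary target.
rng = np.random.default_rng(10)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["target"] = rng.integers(0, 2, size=100)

# Shuffle the index labels once, then cut the shuffled array at 70% and 85%.
shuffled = rng.permutation(df.index.to_numpy())
n = len(shuffled)
train_idx, val_idx, test_idx = np.split(shuffled, [n * 70 // 100, n * 85 // 100])

# Label every row by the part its index label landed in.
df["sample"] = "train"
df.loc[val_idx, "sample"] = "val"
df.loc[test_idx, "sample"] = "test"
```

The cut points are computed with integer arithmetic so the part sizes do not depend on floating-point rounding.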
Or, more simply, keep your solution but feed the X_train and y_train you obtained in the first step straight into the train/validation split. Storing the indices and then selecting the rows from the df again feels unnecessary.
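Roughly, that simplification could look like the sketch below. The dummy dataframe is an assumption standing in for the question's df; the split parameters are the ones from the question. Because train_test_split keeps the original index labels on the pieces it returns, the labels can be written back with .loc:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dummy data standing in for the question's df.
rng = np.random.default_rng(10)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df["target"] = np.tile([0, 1], 100)

X = df.drop(columns=["target"])
y = df["target"]

# First split: hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)
# Second split: carve 15% of the remaining rows off as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15, random_state=10, stratify=y_train
)

# The pieces keep the original index labels, so they can label df directly.
df["sample"] = "train"
df.loc[X_val.index, "sample"] = "val"
df.loc[X_test.index, "sample"] = "test"
```

Assigning through .loc also avoids the per-row membership checks of the list comprehension in the question.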
CodePudding user response:
So, I made a dummy dataset of 100 points. I separated the features from the target and did the first split:
from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Note that my test size is 0.3, which means 70 data points go to training and the remaining 30 are shared between test and validation.
X_train.shape # Output (70, 3)
X_test.shape # Output (30, 3)
Now you need to split again for validation, so you can do it like this:
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)
Notice how I name the groups, and that test_size is now 0.5. This means I take the 30 held-out points and split them evenly between validation and test. So the shapes of the validation and test sets will be:
X_val.shape # Output (15, 3)
X_test.shape # Output (15, 3)
In the end you have 70 points for training, 15 for testing and 15 for validation. Now, think of validation as a "double check" of your training. There are a lot of messy concepts related to it, but its purpose is simply to make sure your training went well.
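The arithmetic behind the two test_size values can be wrapped in a small function. This is only a sketch: three_way_split is a hypothetical helper, not part of sklearn, and the dummy data and the seed are made up. The first split holds out val_frac + test_frac of the rows, and the second split gives test_frac / (val_frac + test_frac) of those held-out rows to the test set, which is 0.15 / 0.30 = 0.5 here:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.15, test_frac=0.15, random_state=10):
    """Hypothetical helper: split into train/val/test by overall fractions."""
    holdout = val_frac + test_frac
    # First split: keep (1 - holdout) of the rows for training.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout, random_state=random_state
    )
    # Second split: the share of the held-out rows that becomes the test set.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_frac / holdout, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Dummy dataset of 100 points with 3 features, as in the answer above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["target"] = rng.integers(0, 2, size=100)

X = df.drop("target", axis=1)
y = df["target"]
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)
print(X_train.shape, X_val.shape, X_test.shape)
```

Fixing random_state makes the split reproducible; passing stratify, as in the question, would be a natural extension if the class balance matters.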