I want to create id_set.csv. This file will contain the split of the data between train/validation/test. It will have 2 columns: ID and set. The IDs must be identical to the ones in dataset.csv. The set value must be one of "train", "validation" or "test". The data will be randomly split, with 50-70% going to the training set, 20-30% to the validation set and 10-20% to the test set.
# Train-test-validation split
train, test = train_test_split(self.df, test_size=0.2, random_state=1)
train, val = train_test_split(train, test_size=0.25, random_state=1)
Desired output
ID | set |
---|---|
r2_HG_3 | train |
r2_HG_4 | train |
r2_HG_5 | validation |
r2_HG_6 | validation |
r2_HG_7 | test |
r2_HG_8 | test |
CodePudding user response:
If you're working with pandas, you could shuffle the rows using df.sample(frac=1), and then assign the first 50-70% of the rows to the training set, the next 20-30% to the validation set, and the final 10-20% to the test set.
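A minimal sketch of that shuffle-and-slice idea, assuming the dataframe has a column named "ID" as in the desired output. The function name split_ids, the 60/20/20 default fractions, and the seed are illustrative choices, not something taken from the question.

```python
import pandas as pd


def split_ids(df, train_frac=0.6, val_frac=0.2, seed=1):
    """Shuffle the IDs and label them train/validation/test by position.

    Hypothetical helper: assumes df has an "ID" column and that
    train_frac + val_frac <= 1. Whatever is left over becomes the test set.
    """
    # Shuffle all rows; random_state makes the shuffle reproducible.
    shuffled = df[["ID"]].sample(frac=1, random_state=seed).reset_index(drop=True)

    n = len(shuffled)
    n_train = int(n * train_frac)  # rounded down
    n_val = int(n * val_frac)      # rounded down
    n_test = n - n_train - n_val   # remainder goes to test

    # First block is train, second is validation, the rest is test.
    shuffled["set"] = ["train"] * n_train + ["validation"] * n_val + ["test"] * n_test
    return shuffled


if __name__ == "__main__":
    # Toy stand-in for dataset.csv with 10 IDs.
    df = pd.DataFrame({"ID": [f"r2_HG_{i}" for i in range(3, 13)]})
    id_set = split_ids(df)
    id_set.to_csv("id_set.csv", index=False)
    print(id_set)
```

Because the counts are rounded down, very small datasets may end up with a slightly larger test share than the fractions suggest.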
CodePudding user response:
If I have understood correctly, the input for the split is a dataframe that already contains the ID column. In that case:
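import pandas as pd
from sklearn.model_selection import train_test_split

# Train-test-validation split: 20% test, then 25% of the remaining 80%
# for validation, which gives 60% train / 20% validation / 20% test
train, test = train_test_split(self.df, test_size=0.2, random_state=1)
train, val = train_test_split(train, test_size=0.25, random_state=1)
# Assuming train, val, test are dataframes
# A string is assigned to the "set" column. assign() returns a new
# dataframe, which avoids pandas' SettingWithCopyWarning.
train = train.assign(set='train')
val = val.assign(set='validation')
test = test.assign(set='test')
# Concatenate all the dataframes together
id_set = pd.concat([train, val, test], axis=0)
# Write only the two required columns (assumes the ID column is named "ID")
id_set[['ID', 'set']].to_csv('id_set.csv', index=False)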
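As a sanity check, the result can be compared against the requirements stated in the question. The sketch below is only an illustration: check_split and ALLOWED are hypothetical names, and it assumes the columns are called "ID" and "set".

```python
import pandas as pd

# Allowed share of rows per set, copied from the ranges in the question.
ALLOWED = {"train": (0.5, 0.7), "validation": (0.2, 0.3), "test": (0.1, 0.2)}


def check_split(id_set, dataset_ids):
    """Return True if id_set matches the IDs and the required proportions.

    Hypothetical helper: id_set is a dataframe with "ID" and "set" columns,
    dataset_ids is the list of IDs from dataset.csv.
    """
    # Every ID from the dataset must appear exactly once.
    if sorted(id_set["ID"]) != sorted(dataset_ids):
        return False

    # Fraction of rows carrying each label.
    shares = id_set["set"].value_counts(normalize=True)

    # Only the three expected labels may be present, and all of them must be.
    if set(shares.index) != set(ALLOWED):
        return False

    return all(low <= shares[name] <= high for name, (low, high) in ALLOWED.items())


if __name__ == "__main__":
    ids = [f"r2_HG_{i}" for i in range(3, 13)]
    labels = ["train"] * 6 + ["validation"] * 2 + ["test"] * 2
    print(check_split(pd.DataFrame({"ID": ids, "set": labels}), ids))
```

A label that is not one of the three expected strings, such as an abbreviated "val", makes the check fail.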