How do I create a dataframe to store the information generated from a train-test-validation split?

I want to create id_set.csv. This file will contain the split of the data between train/validation/test. It will have 2 columns: ID and set. The IDs must be identical to the ones in dataset.csv. The set value must be one of "train", "validation" or "test". The data will be randomly split, with 50-70% going to the training set, 20-30% to the validation set and 10-20% to the test set.

# Train-test-validation split
train, test = train_test_split(self.df, test_size=0.2, random_state=1)
train, val = train_test_split(train, test_size=0.25, random_state=1)

Desired output

ID	set
r2_HG_3	train
r2_HG_4	train
r2_HG_5	validation
r2_HG_6	validation
r2_HG_7	test
r2_HG_8	test

CodePudding user response:

If you're working with pandas, you could shuffle the rows using df.sample(frac=1), and then label the first 50-70% of the rows as the training set, the next 20-30% as the validation set, and the final 10-20% as the test set.
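A minimal sketch of that shuffle-and-slice idea, assuming the dataframe has an "ID" column as in the question. The function name make_id_set and the 60/20/20 default fractions are only illustrative choices within the requested ranges.

```python
import pandas as pd


def make_id_set(df, train_frac=0.6, val_frac=0.2, seed=1):
    """Shuffle df and label each ID as train, validation or test."""
    # Shuffle all rows; a fixed seed keeps the split reproducible.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    n = len(shuffled)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    n_test = n - n_train - n_val  # whatever is left goes to test

    # First block of shuffled rows is train, the next validation, the rest test.
    labels = ['train'] * n_train + ['validation'] * n_val + ['test'] * n_test
    return pd.DataFrame({'ID': shuffled['ID'], 'set': labels})
```

Calling make_id_set(df).to_csv('id_set.csv', index=False) would then write the two-column file.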

CodePudding user response:

If I have understood correctly, the input to the split is a dataframe that already contains the ID column. In that case:

import pandas as pd
from sklearn.model_selection import train_test_split

# Train-test-validation split: 60% train, 20% validation, 20% test
train, test = train_test_split(self.df, test_size=0.2, random_state=1)
train, val = train_test_split(train, test_size=0.25, random_state=1)

# train, val and test are dataframes; assign() returns copies with a new
# "set" column, which avoids SettingWithCopyWarning on the split results.
train = train.assign(set='train')
val = val.assign(set='validation')
test = test.assign(set='test')

# Concatenate the three parts and keep only the two required columns
id_set = pd.concat([train, val, test], axis=0)[['ID', 'set']]
id_set.to_csv('id_set.csv', index=False)
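
To confirm that the result meets the requirements stated in the question, a quick check along these lines might help. This is only a sketch: check_id_set is a hypothetical helper, and it assumes both dataframes have an "ID" column.

```python
import pandas as pd

ALLOWED_SETS = {'train', 'validation', 'test'}


def check_id_set(id_set, dataset):
    """Validate id_set against dataset and return each set's share of rows."""
    if list(id_set.columns) != ['ID', 'set']:
        raise ValueError('id_set must have exactly the columns ID and set')
    if not set(id_set['set']) <= ALLOWED_SETS:
        raise ValueError('set values must be train, validation or test')
    if not id_set['ID'].is_unique:
        raise ValueError('an ID appears in more than one row')
    if set(id_set['ID']) != set(dataset['ID']):
        raise ValueError('IDs differ from the ones in the dataset')
    # Fraction of rows per set, e.g. to compare with the 50-70/20-30/10-20 ranges.
    return id_set['set'].value_counts(normalize=True).to_dict()
```

For example, check_id_set(id_set, pd.read_csv('dataset.csv')) returns a dict of proportions that can be compared with the requested ranges.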