I have a dataset whose label is 0 or 1.
I want to divide my data into train and test sets. At first I used the train_test_split method from sklearn,
but I want to select the test data in such a way that 10% of it comes from class 0 and 90% from class 1.
How can I do this?
CodePudding user response:
Refer to the official documentation for sklearn.model_selection.train_test_split.
You want to pass the response variable to the stratify parameter when performing the split.
Stratification preserves the class ratio of the original data in both the train and test sets, so the test set ends up 10% class 0 and 90% class 1 only if the full dataset already has that ratio.
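A minimal sketch of a stratified split, assuming a pandas DataFrame with a column named "label" (the column name and the toy data are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows of class 0 and 80 rows of class 1.
df = pd.DataFrame({"feature": range(100), "label": [0] * 20 + [1] * 80})

# stratify=df["label"] keeps the 20/80 class ratio in both splits.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0
)

# Show the class proportions in the test set.
print(test["label"].value_counts(normalize=True))
```

Here the test set mirrors the ratio already present in the data, so this only helps if the dataset is itself roughly 10% class 0.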
CodePudding user response:
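Split your dataset into class 0 and class 1, then split each part as you want:
import pandas as pd
from sklearn.model_selection import train_test_split

# "class" is a Python keyword, so use df["class"] instead of df.class
df_0 = df.loc[df["class"] == 0]
df_1 = df.loc[df["class"] == 1]
# train_test_split returns (train, test); test_size must be passed by keyword
# This puts 10% of the class 0 rows and 90% of the class 1 rows in the test set
train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)
# axis=0 stacks the rows; sample(frac=1) shuffles the df
test = pd.concat((test_0, test_1),
                 axis=0,
                 ignore_index=True).sample(frac=1)
train = pd.concat((train_0, train_1),
                  axis=0,
                  ignore_index=True).sample(frac=1)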
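If the goal is a test set whose own composition is 10% class 0 and 90% class 1, regardless of how large each class is, one option is to sample a fixed number of rows per class. The following is only a sketch under assumptions: the helper split_with_test_ratio, its parameters, and the "label" column are hypothetical names, the DataFrame index is assumed to be unique, and each class is assumed to have at least as many rows as are requested from it.

```python
import pandas as pd


def split_with_test_ratio(df, label_col, n_test, frac_0=0.1, seed=0):
    """Return (train, test) where about frac_0 of the test rows are class 0.

    Hypothetical helper, not part of sklearn or pandas.
    """
    n_test_0 = round(n_test * frac_0)  # rows to draw from class 0
    n_test_1 = n_test - n_test_0       # remaining rows come from class 1

    test_0 = df[df[label_col] == 0].sample(n=n_test_0, random_state=seed)
    test_1 = df[df[label_col] == 1].sample(n=n_test_1, random_state=seed)
    test = pd.concat([test_0, test_1])

    # Everything not chosen for the test set becomes training data
    # (relies on the index being unique).
    train = df.drop(test.index)

    # Shuffle both sets so the classes are not grouped together.
    return (
        train.sample(frac=1, random_state=seed),
        test.sample(frac=1, random_state=seed),
    )


# Hypothetical toy data: 100 rows per class.
df = pd.DataFrame({"feature": range(200), "label": [0] * 100 + [1] * 100})
train, test = split_with_test_ratio(df, "label", n_test=50)
print(test["label"].value_counts())
```

Note that the training set then has a different class balance from the test set, which may or may not be what you want for evaluation.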