Splitting a dataset into train and test sets in Python

I have a dataset whose label is 0 or 1.

I want to divide my data into train and test sets. I first used the train_test_split method from sklearn, but I want to select the test data so that 10% of it comes from class 0 and 90% from class 1.

How can I do this?

CodePudding user response:

Refer to the official documentation for sklearn.model_selection.train_test_split.

Pass the response variable (the label column) to the stratify parameter when performing the split.

Stratification preserves the class ratio of the original data in both the train and test sets, so the test set will be 10% class 0 and 90% class 1 only if the full dataset already has that ratio.
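As a rough illustration of what stratify does, here is a minimal sketch. The DataFrame, the column names "feature" and "label", and the 20/80 class mix are made up for the example and are not from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows of class 0 and 80 rows of class 1.
df = pd.DataFrame({"feature": range(100), "label": [0] * 20 + [1] * 80})

# stratify=df["label"] keeps the 20/80 class ratio in both parts.
train, test = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=0
)

# Class counts in the 25-row test set: 5 rows of class 0, 20 of class 1.
print(test["label"].value_counts().to_dict())
```

The test set mirrors the ratio of the input data (here 20/80), so stratify alone cannot force a 10/90 mix that the data does not already have.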

CodePudding user response:

Split your dataset into class 0 and class 1, then split each part as you want:

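import pandas as pd
from sklearn.model_selection import train_test_split

# "class" is a Python keyword, so df.class is a syntax error; index by name.
df_0 = df.loc[df["class"] == 0]
df_1 = df.loc[df["class"] == 1]

# train_test_split returns (train, test), and the fraction must be
# passed as test_size. This puts 10% of class 0 and 90% of class 1 in test;
# the test set itself is 10/90 only if both classes are the same size.
train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)

# axis=0 stacks rows; sample(frac=1) shuffles all rows of the df
test = pd.concat((test_0, test_1),
                 axis=0,
                 ignore_index=True).sample(frac=1)
train = pd.concat((train_0, train_1),
                  axis=0,
                  ignore_index=True).sample(frac=1)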
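Taking 10% of one class and 90% of the other gives a 10/90 test set only when both classes have the same number of rows. If the goal is a test set whose own composition is 10% class 0 and 90% class 1, one option is to fix the test size and sample the required number of rows from each class. The sketch below assumes a pandas DataFrame with a unique index; the helper split_with_test_ratio, the column name "label", and the toy data are hypothetical names for illustration.

```python
import pandas as pd


def split_with_test_ratio(df, label_col, n_test, frac_class_0=0.1, seed=0):
    """Return (train, test) where about frac_class_0 of the test rows are class 0."""
    n_test_0 = round(n_test * frac_class_0)
    n_test_1 = n_test - n_test_0

    # sample() raises a ValueError if a class has fewer rows than requested.
    test_0 = df[df[label_col] == 0].sample(n=n_test_0, random_state=seed)
    test_1 = df[df[label_col] == 1].sample(n=n_test_1, random_state=seed)

    # Combine and shuffle the test rows; everything else goes to train.
    test = pd.concat([test_0, test_1]).sample(frac=1, random_state=seed)
    train = df.drop(test.index).sample(frac=1, random_state=seed)
    return train, test


# Hypothetical toy data: 100 rows per class.
df = pd.DataFrame({"feature": range(200), "label": [0] * 100 + [1] * 100})
train, test = split_with_test_ratio(df, "label", n_test=40)

# Class counts in the test set: 4 rows of class 0, 36 of class 1.
print(test["label"].value_counts().to_dict())
```

Because the test set is built to a fixed class mix, the train set ends up with whatever rows remain, and its class ratio will generally differ from both the test set and the original data.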