Pandas stratified splitting into train, test, and validation set based on the target variable its cl

I have a dataframe with some features and a target column with values in {0, 1}. I need to split this dataset into training, testing and validation sets. The validation set must be 20% of the dataset, and the remaining 80% must be split so that 80% of it goes into the training set. This can easily be achieved with sklearn's train_test_split.

My problem is that the splitting must be done in a stratified way based on the clusters I computed for both target values.

To compute the clusters I separated the entries for both targets into two subsets e.g.

ones = df[df_numerical['Target'] == 1].copy()
zeroes = df[df_numerical['Target'] == 0].copy()

Then for each subset I used kmeans to compute their clusters, and added the clusters to the dataframe, e.g.:

# the number of clusters for both variables is not the same
clusters_1 = kmeans_1.predict(ones[NUMERICAL_FEATURES])
ones['Cluster'] = clusters_1

clusters_0 = kmeans_0.predict(zeroes[NUMERICAL_FEATURES])
zeroes['Cluster'] = clusters_0

Now how can I split the datasets such that they are stratified by cluster size?

The split I need works like this: assuming I have 100 records, 80 of class 1 and 20 of class 0, and I need a 70% / 30% split, then the 70% part must contain 56 records (70% of 80) of class 1 and 14 records (70% of 20) of class 0. I know this can be done using the stratify parameter of train_test_split, but my problem is that, in addition to this, the split must also be stratified w.r.t. the clusters of each target value.

One solution I thought of would be to extract the indices of the elements for both classes, put them into lists, extract the right number of elements from them, and then recombine the dataframes:

cluster_indices_0 = zeroes.groupby(['Cluster']).apply(lambda x: x.index)
cluster_indices_1 = ones.groupby(['Cluster']).apply(lambda x: x.index)

But this way I'd have to manually compute, for each cluster, the number of elements to pop, and I was looking for a way to do this automatically.

Is there a function in sklearn or pandas to achieve what I'm looking for without getting lost in the computation of the number of elements to extract?

CodePudding user response：

Since you have your data already split by target, you simply need to call train_test_split on each subset and use the cluster column for stratification.

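from sklearn.model_selection import train_test_split
# hold out 20% for validation, then split the remainder 70/30 into train and test
train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])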

Then do the same for target one (the ones subset) and combine all the subsets, for example with pd.concat.
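
The whole procedure could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name split_by_cluster, the seed parameter, and the toy data (400 ones in 4 clusters, 100 zeroes in 2 clusters) are made up for the example and are not from the question. The 0.8 and 0.7 fractions are the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_cluster(subset, seed=0):
    """Split one target's subset into (train, test, validation),
    stratifying both splits on the 'Cluster' column.
    Hypothetical helper, not part of sklearn or pandas."""
    # First split: 80% train+test, 20% validation
    train_test, validation = train_test_split(
        subset, train_size=0.8, stratify=subset["Cluster"], random_state=seed
    )
    # Second split: 70% of the remainder for training, 30% for testing
    train, test = train_test_split(
        train_test, train_size=0.7, stratify=train_test["Cluster"], random_state=seed
    )
    return train, test, validation


# Toy data standing in for the real dataframe (assumed shapes and cluster counts)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500), "Target": np.repeat([1, 0], [400, 100])})

ones = df[df["Target"] == 1].copy()
ones["Cluster"] = np.tile([0, 1, 2, 3], 100)   # 4 clusters for target 1
zeroes = df[df["Target"] == 0].copy()
zeroes["Cluster"] = np.tile([0, 1], 50)        # 2 clusters for target 0

# Split each target separately, then concatenate the matching parts
parts = [split_by_cluster(subset) for subset in (zeroes, ones)]
train, test, validation = (pd.concat(group) for group in zip(*parts))
```

Because each target is split on its own, the class proportions are preserved automatically, and the cluster numbers of the two targets never get mixed up even though both start from 0. Note that train_test_split will raise an error if some cluster has too few members to appear in both sides of a split.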