I have a dataframe with some features and a target column whose values belong to {0,1}.
I need to split this dataset into training, testing and validation sets. The validation set must be 20% of the dataset, and the remaining 80% must be split so that 80% of it goes into the training set. This can easily be achieved with sklearn's train_test_split.
My problem is that the splitting must be done in a stratified way based on the clusters I computed for both target values.
To compute the clusters I separated the entries for both targets into two subsets e.g.
ones = df[df_numerical['Target'] == 1].copy()
zeroes = df[df_numerical['Target'] == 0].copy()
Then for each subset I used kmeans to compute their clusters, and added the clusters to the dataframe, e.g.:
# the number of clusters for both variables is not the same
clusters_1 = kmeans_1.predict(ones[NUMERICAL_FEATURES])
ones['Cluster'] = clusters_1
clusters_0 = kmeans_0.predict(zeroes[NUMERICAL_FEATURES])
zeroes['Cluster'] = clusters_0
Now how can I split the datasets such that they are stratified by cluster size?
The split I need works like this: assuming I have 100 records, 80 of class 1 and 20 of class 0, and I need to split these records 70% / 30%, then the 70% part must contain 56 (70% of 80) records of class 1 and 14 (70% of 20) of class 0. I know this can be done using the stratify parameter of train_test_split, but my problem is that, in addition to this, the split must also be stratified w.r.t. the clusters of each target value.
One solution I thought of would be to extract the indices of the elements for both classes, put them into lists, extract the right number of elements from them, and then recombine the dataframes:
cluster_indices_0 = zeroes.groupby(['Cluster']).apply(lambda x: x.index)
cluster_indices_1 = ones.groupby(['Cluster']).apply(lambda x: x.index)
But this way I'd have to manually compute, for each cluster, the number of elements to pop, and I was looking for a way to do this automatically.
Is there a function in sklearn or pandas to achieve what I'm looking for without getting lost in the computation of the number of elements to extract?
CodePudding user response:
Since you already have your data split by target, you simply need to call train_test_split on each subset and use the cluster column for stratification.
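from sklearn.model_selection import train_test_split
# hold out 20% for validation, then split the remaining 80% into train and test
train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])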
Then do the same for target one and combine all the subsets.
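To make the "do the same and combine" step concrete, here is a minimal sketch of the whole procedure. The helper name split_by_cluster, the seed argument and the toy dataframes are assumptions made for illustration only; the 0.8 and 0.7 ratios are simply taken from the snippet above, so adjust them to the proportions you actually need.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_cluster(subset, seed=0):
    """Split one target subset into train/test/validation, stratified by 'Cluster'.

    Hypothetical helper: the name and the `seed` argument are not from the question.
    """
    # Hold out 20% for validation, keeping cluster proportions.
    train_test, validation = train_test_split(
        subset, train_size=0.8, stratify=subset["Cluster"], random_state=seed
    )
    # Split the remainder into train and test, again per cluster.
    train, test = train_test_split(
        train_test, train_size=0.7, stratify=train_test["Cluster"], random_state=seed
    )
    return train, test, validation


# Toy stand-ins for the real `ones` and `zeroes` dataframes (made-up values).
rng = np.random.default_rng(0)
ones = pd.DataFrame(
    {"x": rng.normal(size=80), "Target": 1, "Cluster": np.repeat([0, 1], [50, 30])}
)
zeroes = pd.DataFrame(
    {"x": rng.normal(size=20), "Target": 0, "Cluster": np.repeat([0, 1], [10, 10])},
    index=range(80, 100),  # keep indices distinct, as they would be when sliced from one df
)

# Split each target subset separately, then stack the matching pieces.
parts = [split_by_cluster(subset) for subset in (ones, zeroes)]
train, test, validation = (pd.concat(group) for group in zip(*parts))

# Inspect how many rows of each (Target, Cluster) pair landed in each split.
for name, part in [("train", train), ("test", test), ("validation", validation)]:
    print(name, part.groupby(["Target", "Cluster"]).size().to_dict())
```

Because each call stratifies within a single target subset, the combined sets keep the target ratio and the per-target cluster ratios, up to rounding. Note that train_test_split raises an error if a cluster has fewer than two members, so very small clusters may need to be merged or handled separately.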