Sklearn's GroupShuffleSplit is yielding overlapping results

I have used GroupShuffleSplit in the past and it has worked fine. Now I am trying to split based on a column, and the same group values show up in both the train and test data. This is what I am running:

val_inds, test_inds = next(GroupShuffleSplit(test_size=0.5,
                                           n_splits=2,).split(df, groups=df['cl_uid'].values))


df_val = df[df.index.isin(val_inds)]
df_test = df[df.index.isin(test_inds)]

# this value is not zero
len(set(df_val.cl_uid).intersection(set(df_test.cl_uid)))

Any idea what could be going on?

sklearn version 0.24.1 and Python version 3.6

User response:

GroupShuffleSplit.split yields integer array positions, not index labels, so to split your DataFrame you should select rows with .iloc.

df_val = df.iloc[val_inds]
df_test = df.iloc[test_inds]

If you filter on the index with df.index.isin instead, you are implicitly assuming that the DataFrame has a RangeIndex with no duplicates that starts at 0, so that labels and positions coincide. If that is not the case, the filter selects the wrong rows and the groups can overlap.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# DataFrame with a non-RangeIndex
df = pd.DataFrame({'clust_id': [1,1,1,2,2,2,2]}, index=[1,2,1,2,1,2,3])

val_inds, test_inds = next(GroupShuffleSplit(test_size=0.5, n_splits=2,).split(df, groups=df['clust_id']))

Correct splitting:

df_val = df.iloc[val_inds]
#   clust_id
#2         2
#1         2
#2         2
#3         2

df_test = df.iloc[test_inds]
#   clust_id
#1         1
#2         1
#1         1

Incorrect splitting, which treats the array positions as index labels:

df[df.index.isin(val_inds)]
#   clust_id
#3         2

df[df.index.isin(test_inds)]
#   clust_id
#1         1
#2         1
#1         1
#2         2
#1         2
#2         2
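
To confirm the fix on your own data, a check along the following lines may help. This is only a sketch: the DataFrame is made up, the column name cl_uid is borrowed from the question, and random_state=0 and n_splits=1 are arbitrary choices made here for reproducibility. It selects by position with .iloc, asserts that no group lands on both sides, and shows one way to convert positions to labels if label-based selection is really needed (which is only safe when the index labels are unique).

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: unique but shuffled index labels (not a RangeIndex)
df = pd.DataFrame(
    {"cl_uid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]},
    index=[7, 3, 9, 0, 5, 1, 8, 2, 6, 4],
)

# random_state is an arbitrary choice here, used only for reproducibility
splitter = GroupShuffleSplit(test_size=0.5, n_splits=1, random_state=0)
val_inds, test_inds = next(splitter.split(df, groups=df["cl_uid"]))

# The split returns array positions, so select by position
df_val = df.iloc[val_inds]
df_test = df.iloc[test_inds]

# No group should appear on both sides of the split
overlap = set(df_val["cl_uid"]) & set(df_test["cl_uid"])
assert not overlap, f"groups on both sides: {overlap}"

# If labels are needed, translate positions to labels first.
# This round trip is only valid because the labels above are unique.
val_labels = df.index[val_inds]
df_val_by_label = df.loc[val_labels]
assert df_val_by_label.equals(df_val)
```

Calling df.reset_index(drop=True) before splitting is another option, since it makes labels and positions coincide, at the cost of discarding the original index.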