Sklearn's GroupShuffleSplit is yielding overlapping results

Time:06-24

I have used GroupShuffleSplit in the past and it worked fine. Now I am trying to split based on a column, and it is producing overlap between the test and train data. This is what I am running:

val_inds, test_inds = next(GroupShuffleSplit(test_size=0.5,
                                           n_splits=2,).split(df, groups=df['cl_uid'].values))


df_val = df[df.index.isin(val_inds)]
df_test = df[df.index.isin(test_inds)]

# this value is not zero
len(set(df_val.cl_uid).intersection(set(df_test.cl_uid)))

Any idea what could be going on?

I am using sklearn version 0.24.1 and Python version 3.6.

Answer:

GroupShuffleSplit.split() yields arrays of positional indices, not index labels, so to split your DataFrame you should filter with .iloc.

df_val = df.iloc[val_inds]
df_test = df.iloc[test_inds]

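If you mistakenly filter on the index instead, you are assuming that the DataFrame has a unique RangeIndex that starts at 0, so that labels and positions coincide. If that is not the case, the filter selects the wrong rows, and duplicated labels can pull rows from the same group into both sets.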
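The assumption described above can be checked before filtering. This is only a minimal sketch: the helper name index_matches_positions is hypothetical (it is not part of pandas or scikit-learn), and it simply compares the index against a default RangeIndex of the same length.

```python
import pandas as pd


def index_matches_positions(df):
    """Return True if every index label equals its row position (0, 1, 2, ...).

    Hypothetical helper for illustration only.
    """
    return df.index.equals(pd.RangeIndex(len(df)))


# Default index: labels and positions coincide
clean = pd.DataFrame({'clust_id': [1, 1, 2, 2]})

# Duplicated, non-zero-based index: labels and positions differ
messy = pd.DataFrame({'clust_id': [1, 1, 1, 2, 2, 2, 2]},
                     index=[1, 2, 1, 2, 1, 2, 3])

clean_ok = index_matches_positions(clean)
messy_ok = index_matches_positions(messy)

# Resetting the index makes labels match positions again
reset_ok = index_matches_positions(messy.reset_index(drop=True))
```

If the check fails, either use .iloc as shown above or call reset_index(drop=True) before splitting.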


import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# DataFrame with a non-RangeIndex
df = pd.DataFrame({'clust_id': [1,1,1,2,2,2,2]}, index=[1,2,1,2,1,2,3])

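# split() yields positional indices; no random_state is set, so which cluster
# lands on which side varies between runs (one possible result is shown below)
val_inds, test_inds = next(GroupShuffleSplit(test_size=0.5, n_splits=2).split(df, groups=df['clust_id']))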

Correct splitting:

df_val = df.iloc[val_inds]
#   clust_id
#2         2
#1         2
#2         2
#3         2

df_test = df.iloc[test_inds]
#   clust_id
#1         1
#2         1
#1         1

Incorrect splitting, which confuses index labels with array positions:

df[df.index.isin(val_inds)]
#   clust_id
#3         2

df[df.index.isin(test_inds)]
#   clust_id
#1         1
#2         1
#1         1
#2         2
#1         2
#2         2
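To guard against this mistake, the split and the overlap check can be wrapped together. The following is a rough sketch under stated assumptions: the function name split_by_group, its parameters, and the sample data are made up for illustration. It relies only on GroupShuffleSplit and .iloc as used above, and passes random_state so the result is reproducible.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_group(df, group_col, test_size=0.5, random_state=0):
    """Split df into two parts that share no value of group_col.

    Hypothetical helper for illustration only.
    """
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=random_state)
    first_pos, second_pos = next(splitter.split(df, groups=df[group_col]))

    # The arrays hold row positions, so select with .iloc, not by label
    first, second = df.iloc[first_pos], df.iloc[second_pos]

    # Sanity check: no group may appear on both sides
    shared = set(first[group_col]) & set(second[group_col])
    assert not shared, f"groups on both sides: {shared}"
    return first, second


# Sample data with a duplicated, non-zero-based index (made up for this sketch)
df = pd.DataFrame({'clust_id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]},
                  index=[1, 2, 1, 2, 1, 2, 3, 5, 5, 9])

df_val, df_test = split_by_group(df, 'clust_id')
overlap = set(df_val['clust_id']) & set(df_test['clust_id'])
```

Here overlap is empty and every row ends up in exactly one of the two frames, regardless of what the index looks like.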