I have used GroupShuffleSplit in the past and it has worked OK. But now I am trying to split based on a column, and it is producing overlap between the test and train data. This is what I am running:
val_inds, test_inds = next(GroupShuffleSplit(test_size=0.5,
n_splits=2,).split(df, groups=df['cl_uid'].values))
df_val = df[df.index.isin(val_inds)]
df_test = df[df.index.isin(test_inds)]
# this value is not zero
len(set(df_val.cl_uid).intersection(set(df_test.cl_uid)))
Any idea what could be going on?
I am using sklearn version 0.24.1 and Python version 3.6.
CodePudding user response:
GroupShuffleSplit.split returns arrays of integer row positions, not index labels, so to split your DataFrame you should select rows with .iloc:
df_val = df.iloc[val_inds]
df_test = df.iloc[test_inds]
If you filter on the index labels instead, you are implicitly assuming that the DataFrame has a RangeIndex with no duplicates that begins at 0. If that is not the case, the labels no longer match the row positions and the filter selects the wrong rows.
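A minimal sketch of how that assumption could be checked before relying on label-based filtering. The helper name is_default_range_index and the sample data are hypothetical and chosen only for illustration; reset_index(drop=True) is shown as one possible way to make labels and positions agree.

```python
import pandas as pd


def is_default_range_index(df: pd.DataFrame) -> bool:
    """Return True when the index labels are exactly 0..len(df)-1, in order."""
    return bool(df.index.equals(pd.RangeIndex(len(df))))


# Hypothetical data: duplicated labels that do not start at 0
df_example = pd.DataFrame({'cl_uid': [10, 10, 20, 20]}, index=[5, 6, 5, 7])
print(is_default_range_index(df_example))

# After resetting the index, labels and positions coincide
df_reset = df_example.reset_index(drop=True)
print(is_default_range_index(df_reset))
```

Even when the check passes, .iloc remains the safer choice because it does not depend on the index at all.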
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
# DataFrame with a non-RangeIndex
df = pd.DataFrame({'clust_id': [1,1,1,2,2,2,2]}, index=[1,2,1,2,1,2,3])
val_inds, test_inds = next(GroupShuffleSplit(test_size=0.5, n_splits=2,).split(df, groups=df['clust_id']))
Correct splitting (which cluster lands in which split varies between runs, because no random_state is set):
df_val = df.iloc[val_inds]
#    clust_id
# 2         2
# 1         2
# 2         2
# 3         2
df_test = df.iloc[test_inds]
#    clust_id
# 1         1
# 2         1
# 1         1
Incorrect splitting, which confuses index labels with array positions:
df[df.index.isin(val_inds)]
#    clust_id
# 3         2
df[df.index.isin(test_inds)]
#    clust_id
# 1         1
# 2         1
# 1         1
# 2         2
# 1         2
# 2         2
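To guard against this kind of mistake, the split can be wrapped in a small helper that selects rows by position and then verifies that no group appears in both parts. This is only a sketch under stated assumptions: the function name split_by_group and the fixed random_state=0 are illustrative choices, not part of the original code.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_group(df, group_col, test_size=0.5, random_state=0):
    """Split df into two parts so that no value of group_col is shared."""
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # split() yields arrays of integer row positions, not index labels
    first_pos, second_pos = next(splitter.split(df, groups=df[group_col]))
    first, second = df.iloc[first_pos], df.iloc[second_pos]

    # Sanity check: the two parts must not share any group
    overlap = set(first[group_col]) & set(second[group_col])
    assert not overlap, f"groups present in both splits: {overlap}"
    return first, second


# Same DataFrame as above, with duplicated, non-default index labels
df = pd.DataFrame({'clust_id': [1, 1, 1, 2, 2, 2, 2]}, index=[1, 2, 1, 2, 1, 2, 3])
df_val, df_test = split_by_group(df, 'clust_id')
print(len(df_val) + len(df_test) == len(df))
```

Because the rows are selected with .iloc, the check passes regardless of what the index looks like.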