I am trying to split my data set into train and test sets by using:
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.loc[train_set]
    stratified_test = complete_df.loc[test_set]
My DataFrame complete_df does not have any NaN values. I made sure of this with complete_df.isnull().sum().max(), which returned 0.
But I still get a warning saying:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
And it leads to an error later. I tried some techniques I found online, but they still do not fix it.
CodePudding user response:
First, you should clarify what stratified is. I'm assuming it's an sklearn StratifiedShuffleSplit object.
"my data set complete_df does not have any NaN value."
The "missing labels" in the warning message don't refer to missing values, i.e. NaNs. The warning is saying that train_set and/or test_set contain values (labels) that are not present in the index of complete_df. That's because .loc performs indexing based on row (and column) labels, not row positions, while train_set and test_set hold integer row positions. So if the index of your DataFrame doesn't coincide with the integer positions of the rows, which seems to be the case here, the warning is raised.
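To make the label vs. position distinction concrete, here is a minimal sketch. The small DataFrame, its index values, and the positions list are made up for illustration; only the column name comes from the question. In recent pandas versions the .loc call raises KeyError instead of just warning.

```python
import pandas as pd

# Hypothetical frame: index labels (10, 20, 30, 40) differ from row positions (0..3).
df = pd.DataFrame({"loan_condition_int": [0, 1, 0, 1]}, index=[10, 20, 30, 40])

# Integer row positions, like the arrays a splitter yields.
positions = [0, 2]

# .iloc selects by position: first and third rows (labels 10 and 30).
by_position = df.iloc[positions]

# .loc selects by label: 0 and 2 are not labels in this index, so the lookup fails.
try:
    df.loc[positions]
    loc_failed = False
except KeyError:
    loc_failed = True

print(by_position)
print("loc failed:", loc_failed)
```

If your index happens to be the default 0..n-1 range, labels and positions coincide and both accessors return the same rows, which is why this problem often only shows up after filtering or dropping rows.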
To select by row position, use .iloc. This should work:
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.iloc[train_set]
    stratified_test = complete_df.iloc[test_set]
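For reference, a self-contained sketch of the same loop, under the assumption stated above that stratified is an sklearn StratifiedShuffleSplit. The toy data, the "amount" column, the index starting at 100, and the splitter parameters (n_splits, test_size, random_state) are all placeholders, not values from the question.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: index labels run 100..119, so labels != row positions.
complete_df = pd.DataFrame(
    {"amount": range(20), "loan_condition_int": [0, 1] * 10},
    index=range(100, 120),
)

# Assumed splitter settings; the question does not show how it was created.
stratified = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields arrays of integer row positions, so select with .iloc.
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.iloc[train_set]
    stratified_test = complete_df.iloc[test_set]

print(stratified_train.shape, stratified_test.shape)
print(stratified_test["loan_condition_int"].value_counts())
```

With a non-default index like this one, replacing .iloc with .loc would look up labels 0..19, none of which exist.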