Passing list-likes to .loc or [] with any missing label will raise KeyError in the future

I am trying to split my data set into train and test sets by using:

for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.loc[train_set]
    stratified_test = complete_df.loc[test_set]

My dataframe complete_df does not have any NaN values. I made sure of this by checking complete_df.isnull().sum().max(), which returned 0.

But I still get a warning saying:

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

And it leads to an error later. I tried some techniques I found online, but they still do not fix it.

CodePudding user response:

First, you should clarify what stratified is. I'm assuming it's a scikit-learn StratifiedShuffleSplit object.

You wrote: "My dataframe complete_df does not have any NaN value."

The "missing labels" in the warning message are not missing values (NaNs). The warning says that train_set and/or test_set contain values that are not present in the index of complete_df. That is because .loc selects by row (and column) labels, not by row position, whereas train_set and test_set hold integer row positions. So if the index of your DataFrame does not coincide with the integer positions of the rows, which seems to be the case here, the warning is raised.
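To make the label/position distinction concrete, here is a minimal sketch on a made-up frame (df, its column x, and the index values are hypothetical, not taken from the question). As far as I know, pandas 1.0 and later raise KeyError in this situation, while older versions emitted the warning quoted above and returned NaN rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: index labels (10, 20, 30, 40) differ from row positions (0..3).
df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=[10, 20, 30, 40])

# Integer row positions, like the arrays a splitter's split() yields.
positions = np.array([0, 2])

# Position-based selection works regardless of what the index labels are.
by_position = df.iloc[positions]

# Label-based selection looks for the labels 0 and 2, which are not in the index.
try:
    df.loc[positions]
    loc_failed = False
except KeyError:
    loc_failed = True

print(by_position)
print("loc raised KeyError:", loc_failed)
```

Here .iloc returns the rows labelled 10 and 30, because those sit at positions 0 and 2.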

To select by row position, use .iloc instead. This should work:

for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.iloc[train_set]
    stratified_test = complete_df.iloc[test_set]
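For reference, a self-contained sketch of the same loop on synthetic data. It assumes stratified is a StratifiedShuffleSplit, as guessed above; the loan_amount column, the index values, and the n_splits, test_size and random_state settings are illustrative assumptions, not the asker's actual data or configuration.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for complete_df; the index deliberately does not start at 0,
# so labels and positions disagree (loan_amount is a hypothetical column).
complete_df = pd.DataFrame(
    {
        "loan_amount": range(100, 120),
        "loan_condition_int": [0] * 10 + [1] * 10,
    },
    index=range(1000, 1020),
)

# Illustrative settings: one split, 20% of the rows held out for testing.
stratified = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields arrays of integer row positions, so select with .iloc.
for train_set, test_set in stratified.split(complete_df, complete_df["loan_condition_int"]):
    stratified_train = complete_df.iloc[train_set]
    stratified_test = complete_df.iloc[test_set]

print(len(stratified_train), len(stratified_test))
print(stratified_test["loan_condition_int"].value_counts())
```

If you would rather keep using .loc, calling complete_df.reset_index(drop=True) before the split makes the index labels coincide with the row positions.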