scikit-learn train and test split returns NaNs

Time:11-20

My sample data looks like this:

customer_id   revenue_m10   revenue_m9   revenue_m8   target
1             1234          1231         1256         1239
2             5678          3425         3255         2345

I am trying to split my dataset into train and test sets using scikit-learn's train_test_split function.

So I tried the code below:

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    sample_set_df[all_features],
    sample_set_df[target_var],
    test_size=0.3,
)

But when I view y_test, the output contains NaNs. I am not sure what the issue is. Is the index missing, or is it something else?

If the index is the issue, how can I fix it?


CodePudding user response:

y_test is a pandas Series, so printing it displays both its index and its data. It seems that sample_set_df has NaNs in its index.

Having NaNs in the index does not affect how train_test_split splits the data, because the split is done by row position, not by index label. You might have an issue with the actual data, though: the target is 0 in the rows where the index is NaN.
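
To check this on your side, something like the sketch below might help. It is only an illustration: the DataFrame is made-up data that reuses the two rows from the question and adds two hypothetical rows with a NaN index label and a target of 0, and the column lists all_features and target_var are assumed to look like this. It counts the NaN index labels, shows that train_test_split still runs, and then shows two possible ways to handle the rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: the first two rows come from the question,
# the last two rows are invented and have NaN as their index label.
sample_set_df = pd.DataFrame(
    {
        "revenue_m10": [1234, 5678, 0, 0],
        "revenue_m9": [1231, 3425, 0, 0],
        "revenue_m8": [1256, 3255, 0, 0],
        "target": [1239, 2345, 0, 0],
    },
    index=[1.0, 2.0, np.nan, np.nan],
)

# Assumed definitions of the names used in the question.
all_features = ["revenue_m10", "revenue_m9", "revenue_m8"]
target_var = "target"

# How many index labels are NaN?
n_nan_labels = int(sample_set_df.index.isna().sum())
print("NaN index labels:", n_nan_labels)

# Inspect the rows behind those labels before deciding what to do with them.
print(sample_set_df[sample_set_df.index.isna()])

# The split itself works regardless of the index labels.
X_train, X_test, y_train, y_test = train_test_split(
    sample_set_df[all_features],
    sample_set_df[target_var],
    test_size=0.5,
    random_state=0,
)

# Option 1: drop the rows whose index label is NaN.
clean_df = sample_set_df[sample_set_df.index.notna()]

# Option 2: keep every row but replace the index with 0..n-1.
relabeled_df = sample_set_df.reset_index(drop=True)
```

Which option fits depends on where the NaN labels came from. If those rows are leftovers from an earlier merge or concat, dropping them is probably what you want. If the rows are valid and only the labels were lost, resetting the index is enough.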
