I could not get the code that calculates PSI values to work, and I am not very familiar with the feature_engine library or with ML-related operations in general.
The code I am currently trying to run is:
long_list = merge_into_df(oot_path, test_path, train_path, key_mapping_path)
long_list.drop(columns=['Unnamed: 0_x', 'CLIENT_ID', 'SET'], inplace=True)
long_list['REF_DATE'] = pd.to_datetime(long_list.REF_DATE)
print(long_list.head())
transformer = DropHighPSIFeatures(
    cut_off=pd.to_datetime("2019/09/30"),  # the cut-off date
    split_col='REF_DATE',                  # the date variable
    strategy='equal_frequency',
    bins=8,
    threshold=0.1,
    missing_values='ignore'
)
transformer.fit_transform(long_list)
return transformer.psi_values_
The error message returned is:
Traceback (most recent call last):
File "C:\Users\Dell\Pipeline\modelling.py", line 124, in <module>
test()
File "C:\Users\Dell\Pipeline\modelling.py", line 98, in test
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\selection\drop_psi_features.py", line 364, in fit
test_discrete = bucketer.transform(test_df[[feature]].dropna())
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\discretisation\base_discretiser.py", line 74, in transform
X = super().transform(X)
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\base_transformers.py", line 146, in transform
X = check_X(X)
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\dataframe_checks.py", line 82, in check_X
raise ValueError(
ValueError: 0 feature(s) (shape=(0, 1)) while a minimum of 1 is required.
The output of the dataframe print statement in the previous code snippet is:
ID TARGET GROUP_ID BRANCH_ID ... SON_4_12AY_7_12AY_EKOD_1 SON_4_12AY_7_12AY_EKOD_U Unnamed: 0_y REF_DATE
0 0 0 0 1020 ... 0 0 0 2016-12-31
1 2 0 0 2280 ... 0 0 2 2016-12-31
2 3 0 0 1150 ... 0 0 3 2016-12-31
3 4 1 0 1000 ... 0 0 4 2016-12-31
4 5 0 0 1090 ... 0 0 5 2016-12-31
[5 rows x 1976 columns]
So I assumed there is nothing problematic in the dataframe itself (apart from the Unnamed: 0_y column, maybe).
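For context on what the traceback means: the failure happens inside `bucketer.transform(test_df[[feature]].dropna())`, which suggests that at least one column ends up with zero non-NaN rows on one side of the cut-off split, so the discretiser receives an empty frame. A minimal sketch (with synthetic data; the column names and the `>` cut-off comparison are assumptions, not taken from the original pipeline) that reproduces the condition and lists the offending columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for long_list: SPARSE_FEAT is entirely NaN after the
# cut-off date, which reproduces the "shape=(0, 1)" condition.
df = pd.DataFrame({
    "REF_DATE": pd.to_datetime(["2019-06-30"] * 4 + ["2019-12-31"] * 4),
    "DENSE_FEAT": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "SPARSE_FEAT": [1.0, np.nan, 2.0, np.nan,
                    np.nan, np.nan, np.nan, np.nan],
})

cut_off = pd.to_datetime("2019-09-30")
test_split = df[df["REF_DATE"] > cut_off]

# Columns whose test-side slice is empty after dropna() would crash
# DropHighPSIFeatures exactly as in the traceback above.
empty_after_dropna = [
    col for col in ["DENSE_FEAT", "SPARSE_FEAT"]
    if test_split[[col]].dropna().empty
]
print(empty_after_dropna)  # -> ['SPARSE_FEAT']
```

Running a check like this over all 1976 columns would show whether sparsity, rather than the dataframe contents, is the real culprit.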
However, just in case, the method in which I create the dataframe from three long-list CSV files and a key-mapping CSV file is this:
train_df = pd.read_csv(train_path, low_memory=False)
test_df = pd.read_csv(test_path, low_memory=False)
oot_df = pd.read_csv(oot_path, low_memory=False)
key_mapping_df = pd.read_csv(key_mapping_path)
long_list_df = pd.concat([train_df, test_df, oot_df], axis=0)
long_list_final_df = long_list_df.merge(key_mapping_df, on="ID", how="inner", sort=True)
return long_list_final_df
Answer:
It turns out the problem was caused either by the data in the DataFrame (long_list) being too sparse (too many NaN values) or by it being too large. I haven't run an experiment to figure out which, but the problem was resolved once I dropped the columns with a lot of NaN values.
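The fix described above can be sketched as a simple NaN-fraction filter applied before fitting the transformer. This is a minimal sketch: the 0.5 threshold and the column names are illustrative, not taken from the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-NaN column standing in for long_list.
df = pd.DataFrame({
    "REF_DATE": pd.to_datetime(
        ["2019-06-30", "2019-06-30", "2019-12-31", "2019-12-31"]),
    "GOOD": [1.0, 2.0, 3.0, 4.0],
    "MOSTLY_NAN": [np.nan, np.nan, np.nan, 1.0],
})

# Drop any column whose share of missing values exceeds the threshold,
# so no feature can end up empty on either side of the cut-off split.
max_nan_frac = 0.5
nan_frac = df.isna().mean()
sparse_cols = nan_frac[nan_frac > max_nan_frac].index.tolist()
df_clean = df.drop(columns=sparse_cols)
print(sparse_cols)  # -> ['MOSTLY_NAN']
```

With the sparse columns removed, `df_clean` can then be passed to `DropHighPSIFeatures` as in the original snippet.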