I could not get the code that calculates PSI values to work, and I am not very familiar with the feature_engine library or with ML-related operations in general.
The code I am currently trying to run is:
long_list = merge_into_df(oot_path, test_path, train_path, key_mapping_path)
long_list.drop(columns=['Unnamed: 0_x', 'CLIENT_ID', 'SET'], inplace=True)
long_list['REF_DATE'] = pd.to_datetime(long_list.REF_DATE)
print(long_list.head())
transformer = DropHighPSIFeatures(
    cut_off=pd.to_datetime("2019/09/30"),  # the cut-off date
    split_col='REF_DATE',                  # the date variable
    strategy='equal_frequency',
    bins=8,
    threshold=0.1,
    missing_values='ignore'
)
transformer.fit_transform(long_list)
return transformer.psi_values_
The error message returned is:
Traceback (most recent call last):
File "C:\Users\Dell\Pipeline\modelling.py", line 124, in <module>
test()
File "C:\Users\Dell\Pipeline\modelling.py", line 98, in test
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\selection\drop_psi_features.py", line 364, in fit
test_discrete = bucketer.transform(test_df[[feature]].dropna())
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\discretisation\base_discretiser.py", line 74, in transform
X = super().transform(X)
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\base_transformers.py", line 146, in transform
X = check_X(X)
File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\dataframe_checks.py", line 82, in check_X
raise ValueError(
ValueError: 0 feature(s) (shape=(0, 1)) while a minimum of 1 is required.
The output of the dataframe print statement in the previous code snippet is:
ID TARGET GROUP_ID BRANCH_ID ... SON_4_12AY_7_12AY_EKOD_1 SON_4_12AY_7_12AY_EKOD_U Unnamed: 0_y REF_DATE
0 0 0 0 1020 ... 0 0 0 2016-12-31
1 2 0 0 2280 ... 0 0 2 2016-12-31
2 3 0 0 1150 ... 0 0 3 2016-12-31
3 4 1 0 1000 ... 0 0 4 2016-12-31
4 5 0 0 1090 ... 0 0 5 2016-12-31
[5 rows x 1976 columns]
So I assumed there is nothing problematic in the dataframe itself (apart from the Unnamed: 0_y column, maybe).
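For context on what the traceback means: the failure happens inside `bucketer.transform(test_df[[feature]].dropna())`, which suggests that at least one column ends up with zero non-NaN rows on one side of the cut-off split, so the discretiser receives an empty frame. A minimal sketch (with synthetic data; the column names and the `>` cut-off comparison are assumptions, not taken from the original pipeline) that reproduces the condition and lists the offending columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for long_list: SPARSE_FEAT is entirely NaN after the
# cut-off date, which reproduces the "shape=(0, 1)" condition.
df = pd.DataFrame({
    "REF_DATE": pd.to_datetime(["2019-06-30"] * 4 + ["2019-12-31"] * 4),
    "DENSE_FEAT": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "SPARSE_FEAT": [1.0, np.nan, 2.0, np.nan,
                    np.nan, np.nan, np.nan, np.nan],
})

cut_off = pd.to_datetime("2019-09-30")
test_split = df[df["REF_DATE"] > cut_off]

# Columns whose test-side slice is empty after dropna() would crash
# DropHighPSIFeatures exactly as in the traceback above.
empty_after_dropna = [
    col for col in ["DENSE_FEAT", "SPARSE_FEAT"]
    if test_split[[col]].dropna().empty
]
print(empty_after_dropna)  # -> ['SPARSE_FEAT']
```

Running a check like this over all 1976 columns would show whether sparsity, rather than the dataframe contents, is the real culprit.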
However, just in case, the method in which I create the dataframe from three long-list CSV files and a key-mapping CSV file is this:
train_df = pd.read_csv(train_path, low_memory=False)
test_df = pd.read_csv(test_path, low_memory=False)
oot_df = pd.read_csv(oot_path, low_memory=False)
key_mapping_df = pd.read_csv(key_mapping_path)
long_list_df = pd.concat([train_df, test_df, oot_df], axis=0)
long_list_final_df = long_list_df.merge(key_mapping_df, on="ID", how="inner", sort=True)
return long_list_final_df
Answer:
It turns out the problem was caused either by the data in the DataFrame (long_list) being too sparse (too many NaN values) or by it being too large. I haven't run an experiment to figure out which, but the problem was resolved once I dropped the columns with a lot of NaN values.
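The fix described above can be sketched as a simple NaN-fraction filter applied before fitting the transformer. This is a minimal sketch: the 0.5 threshold and the column names are illustrative, not taken from the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-NaN column standing in for long_list.
df = pd.DataFrame({
    "REF_DATE": pd.to_datetime(
        ["2019-06-30", "2019-06-30", "2019-12-31", "2019-12-31"]),
    "GOOD": [1.0, 2.0, 3.0, 4.0],
    "MOSTLY_NAN": [np.nan, np.nan, np.nan, 1.0],
})

# Drop any column whose share of missing values exceeds the threshold,
# so no feature can end up empty on either side of the cut-off split.
max_nan_frac = 0.5
nan_frac = df.isna().mean()
sparse_cols = nan_frac[nan_frac > max_nan_frac].index.tolist()
df_clean = df.drop(columns=sparse_cols)
print(sparse_cols)  # -> ['MOSTLY_NAN']
```

With the sparse columns removed, `df_clean` can then be passed to `DropHighPSIFeatures` as in the original snippet.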