I have a dataset with city and province names, but here in Belgium we have both a province Limburg and a city Limburg. So when I try to train the model, I get this error:
[LightGBM] [Fatal] Feature (Limbourg) appears more than one time.
I do the one-hot encoding like this:
import pandas as pd
#One Hot Encoding of the Categorical features
one_hot_city_name=pd.get_dummies(data.city_name)
one_hot_state_of_the_building=pd.get_dummies(data.state_of_the_building)
one_hot_province=pd.get_dummies(data.province)
one_hot_region=pd.get_dummies(data.region)
How can I solve this? Maybe by appending "Province" to the one-hot encoded column names? If so, how?
CodePudding user response:
Since you are one-hot encoding the features and using them as input for LightGBM, it won't hurt to rename the conflicting values, or modify them slightly, so that the resulting column names no longer collide. I would suggest something like:
import pandas as pd
#One Hot Encoding of the Categorical features
# replace() changes the values; rename() would only relabel the index
one_hot_city_name=pd.get_dummies(data.city_name.replace({'Limburg':'Limburg (city)'}))
one_hot_state_of_the_building=pd.get_dummies(data.state_of_the_building)
one_hot_province=pd.get_dummies(data.province)
one_hot_region=pd.get_dummies(data.region)
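A minimal sketch to check that renaming the city value removes the clash once the dummy frames are combined. The toy data and the `features` and `has_duplicates` names are made up for illustration; the real frame has more columns, and the exact spelling of the value in your data may differ.

```python
import pandas as pd

# Hypothetical toy data: "Limburg" appears both as a city and as a province.
data = pd.DataFrame({
    "city_name": ["Limburg", "Gent", "Hasselt"],
    "province": ["Antwerpen", "Oost-Vlaanderen", "Limburg"],
})

# Change the city value so its dummy column gets a distinct name.
one_hot_city_name = pd.get_dummies(
    data.city_name.replace({"Limburg": "Limburg (city)"})
)
one_hot_province = pd.get_dummies(data.province)

# Combine the dummy columns and check for duplicated column names.
features = pd.concat([one_hot_city_name, one_hot_province], axis=1)
has_duplicates = features.columns.duplicated().any()
print(has_duplicates)
```

If `has_duplicates` is still true on your real data, some other value is shared between two of the encoded columns.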
CodePudding user response:
I'd use the prefix argument of the get_dummies function to name the features with a user-defined prefix, like this:
import pandas as pd
# Small example frame with overlapping values in 'city' and 'province'
data = pd.DataFrame({'city': ['a', 'b', 'a'], 'province': ['b', 'a', 'c'],
                     'C': [1, 2, 3]})
one_hot_city = pd.get_dummies(data.city, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')
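As a possible shortcut, get_dummies can also encode several columns of the DataFrame in one call through its `columns` argument; by default each dummy column is then prefixed with its source column name. A small sketch on the same toy frame (the name `encoded` is just for illustration):

```python
import pandas as pd

data = pd.DataFrame({'city': ['a', 'b', 'a'], 'province': ['b', 'a', 'c'],
                     'C': [1, 2, 3]})

# Encode both categorical columns at once; the default prefix is the
# source column name, joined with '_' (prefix_sep).
encoded = pd.get_dummies(data, columns=['city', 'province'])

print(list(encoded.columns))
# ['C', 'city_a', 'city_b', 'province_a', 'province_b', 'province_c']
```

The untouched column `C` is kept as-is, so the result can be passed on without a separate concat step.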