One hot encoding with duplicate columns

I have a dataset with city and province names, but here in Belgium there is both a province called Limburg and a city called Limburg. So when I try to train the model, I get this error:

[LightGBM] [Fatal] Feature (Limbourg) appears more than one time.

I do the one-hot encoding like this:

import pandas as pd 
#One Hot Encoding of the Categorical features 
one_hot_city_name=pd.get_dummies(data.city_name) 
one_hot_state_of_the_building=pd.get_dummies(data.state_of_the_building) 
one_hot_province=pd.get_dummies(data.province) 
one_hot_region=pd.get_dummies(data.region)

How can I solve this? Maybe by appending "Province" to the one-hot encoded column names? But how?

CodePudding user response：

Given that you are one-hot encoding the features and using them as input for LightGBM, it won't hurt to rename the conflicting values, or modify them slightly, so that the column names no longer clash. I would therefore suggest:

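import pandas as pd
# One-hot encoding of the categorical features.
# Series.replace changes the values; Series.rename would only relabel the index.
# Adjust the key if the value is spelled differently in your data.
one_hot_city_name = pd.get_dummies(data.city_name.replace({'Limburg': 'Limburg (city)'}))
one_hot_state_of_the_building = pd.get_dummies(data.state_of_the_building)
one_hot_province = pd.get_dummies(data.province)
one_hot_region = pd.get_dummies(data.region)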
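
To check whether any column names still collide before handing the features to LightGBM, something like the following might help. It is only a sketch on toy data: the small frame, the helper duplicated_columns, and the use of Series.replace to change the city value are assumptions made for illustration, not part of the original dataset.

```python
import pandas as pd

# Toy data (assumed): 'Limburg' occurs both as a city and as a province.
data = pd.DataFrame({
    'city_name': ['Limburg', 'Gent', 'Limburg'],
    'province': ['Limburg', 'Oost-Vlaanderen', 'Limburg'],
})

def duplicated_columns(frame):
    """Return the column names that occur more than once (hypothetical helper)."""
    return sorted(set(frame.columns[frame.columns.duplicated()]))

# Without renaming, both dummy frames contain a 'Limburg' column.
plain = pd.concat([pd.get_dummies(data.city_name),
                   pd.get_dummies(data.province)], axis=1)
print(duplicated_columns(plain))

# After replacing the city value, the names no longer collide.
city = data.city_name.replace({'Limburg': 'Limburg (city)'})
fixed = pd.concat([pd.get_dummies(city),
                   pd.get_dummies(data.province)], axis=1)
print(duplicated_columns(fixed))
```

An empty list from the second call means every feature name is unique.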

CodePudding user response：

I'd use the prefix argument of the get_dummies function to give each set of features a user-defined prefix, like this:

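import pandas as pd

# Toy data: the value 'a' and the value 'b' each appear in both columns.
data = pd.DataFrame({'city': ['a', 'b', 'a'], 'province': ['b', 'a', 'c'],
                     'C': [1, 2, 3]})
# The prefix makes the names unique: city_a, city_b, province_a, ...
one_hot_city = pd.get_dummies(data.city, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')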
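
As a possible shortcut, get_dummies can also be called on the whole DataFrame with the columns argument, in which case each source column's name is used as the prefix by default. A minimal sketch on the same toy data (the variable name encoded is an assumption):

```python
import pandas as pd

data = pd.DataFrame({'city': ['a', 'b', 'a'], 'province': ['b', 'a', 'c'],
                     'C': [1, 2, 3]})

# Encode both categorical columns in one call; column 'C' is left untouched
# and each dummy column is named '<source column>_<value>'.
encoded = pd.get_dummies(data, columns=['city', 'province'])

print(list(encoded.columns))
print(encoded.columns.is_unique)
```

This keeps the encoded features in a single frame, so there is nothing to concatenate afterwards.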