One hot encoding with duplicate columns

Time:08-13

I have a dataset with city and province names, but here in Belgium there is both a province called Limburg and a city called Limburg. So when I try to train the model, I get this error:

[LightGBM] [Fatal] Feature (Limbourg) appears more than one time.

I do the one-hot encoding like this:

import pandas as pd 
#One Hot Encoding of the Categorical features 
one_hot_city_name=pd.get_dummies(data.city_name) 
one_hot_state_of_the_building=pd.get_dummies(data.state_of_the_building) 
one_hot_province=pd.get_dummies(data.province) 
one_hot_region=pd.get_dummies(data.region)

How can I solve this? Maybe by appending "Province" to the names of the one-hot encoded province columns? If so, how?

CodePudding user response:

Since you are one-hot encoding the features and feeding them to LightGBM, it won't hurt to rename the conflicting values, or modify them slightly, so that the column names no longer collide. I would suggest simply:

import pandas as pd 
#One Hot Encoding of the Categorical features 
one_hot_city_name=pd.get_dummies(data.city_name.replace({'Limburg':'Limburg (city)'}))  # replace() changes values; rename() would only relabel the index
one_hot_state_of_the_building=pd.get_dummies(data.state_of_the_building) 
one_hot_province=pd.get_dummies(data.province) 
one_hot_region=pd.get_dummies(data.region)
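
To check that renaming the value removes the clash, here is a minimal sketch on made-up data. The column names mirror the question, but the rows are hypothetical. It assumes the value is changed with Series.replace, since Series.rename with a dict relabels the index rather than the values.

```python
import pandas as pd

# Hypothetical toy data: column names follow the question, values are made up.
data = pd.DataFrame({
    "city_name": ["Limburg", "Gent", "Limburg"],
    "province": ["Limburg", "Antwerpen", "Oost-Vlaanderen"],
})

# Change the conflicting value in the city column only.
city = data["city_name"].replace({"Limburg": "Limburg (city)"})

one_hot_city_name = pd.get_dummies(city)
one_hot_province = pd.get_dummies(data["province"])

# Combine the encoded blocks side by side, as one would before training.
features = pd.concat([one_hot_city_name, one_hot_province], axis=1)

# True if any column name occurs more than once.
has_duplicates = features.columns.duplicated().any()
```

The same check on features.columns is a quick way to catch this kind of collision before passing the frame to the model.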

CodePudding user response:

I'd use the prefix argument of the get_dummies function to give each feature a user-defined prefix, like this:

import pandas as pd
data = pd.DataFrame({'city': ['a', 'b', 'a'], 'province': ['b', 'a', 'c'],
                     'C': [1, 2, 3]})
# The prefix is joined to each value with '_', e.g. city_a, province_a
one_hot_city = pd.get_dummies(data.city, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')
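
As a variation on the same idea, and only a sketch on the toy frame above, get_dummies can also be given the whole DataFrame. In that case it prefixes each dummy column with the name of its source column by default, so the encoded columns stay distinct without a separate prefix per call.

```python
import pandas as pd

data = pd.DataFrame({'city': ['a', 'b', 'a'], 'province': ['b', 'a', 'c'],
                     'C': [1, 2, 3]})

# Encode both categorical columns in one call; untouched columns (here 'C') are kept.
encoded = pd.get_dummies(data, columns=['city', 'province'])

column_names = list(encoded.columns)
# column_names is ['C', 'city_a', 'city_b', 'province_a', 'province_b', 'province_c']
```

Even though the values 'a' and 'b' occur in both columns, the prefixed names do not collide.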