Let's say I have the following df:
import pandas as pd
data = [{'c1': 'a', 'c2': 'x'}, {'c1': 'b', 'c2': 'y'}, {'c1': 'c', 'c2': 'z'}]
df = pd.DataFrame(data)
Output:
c1 c2
0 a x
1 b y
2 c z
Now I want to use pd.get_dummies() to one-hot encode the two categorical columns c1 and c2 and drop the first category of each column:
pd.get_dummies(df, columns=['c1', 'c2'], drop_first=True)
How can I decide which category to drop without knowing the order of the rows? Is there an option I missed?
EDIT:
So my goal would be, e.g., to drop category b from c1 and z from c2.
Output:
a c x y
0 1 0 1 0
1 0 0 0 1
2 0 1 0 0
CodePudding user response:
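One trick is to replace the unwanted values with NaN. By default get_dummies() creates no indicator column for missing values, so those categories are dropped. Here one value is removed per column:
import numpy as np
# values to exclude, keyed by column
d = {'c1': 'b', 'c2': 'z'}
# nested mapping for DataFrame.replace: {column: {value: NaN}}
d1 = {k: {v: np.nan} for k, v in d.items()}
out = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'], prefix='', prefix_sep='')
print(out)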
a c x y
0 1 0 1 0
1 0 0 0 1
2 0 1 0 0
If you need to remove multiple values per column, use lists:
d = {'c1':['b','c'], 'c2':['z']}
d1 = {k:{x: np.nan for x in v} for k, v in d.items()}
print (d1)
{'c1': {'b': nan, 'c': nan}, 'c2': {'z': nan}}
out = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'], prefix='', prefix_sep='')
print(out)
a x y
0 1 1 0
1 0 0 1
2 0 0 0
EDIT:
If the values are unique across the columns, it is simpler to drop the unwanted indicator columns in the last step:
out = (pd.get_dummies(df, columns=['c1', 'c2'], prefix='', prefix_sep='')
         .drop(['b', 'z'], axis=1))
print(out)
a c x y
0 1 0 1 0
1 0 0 0 1
2 0 1 0 0
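If the same label can appear in more than one column, dropping by bare name becomes ambiguous. A minimal sketch of one way around that, assuming you keep the default "<column>_<category>" names that get_dummies() produces; the helper name dummies_without is hypothetical and not part of pandas:

```python
import pandas as pd

def dummies_without(df, drop):
    """One-hot encode the columns named in `drop`, omitting one chosen category each.

    `drop` maps column name -> category to leave out (hypothetical helper).
    """
    # dtype=int gives 0/1 indicators regardless of the pandas version's default
    encoded = pd.get_dummies(df, columns=list(drop), dtype=int)
    # With the default prefix, each indicator column is named "<column>_<category>"
    unwanted = [f"{col}_{cat}" for col, cat in drop.items()]
    return encoded.drop(columns=unwanted)

df = pd.DataFrame({"c1": ["a", "b", "c"], "c2": ["x", "y", "z"]})
result = dummies_without(df, {"c1": "b", "c2": "z"})
print(result)
```

Because the prefixes are kept, the result has columns c1_a, c1_c, c2_x and c2_y, and the choice of dropped category does not depend on row order.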
CodePudding user response:
I'd highly recommend using sklearn's OneHotEncoder instead: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
After fitting the encoder, you can inspect the categories through the <your_fitted_instance_name>.categories_ attribute, and it also has an inverse_transform() method to reverse the one-hot encoding.
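As a rough illustration of those two features, here is a small sketch on the question's data. The variable names are only for this example, and it assumes a reasonably recent scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"c1": ["a", "b", "c"], "c2": ["x", "y", "z"]})

enc = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array for display
encoded = enc.fit_transform(df).toarray()

# One array of learned categories per input column, in sorted order
categories = [list(c) for c in enc.categories_]
print(categories)

# Map the indicator matrix back to the original labels
restored = enc.inverse_transform(encoded)
print(restored)
```

Since categories_ is sorted per column, the column order of the encoded matrix is determined by the labels themselves and not by the order of the rows.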
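As for dropping columns: the default is not to drop any. However, you can use OneHotEncoder(drop='first') to drop the first category of each feature. To choose the category yourself, pass a list with one category per feature, e.g. OneHotEncoder(drop=['b', 'z']).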
Edit: Also note that sklearn offers Pipelines, which can help you ensure consistent preprocessing throughout your project:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html