Getting dummies/encoding using multiple columns in pandas

Time:11-29

I am trying to get dummies for a variable for which data is split into multiple columns.

Input Data:
       fruit_1  fruit_2 fruit_3 fruit_4 fruit_5
Index                   
person1 Apple   NaN     NaN     NaN     NaN
person2 Apple   Banana  Guava   NaN     NaN
person3 Guava   NaN     NaN     NaN     NaN
person4 Banana  NaN     NaN     NaN     NaN
person5 Apple   Banana  Guava   Kiwi    Mango
person6 Kiwi    Mango   NaN     NaN     NaN
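
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the input above. The index name "Index" and the use of np.nan for the empty slots are assumptions read off the printed table, not something the question states explicitly.

```python
import numpy as np
import pandas as pd

NA = np.nan  # shorthand for an unused fruit slot

# Rebuild the question's input: one row per person, up to five fruit slots.
df = pd.DataFrame(
    [
        ["Apple", NA, NA, NA, NA],
        ["Apple", "Banana", "Guava", NA, NA],
        ["Guava", NA, NA, NA, NA],
        ["Banana", NA, NA, NA, NA],
        ["Apple", "Banana", "Guava", "Kiwi", "Mango"],
        ["Kiwi", "Mango", NA, NA, NA],
    ],
    index=pd.Index([f"person{i}" for i in range(1, 7)], name="Index"),
    columns=[f"fruit_{i}" for i in range(1, 6)],
)
print(df)
```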

Desired Output:

         Apple  Banana  Guava  Kiwi  Mango
Index
person1      1       0      0     0      0
person2      1       1      1     0      0
person3      0       0      1     0      0
person4      0       1      0     0      0
person5      1       1      1     1      1
person6      0       0      0     1      1

In most of the approaches I have tried, the NaN/blank values cause problems, because each row can hold anywhere from 1 to 5 values. I am using pandas. Thank you for your help.

CodePudding user response:

Use get_dummies on all columns, then aggregate the duplicated column names with max:

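# dtype=int keeps the 0/1 output (pandas 2.0+ returns booleans by default).
# Transposing avoids groupby(axis=1), which is deprecated since pandas 2.1.
df = pd.get_dummies(df, prefix='', prefix_sep='', dtype=int).T.groupby(level=0).max().T
print(df)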
         Apple  Banana  Guava  Kiwi  Mango
person1      1       0      0     0      0
person2      1       1      1     0      0
person3      0       0      1     0      0
person4      0       1      0     0      0
person5      1       1      1     1      1
person6      0       0      0     1      1

Or reshape first with DataFrame.stack, then aggregate with max over the first index level:

# dtype=int keeps the 0/1 output (pandas 2.0+ returns booleans by default)
df = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()
print(df)
         Apple  Banana  Guava  Kiwi  Mango
person1      1       0      0     0      0
person2      1       1      1     0      0
person3      0       0      1     0      0
person4      0       1      0     0      0
person5      1       1      1     1      1
person6      0       0      0     1      1
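
To see why taking the max per person works, here is a small sketch of the intermediate steps of the stack approach. It uses a two-person toy frame (an assumption made for brevity, not the question's data), and the explicit dropna() is added so the result does not depend on whether the installed pandas version drops NaN during stack().

```python
import numpy as np
import pandas as pd

# Toy input (hypothetical): two people, two fruit slots.
df = pd.DataFrame(
    {"fruit_1": ["Apple", "Apple"], "fruit_2": [np.nan, "Banana"]},
    index=["person1", "person2"],
)

# Long form: one entry per (person, slot); empty slots are discarded.
stacked = df.stack().dropna()

# One indicator row per filled slot, one column per distinct fruit.
dummies = pd.get_dummies(stacked, dtype=int)

# Collapse the slots of each person: 1 if any slot holds the fruit.
encoded = dummies.groupby(level=0).max()
print(encoded)
```

Because every filled slot contributes exactly one 1, the max over a person's rows acts as a logical "any", so the number of filled slots per row does not matter.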

CodePudding user response:

One option is to apply value_counts row-wise and fill any nulls:

# fillna(downcast=...) is deprecated in recent pandas; cast to int explicitly instead
df.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

         Apple  Banana  Guava  Kiwi  Mango
Index
person1      1       0      0     0      0
person2      1       1      1     0      0
person3      0       0      1     0      0
person4      0       1      0     0      0
person5      1       1      1     1      1
person6      0       0      0     1      1
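
One thing to keep in mind: value_counts returns counts, not indicators. The question's data never repeats a fruit within a row, so the result is already 0/1, but the sketch below assumes a hypothetical row with a repeated fruit and shows one possible way (clip) to cap the counts at 1.

```python
import numpy as np
import pandas as pd

# Hypothetical input: person1 lists Apple twice.
df = pd.DataFrame(
    {"fruit_1": ["Apple", "Kiwi"], "fruit_2": ["Apple", np.nan]},
    index=["person1", "person2"],
)

# Row-wise counts; fruits missing from a row come back as NaN, hence fillna.
counts = df.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

# Cap every count at 1 to turn counts into dummy indicators.
indicators = counts.clip(upper=1)
print(indicators)
```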

Alternatively, use a cross tab. It requires converting the data to long form first, a step the value_counts solution above avoids (which hopefully makes that one more efficient):

box = df.melt(ignore_index=False)

pd.crosstab(box.index, box.value)

value    Apple  Banana  Guava  Kiwi  Mango
row_0
person1      1       0      0     0      0
person2      1       1      1     0      0
person3      0       0      1     0      0
person4      0       1      0     0      0
person5      1       1      1     1      1
person6      0       0      0     1      1

Note that crosstab is not particularly fast, but it is quite convenient.
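
As a sanity check, the sketch below runs the stack, value_counts and crosstab approaches on the question's data and confirms they produce the same 0/1 matrix. The axis name "person" and the variable names are illustrative assumptions, and the crosstab variant melts with id_vars after reset_index rather than using ignore_index=False, which is a small deviation from the answer above.

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame(
    [
        ["Apple", NA, NA, NA, NA],
        ["Apple", "Banana", "Guava", NA, NA],
        ["Guava", NA, NA, NA, NA],
        ["Banana", NA, NA, NA, NA],
        ["Apple", "Banana", "Guava", "Kiwi", "Mango"],
        ["Kiwi", "Mango", NA, NA, NA],
    ],
    index=[f"person{i}" for i in range(1, 7)],
    columns=[f"fruit_{i}" for i in range(1, 6)],
)

# Approach 1: stack to long form, one-hot encode, collapse per person.
via_stack = pd.get_dummies(df.stack().dropna(), dtype=int).groupby(level=0).max()

# Approach 2: row-wise value counts, missing fruits filled with 0.
via_counts = df.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)

# Approach 3: melt to long form, then cross-tabulate person against fruit.
long = df.rename_axis("person").reset_index().melt(id_vars="person")
via_crosstab = pd.crosstab(long["person"], long["value"])

# Compare raw values only, since index and column names differ per approach.
for other in (via_counts, via_crosstab):
    assert (via_stack.to_numpy() == other.sort_index(axis=1).to_numpy()).all()
print(via_stack)
```

Agreement on this small input says nothing about relative speed; timing on data of realistic size would be needed for that.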
