I am trying to get dummies for a variable for which data is split into multiple columns.
Input Data:
fruit_1 fruit_2 fruit_3 fruit_4 fruit_5
Index
person1 Apple NaN NaN NaN NaN
person2 Apple Banana Guava NaN NaN
person3 Guava NaN NaN NaN NaN
person4 Banana NaN NaN NaN NaN
person5 Apple Banana Guava Kiwi Mango
person6 Kiwi Mango NaN NaN NaN
Desired Output:
Apple Banana Guava Kiwi Mango
Index
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1
In most of the approaches, I have tried the NaN/blank is causing issues as the number of values in each row can be anything from 1 to 5. I am using pandas to do so. Thank you for your help.
CodePudding user response:
Use get_dummies
by all columns with aggregate max
by duplicated columns names:
df = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).max()
print (df)
Apple Banana Guava Kiwi Mango
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1
Or reshape first by DataFrame.stack
, then aggregate max
by index, first level:
df = pd.get_dummies(df.stack()).groupby(level=0).max()
print (df)
Apple Banana Guava Kiwi Mango
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1
CodePudding user response:
One option is to apply value_counts
row wise and fill any nulls
df.apply(pd.Series.value_counts, axis = 1).fillna(0, downcast='infer')
Apple Banana Guava Kiwi Mango
Index
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1
You could do a cross tab, which requires a conversion to long form (a step the solution above avoids, hopefully offering more efficiency):
box = df.melt(ignore_index = False)
pd.crosstab(box.index, box.value)
value Apple Banana Guava Kiwi Mango
row_0
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1
Note that cross tab is not really speedy, but is quite convenient.