I have a pandas df of objects of different lengths:
data = {'object_1':[1, 3, 4, 50],
'object_2':[1, 50],
"object_3": [1, 3, 4, 5, 50],
"object_4": [3, 5, 47, 48]}
I would like to change the df so that:
the total number of columns is equal to the highest value in the df,
the columns have numeric names (from 1 to the highest int value among the objects),
the df is also converted to binary in a special way: only the cells in columns whose names match the original object's values hold "1"; the rest hold "0".
So, for example, the "object_3" row will be filled with "1" only in columns 1, 3, 4, 5, 50. The rest will be zeros.
Columns with all zeros will be dropped.
new_data = {'object_1':[1, 1, 1, 0, 0, 0, 1], 'object_2':[1, 0, 0, 0, 0, 0, 1], "object_3": [1, 1, 1, 1, 0, 0, 1], "object_4": [0, 1, 0, 1, 1, 1, 0]}
...the names of the columns are therefore: "1", "3", "4", "5", "47", "48", "50".
The key is: placing "1" in the cells of the new columns that correspond to each object's int values.
This all could be called a "back-off matrix". I found some inspiration here: Pandas data frame: convert Int column into binary in python
I would prefer a NumPy solution over for loops but will be happy with anything that works.
Thanks.
CodePudding user response:
This way of representing data is called one-hot encoding and can be achieved using pandas.get_dummies. You have two good solutions in the two top answers of this thread.
Choosing the first one for simplicity, here is a solution adapted to your question:
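import pandas as pd

s = pd.Series(data)  # one list of values per object
# One row per value, one-hot encode, then collapse back to one row per object.
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent.
df = pd.get_dummies(s.apply(pd.Series).stack().astype(int)).groupby(level=0).sum()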
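Since you mentioned preferring NumPy, here is a minimal sketch of the same idea without get_dummies. It assumes every value is an int; the names rows, cols, columns, matrix and df_np are only for illustration. Columns are created only for values that actually occur, so there are no all-zero columns to drop.

```python
import numpy as np
import pandas as pd

data = {'object_1': [1, 3, 4, 50],
        'object_2': [1, 50],
        'object_3': [1, 3, 4, 5, 50],
        'object_4': [3, 5, 47, 48]}

# Flatten all values and remember which row (object) each one came from.
lengths = [len(v) for v in data.values()]
rows = np.repeat(np.arange(len(data)), lengths)
values = np.concatenate(list(data.values()))

# np.unique returns the sorted distinct values (the column names) and,
# with return_inverse=True, the column position of every flattened value.
columns, cols = np.unique(values, return_inverse=True)

# Start from zeros and set a 1 at every (row, column) pair that occurs.
matrix = np.zeros((len(data), len(columns)), dtype=int)
matrix[rows, cols] = 1

df_np = pd.DataFrame(matrix, index=list(data), columns=columns)
print(df_np)
```

One difference from the get_dummies version: if a value is repeated inside one object's list, this sketch still writes 1, whereas summing the dummies would give a count greater than 1.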