How do I get something similar to a one-hot encoding, except for 1s and 0s, returning values from a list of the same size as the labels?
As an example DataFrame, with Sklearn's MultiLabelBinarizer
I can turn the following DataFrame of lists:
df = pd.DataFrame({'labels': [['A', 'B'], ['B', 'D', 'F'],
['A', 'C', 'D'], ['D', 'B']],
'values': [[13.2, 1.3], [7.9, 12, 3.2],
[9.0, 1, 14.8], [3.1, 6.7]]})
labels values
0 [A, B] [13.2, 1.3]
1 [B, D, F] [7.9, 12, 3.2]
2 [A, C, D] [9.0, 1, 14.8]
3 [D, B] [3.1, 6.7]
mlb = MultiLabelBinarizer(sparse_output=True)
df = df.join(
pd.DataFrame.sparse.from_spmatrix(
mlb.fit_transform(df.pop('labels')),
index=df.index,
columns=mlb.classes_))
To:
values A B C D F
0 [13.2, 1.3] 1 1 0 0 0
1 [7.9, 12, 3.2] 0 1 0 1 1
2 [9.0, 1, 14.8] 1 0 1 1 0
3 [3.1, 6.7] 0 1 0 1 0
While I am looking for a way to get this:
A B C D F
0 13.2 1.3 0 0 0
1 0 7.9 0 12.0 3.2
2 9.0 0 1. 14.8 0
3 0 6.7 0 3.1 0
CodePudding user response:
You could explode
reset_index
pivot
fillna
:
out = (pd.pivot(df.explode(df.columns.tolist()).reset_index(),
'index', 'labels', 'values')
.fillna(0)
.rename_axis(index=[None], columns=[None]))
Output:
A B C D F
0 13.2 1.3 0 0.0 0.0
1 0.0 7.9 0 12.0 3.2
2 9.0 0.0 1 14.8 0.0
3 0.0 6.7 0 3.1 0.0
CodePudding user response:
Create sample data
df = pd.DataFrame()
df['labels'] = [['A', 'B'], ['B', 'D', 'F'], ['A', 'C', 'D'], ['D', 'B']]
df['values'] = [[13.2, 1.3], [7.9, 12, 3.2], [9.0, 1, 14.8], [3.1, 6.7]]
List comprehension to loop over each row value and combine them to create a dictionary
transform = [dict(zip(label, value)) for label, value in zip(df['labels'], df['values'])]
Put new data into a dataframe
new_df = pd.DataFrame(transform)
Replace nan with 0
new_df.replace(np.nan, 0)