Home > front end >  Combining a pandas DataFrame of lists, into a "one-hot" with values from a column of lists
Combining a pandas DataFrame of lists, into a "one-hot" with values from a column of lists

Time:03-06

How do I get something similar to a one-hot encoding, except for 1s and 0s, returning values from a list of the same size as the labels?

As an example DataFrame, with Sklearn's MultiLabelBinarizer I can turn the following DataFrame of lists:

df = pd.DataFrame({'labels': [['A', 'B'], ['B', 'D', 'F'], 
                              ['A', 'C', 'D'], ['D', 'B']],
                   'values': [[13.2, 1.3], [7.9, 12, 3.2], 
                              [9.0, 1, 14.8], [3.1, 6.7]]})

      labels          values
0     [A, B]     [13.2, 1.3]
1  [B, D, F]  [7.9, 12, 3.2]
2  [A, C, D]  [9.0, 1, 14.8]
3     [D, B]      [3.1, 6.7]

mlb = MultiLabelBinarizer(sparse_output=True)

df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('labels')),
        index=df.index,
        columns=mlb.classes_))

To:

           values  A  B  C  D  F
0     [13.2, 1.3]  1  1  0  0  0
1  [7.9, 12, 3.2]  0  1  0  1  1
2  [9.0, 1, 14.8]  1  0  1  1  0
3      [3.1, 6.7]  0  1  0  1  0

While I am looking for a way to get this:

    A     B     C     D     F
0   13.2  1.3   0     0     0
1   0     7.9   0     12.0  3.2
2   9.0   0     1.    14.8  0
3   0     6.7   0     3.1   0

CodePudding user response:

You could explode reset_index pivot fillna:

out = (pd.pivot(df.explode(df.columns.tolist()).reset_index(), 
                'index', 'labels', 'values')
       .fillna(0)
       .rename_axis(index=[None], columns=[None]))

Output:

      A    B  C     D    F
0  13.2  1.3  0   0.0  0.0
1   0.0  7.9  0  12.0  3.2
2   9.0  0.0  1  14.8  0.0
3   0.0  6.7  0   3.1  0.0

CodePudding user response:

Create sample data

df = pd.DataFrame()
df['labels'] = [['A', 'B'], ['B', 'D', 'F'], ['A', 'C', 'D'], ['D', 'B']]
df['values'] = [[13.2, 1.3], [7.9, 12, 3.2], [9.0, 1, 14.8], [3.1, 6.7]]

List comprehension to loop over each row value and combine them to create a dictionary

transform =  [dict(zip(label, value)) for label, value in zip(df['labels'], df['values'])]

Put new data into a dataframe

new_df = pd.DataFrame(transform)

Replace nan with 0

new_df.replace(np.nan, 0)
  • Related