Expand Pandas series into dataframe by unique values-CodePudding

I would like to expand a series or dataframe into a sparse matrix based on the unique values of the series. It's a bit hard to explain verbally but an example should be clearer.

First, simpler version - if I start with this:

Idx  Tag
0    A
1    B
2    A
3    C
4    B

I'd like to get something like this, where the unique values in the starting series are the column values here (could be 1s and 0s, Boolean, etc.):

Idx  A   B   C
0    1   0   0
1    0   1   0
2    1   0   0
3    0   0   1
4    0   1   0

Second, more advanced version - if I have values associated with each entry, preserving those and filling the rest of the matrix with a placeholder (0, NaN, something else), e.g. starting from this:

Idx  Tag  Val
0    A    5
1    B    2
2    A    3
3    C    7
4    B    1

And ending up with this:

Idx  A   B   C
0    5   0   0
1    0   2   0
2    3   0   0
3    0   0   7
4    0   1   0

What's a Pythonic way to do this?

CodePudding user response：

Here's how to do it, using pandas.get_dummies() which was designed specifically for this (often called "one-hot-encoding" in ML). I've done it step-by-step so you can see how it's done ;)

>>> df
   Idx Tag  Val
0    0   A    5
1    1   B    2
2    2   A    3
3    3   C    7
4    4   B    1

>>> pd.get_dummies(df['Tag'])
   A  B  C
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
4  0  1  0

>>> pd.concat([df[['Idx']], pd.get_dummies(df['Tag'])], axis=1)
   Idx  A  B  C
0    0  1  0  0
1    1  0  1  0
2    2  1  0  0
3    3  0  0  1
4    4  0  1  0

>>> pd.get_dummies(df['Tag']).to_numpy()
array([[1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0]], dtype=uint8)

>>> df2[['Val']].to_numpy()
array([[5],
       [2],
       [3],
       [7],
       [1]])

>>> pd.get_dummies(df2['Tag']).to_numpy() * df2[['Val']].to_numpy()
array([[5, 0, 0],
       [0, 2, 0],
       [3, 0, 0],
       [0, 0, 7],
       [0, 1, 0]])

>>> pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())
   A  B  C
0  5  0  0
1  0  2  0
2  3  0  0
3  0  0  7
4  0  1  0

>>> pd.concat([df, pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())], axis=1)
   Idx Tag  Val  A  B  C
0    0   A    5  5  0  0
1    1   B    2  0  2  0
2    2   A    3  3  0  0
3    3   C    7  0  0  7
4    4   B    1  0  1  0

CodePudding user response：

Based on @user17242583 's answer, found a pretty simple way to do it using pd.get_dummies combined with DataFrame.multiply:

>>> df
   Tag  Val
0   A   5
1   B   2
2   A   3
3   C   7
4   B   1

>>> pd.get_dummies(df['Tag'])
    A   B   C
0   1   0   0
1   0   1   0
2   1   0   0
3   0   0   1
4   0   1   0

>>> pd.get_dummies(df['Tag']).multiply(df['Val'], axis=0)
    A   B   C
0   5   0   0
1   0   2   0
2   3   0   0
3   0   0   7
4   0   1   0