I would like to expand a series or dataframe into a sparse matrix based on the unique values of the series. It's a bit hard to explain verbally but an example should be clearer.
First, simpler version - if I start with this:
Idx Tag
0 A
1 B
2 A
3 C
4 B
I'd like to get something like this, where the unique values in the starting series are the column values here (could be 1s and 0s, Boolean, etc.):
Idx A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
Second, more advanced version - if I have values associated with each entry, preserving those and filling the rest of the matrix with a placeholder (0, NaN, something else), e.g. starting from this:
Idx Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
And ending up with this:
Idx A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
What's a Pythonic way to do this?
CodePudding user response:
Here's how to do it, using pandas.get_dummies()
which was designed specifically for this (often called "one-hot-encoding" in ML). I've done it step-by-step so you can see how it's done ;)
>>> df
Idx Tag Val
0 0 A 5
1 1 B 2
2 2 A 3
3 3 C 7
4 4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.concat([df[['Idx']], pd.get_dummies(df['Tag'])], axis=1)
Idx A B C
0 0 1 0 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 0 1 0
>>> pd.get_dummies(df['Tag']).to_numpy()
array([[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]], dtype=uint8)
>>> df2[['Val']].to_numpy()
array([[5],
[2],
[3],
[7],
[1]])
>>> pd.get_dummies(df2['Tag']).to_numpy() * df2[['Val']].to_numpy()
array([[5, 0, 0],
[0, 2, 0],
[3, 0, 0],
[0, 0, 7],
[0, 1, 0]])
>>> pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
>>> pd.concat([df, pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())], axis=1)
Idx Tag Val A B C
0 0 A 5 5 0 0
1 1 B 2 0 2 0
2 2 A 3 3 0 0
3 3 C 7 0 0 7
4 4 B 1 0 1 0
CodePudding user response:
Based on @user17242583 's answer, found a pretty simple way to do it using pd.get_dummies
combined with DataFrame.multiply
:
>>> df
Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.get_dummies(df['Tag']).multiply(df['Val'], axis=0)
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0