I would like to transform a dataframe into a feature matrix (actually, I'm not sure it is called a feature matrix).
import pandas as pd
df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})
Car Color
0 Audi red
1 Toyota red
2 Chrysler blue
3 Toyota silver
4 Chrysler blue
5 Chrysler silver
I would like to create a matrix with cars and colors as index and columns, where True or 1 marks a possible combination, as follows:
Color Audi Chrysler Toyota
0 blue 0 1 0
1 red 1 0 1
2 silver 0 1 1
I can create a matrix and then iterate over the rows and enter the values, but this takes quite a long time. Is there a better way to create this matrix?
Kind regards, Stephan
CodePudding user response:
pivot_table would seem to apply here:
df.pivot_table(index="Car", columns="Color", aggfunc=len)
Which gives:
Color blue red silver
Car
Audi NaN 1.0 NaN
Chrysler 2.0 NaN 1.0
Toyota NaN 1.0 1.0
You specify the vertical component as the index column (Car), and the horizontal one as the columns component (Color), then provide a function to fill the cells (len).
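As a rough sketch of how those three pieces fit together, the version below makes the counted values explicit. The helper column n is an assumption introduced here for illustration (it is not part of the original data): every row contributes 1 and the cells are summed, so combinations that never occur stay NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "Car": ["Audi", "Toyota", "Chrysler", "Toyota", "Chrysler", "Chrysler"],
    "Color": ["red", "red", "blue", "silver", "blue", "silver"],
})

# "n" is a hypothetical helper column: every row counts as one occurrence.
counts = df.assign(n=1).pivot_table(
    index="Car",      # vertical axis
    columns="Color",  # horizontal axis
    values="n",       # what gets aggregated in each cell
    aggfunc="sum",    # how the values that land in a cell are combined
)
print(counts)
```

Naming the values column explicitly also avoids relying on how pivot_table behaves when no value columns are left over.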
Then, to refine it a little, you could use fillna() to "paint" the empty cells with zeros, and apply a logical test to show which combinations are "possible", e.g.:
df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0)>0
Which gives:
Color blue red silver
Car
Audi False True False
Chrysler True False True
Toyota False True True
And as a final bit of polish, having learned about it from here, you could run an applymap (DataFrame.map in pandas 2.1 and later) to get your 0,1 output:
(df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0) > 0).applymap(lambda x: 1 if x else 0)
Giving:
Color blue red silver
Car
Audi 0 1 0
Chrysler 1 0 1
Toyota 0 1 1
Finally, this process is sometimes referred to in the literature as one-hot encoding, and there are some cool implementations, such as this one from sklearn, in case your investigations lead you in that direction.
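To make the link to one-hot encoding concrete, here is a minimal sketch that uses pandas' own get_dummies rather than sklearn (a swapped-in choice, not the implementation referred to above). It one-hot encodes Car and then takes the per-Color maximum, so a cell is 1 if that car appears in that color at least once.

```python
import pandas as pd

df = pd.DataFrame({
    "Car": ["Audi", "Toyota", "Chrysler", "Toyota", "Chrysler", "Chrysler"],
    "Color": ["red", "red", "blue", "silver", "blue", "silver"],
})

# One indicator column per car; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["Car"], dtype=int)

# Collapse the rows by color: the max of the indicators is 1 if any row
# with that color had the car, otherwise 0.
matrix = one_hot.groupby(df["Color"]).max()
print(matrix)
```

This produces colors as the index and cars as the columns, which is the orientation asked for in the question.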
CodePudding user response:
Extending Thomas's answer, the code below should give exactly the output you want:
import pandas as pd
df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})
output = (df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0).T > 0).astype(int)
print(output)
Car Audi Chrysler Toyota
Color
blue 0 1 0
red 1 0 1
silver 0 1 1
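For comparison, a similar result can be reached without pivot_table. The sketch below uses pd.crosstab, which is a different technique from the one in the answers above: it counts each Color/Car pair and then caps the counts at 1, so no fillna or transpose step is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "Car": ["Audi", "Toyota", "Chrysler", "Toyota", "Chrysler", "Chrysler"],
    "Color": ["red", "red", "blue", "silver", "blue", "silver"],
})

# crosstab counts how often each (Color, Car) pair occurs; missing pairs are 0.
pair_counts = pd.crosstab(df["Color"], df["Car"])

# Cap the counts at 1 to turn them into a 0/1 "is this combination possible" flag.
presence = pair_counts.clip(upper=1)
print(presence)
```

Because crosstab fills absent combinations with 0 rather than NaN, the result stays an integer table throughout.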