Create feature matrix from Dataframe

Time:11-04

I would like to transform a dataframe into a feature matrix (actually, I'm not sure it is called a feature matrix).

df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'], 
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})
   Car       Color
0  Audi      red
1  Toyota    red
2  Chrysler  blue
3  Toyota    silver
4  Chrysler  blue
5  Chrysler  silver

I would like to create a matrix with colors as the index and cars as the columns, where True (or 1) marks a possible combination, as follows:

   Color   Audi  Chrysler  Toyota
0  blue    0     1         0
1  red     1     0         1
2  silver  0     1         1

I can create a matrix and then iterate over the rows to enter the values, but this takes quite a long time. Is there a better way to create this matrix?

Kind regards, Stephan

CodePudding user response:

pivot_table would seem to apply here:

df.pivot_table(index="Car", columns="Color", aggfunc=len)

Which gives:

Color       blue    red     silver
Car         
Audi        NaN     1.0     NaN
Chrysler    2.0     NaN     1.0
Toyota      NaN     1.0     1.0

You specify the vertical component as the index column (Car), and the horizontal one as the columns component (Color), then provide a function to fill the cells (len).
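
To make the mechanics concrete, here is a minimal sketch of the same idea written with groupby instead of pivot_table (a swapped-in technique, not the call above): group on the (Car, Color) pair, count the rows in each group, then move Color from the index into the columns. The names pair_counts and counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})

# Count the rows for every (Car, Color) pair that occurs in the data.
pair_counts = df.groupby(['Car', 'Color']).size()

# Move the Color level into the columns; pairs that never occur become NaN.
counts = pair_counts.unstack('Color')
print(counts)
```

Cars end up on the vertical axis and colors on the horizontal one, with NaN wherever a pair does not occur.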

Then, to refine it a little, you could use fillna() to "paint" the empty cells with zeros, and apply a logical test to show which combinations are "possible", e.g.:

df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0)>0

Which gives:

Color       blue    red     silver
Car         
Audi        False   True    False
Chrysler    True    False   True
Toyota      False   True    True

And as a final bit of polish, having learned about it from here, you could run an applymap to get your 0/1 output (note that pandas 2.1 and later deprecate applymap in favor of DataFrame.map):

(df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0)>0).applymap(lambda x : 1 if x==True else 0)

Giving:

Color       blue    red     silver
Car         
Audi        0       1       0
Chrysler    1       0       1
Toyota      0       1       1

Finally, this process is sometimes referred to in the literature as one-hot encoding, and there are some cool implementations, such as this one from sklearn, in case your investigations lead you in that direction.
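
As a rough illustration of the one-hot idea, and assuming pandas' get_dummies rather than the sklearn implementation mentioned above, each car can be expanded into its own indicator column and the indicators then combined per color. The names one_hot and membership are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})

# One indicator column per car: each row is "hot" only in the column of its own car.
one_hot = pd.get_dummies(df['Car']).astype(int)

# For each color, take the column-wise maximum:
# 1 if at least one row of that color belongs to that car, otherwise 0.
membership = one_hot.groupby(df['Color']).max()
print(membership)
```

This yields colors on the index and cars on the columns, which is the orientation asked for in the question.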

CodePudding user response:

Extending Thomas's answer, the code below should give exactly the output you want:

import pandas as pd

df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'], 
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})

output = (df.pivot_table(index="Car", columns="Color", aggfunc=len).fillna(0).T > 0).astype(int)
print(output)

Car     Audi  Chrysler  Toyota
Color                         
blue       0         1       0
red        1         0       1
silver     0         1       1
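
As an alternative sketch (not part of the answer above), pd.crosstab can build the count table directly, with absent pairs already filled in as 0, so no fillna or transpose is needed. Capping the counts at 1 is one possible way to turn them into 0/1 flags; the names counts and matrix are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Car': ['Audi', 'Toyota', 'Chrysler', 'Toyota', 'Chrysler', 'Chrysler'],
                   'Color': ['red', 'red', 'blue', 'silver', 'blue', 'silver']})

# crosstab counts how often each (Color, Car) pair occurs; absent pairs are 0.
counts = pd.crosstab(df['Color'], df['Car'])

# Cap the counts at 1 so the table only records whether a pair occurs at all.
matrix = counts.clip(upper=1)
print(matrix)
```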