Home > Enterprise >  Perform groupby one hot encoding in pandas
Perform groupby one hot encoding in pandas

Time:09-13

I have a data frame as shown below

df:

cust_id         product
1               tv
1               phone
2               bat
2               ball
3               bat
4               ball
4               bat
4               tv
4               phone
5               tv
6               bat
7               bat
7               ball
7               tv
7               phone
8               phone
8               tv

from the above I would like to prepare below which basically perform groupby cust_id and one hot encoding on the column product.

Expected output:

product       ball      bat       phone     tv
cust_id      
1             0         0         1         1
2             1         1         0         0
3             0         1         0         0
4             1         1         1         1
5             0         0         0         1
6             0         1         0         0
7             1         1         1         1
8             0         0         1         1 

I have tried following code but that did not work.

try-1:

one_hot = pd.get_dummies(df.groupby('cust_id')['product'])
one_hot

try-2

df.pivot_table(index=['cust_id'], 
                      columns='product', 
                      values='product')

try-3

df1 = (df.sort_values(['cust_id', 'product'])
         .groupby('cust_id')['product'].agg(list)
         .reset_index(name='product_list')
      )

df1

CodePudding user response:

pd.crosstab:

pd.crosstab(df['cust_id'],df['product'])

product  ball  bat  phone  tv
cust_id                      
1           0    0      1   1
2           1    1      0   0
3           0    1      0   0
4           1    1      1   1
5           0    0      0   1
6           0    1      0   0
7           1    1      1   1
8           0    0      1   1

CodePudding user response:

Because hot encoding alwayes return 0,1 as necessary aggregate max:

one_hot = pd.get_dummies(df.set_index('cust_id')['product']).groupby(level=0).max()

And if count values per columns need clip for maximal 1 value:

one_hot = pd.crosstab(df["cust_id"], df["product"]).clip(upper=1)

Sample:

print (df)
    cust_id product
0         1    ball
1         1    ball
2         1    ball
3         1   phone
4         2     bat
5         2    ball
6         3     bat
7         4    ball
8         4     bat
9         4     bat
10        4      tv
11        4   phone

one_hot = pd.get_dummies(df.set_index('cust_id')['product']).groupby(level=0).max()
print (one_hot)
         ball  bat  phone  tv
cust_id                      
1           1    0      1   0
2           1    1      0   0
3           0    1      0   0
4           1    1      1   1

#count values - possible another integers like 0,1
print (pd.crosstab(df["cust_id"], df["product"]))
product  ball  bat  phone  tv
cust_id                      
1           3    0      1   0
2           1    1      0   0
3           0    1      0   0
4           1    2      1   1

one_hot = pd.crosstab(df["cust_id"], df["product"]).clip(upper=1)
print (one_hot)
product  ball  bat  phone  tv
cust_id                      
1           1    0      1   0
2           1    1      0   0
3           0    1      0   0
4           1    1      1   1

CodePudding user response:

Try:

x = pd.crosstab(df["cust_id"], df["product"])
print(x)

Prints:

product  ball  bat  phone  tv
cust_id                      
1           0    0      1   1
2           1    1      0   0
3           0    1      0   0
4           1    1      1   1
5           0    0      0   1
6           0    1      0   0
7           1    1      1   1
8           0    0      1   1
  • Related