Home > Blockchain >  Pandas groupby of specific catergorical column
Pandas groupby of specific catergorical column

Time:09-07

With reference to Pandas groupby with categories with redundant nan

import pandas as pd

df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()

              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
     4  A         0
        B         0
     5  A         0
        B         0
     8  A         0
        B         0
2    1  A         0
        B         0
     2  A         0
        B         0
     4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0

Expected output

              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0

I tried to use astype("category") for TYPE. However it seems to output every cartesian product of every item in every group.

CodePudding user response:

What you want is a little abnormal, but we can force it there from a pivot table:

out = df.pivot_table(index=['TEAM', 'ID'], 
                     columns=['TYPE'],  
                     values=['VALUE'], 
                     aggfunc='sum', 
                     observed=True, # This is the key when working with categoricals~
                                    # You should known to try this with your groupby from the post you linked...
                     fill_value=0).stack()
print(out)

Output:

              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0

CodePudding user response:

here is one way to do it, based on the data you shared

reset the index and then do the groupby to choose groups where sum is greater than 0, means either of the category A or B is non-zero. Finally set the index

df.reset_index(inplace=True)

(df[df.groupby(['TEAM','ID'])['VALUE']
    .transform(lambda x: x.sum()>0)]
    .set_index(['TEAM','ID','TYPE']))

                 VALUE
TEAM    ID  TYPE    
1     1     A      1
            B      1
      2     A      1
            B      1
2     4     A      1
            B      0
      5     A      1
            B      0
      8     A      1
            B      0
  • Related