Home > Software engineering >  Pandas group_by multiple columns with conditional count
Pandas group_by multiple columns with conditional count

Time:09-09

I have a dataframe, for instance:

df = pd.DataFrame({'Host': {0: 'N',
  1: 'B',
  2: 'N',
  3: 'N',
  4: 'N',
  5: 'V',
  6: 'B'},
 'Registration': {0: 'Registered',
  1: 'MR',
  2: 'Registered',
  3: 'Registered',
  4: '',
  5: 'Registered',
  6: 'Registered'},
 'Val': {0: 'N',
  1: 'B',
  2: 'N',
  3: 'N',
  4: '',
  5: 'V',
  6: 'B'},
 'Sum': {0: 100.0,
  1: 0.0,
  2: 300.0,
  3: 150.0,
  4: 0.0,
  5: 0.0,
  6: 20.0}})

I want to get the count, for each Host. Something like:

df.groupby("Host").count()
"""
Host Registration   Val  Sum
            
B      2              2   2
N      4              4   4
V      1              1   1
"""

But I want it conditional as a function of each column. For example, I want to count in Sum, only those rows that have more than 0.0, and in the others the ones that are not empty. So my expected output would be:

Host Registration   Val  Sum
            
B      2              2   1
N      3              3   3
V      1              1   0
"""

Not sure how to do that. My best attempt has been:

 df.groupby("Host").agg({'Registration': lambda x: (x != "").count(), 
                         'Val':lambda x: (x != "").count(),
                         'Sum': lambda x: (x != 0).count()})

But this produces the same output as df.groupby("Host").count()

Any suggestion?

CodePudding user response:

First your solution - for count Trues values use sum:

df = df.groupby("Host").agg({'Registration': lambda x: (x != "").sum(), 
                         'Val':lambda x: (x != "").sum(),
                         'Sum': lambda x: (x != 0).sum()})

print (df)
      Registration  Val  Sum
Host                        
B                2    2  1.0
N                3    3  3.0
V                1    1  0.0

Improved solution - create boolean columns before aggregation sum:

df = df.assign(Registration = df['Registration'].ne(""),
               Val = df['Val'].ne(""),
               Sum = df['Sum'].ne(0)).groupby("Host").sum()

print (df)
      Registration  Val  Sum
Host                        
B                2    2    1
N                3    3    3
V                1    1    0
  • Related