Succinct / pythonic method of groupby percentage calculation

I want to group by two columns and, within each group, calculate the percentage of values above a certain threshold. I have achieved this with two groupbys into temporary dataframes (one counting values over the threshold, the other counting all values per group), then merging the dataframes and calculating the percentage. I assume there is a more succinct method that doesn't require creating temporary dataframes.

e.g.

import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})
1. Group by 'n' and 'm' and count:
dfcount = df.groupby(['n', 'm'])['v'].agg('count').reset_index().rename(columns={'v': 'count'})
2. Filter values over 1, then group by 'n' and 'm' and count:
dfover1 = df[df['v'] > 1].groupby(['n','m']).count().reset_index().rename(columns={'v': 'over1count'})
3. Merge the two new dataframes on the groupby columns 'n' and 'm', then calculate the percentage over 1 for each group:
dfnew = dfcount.merge(dfover1, on=['n', 'm'])
dfnew['late%'] = (100 * dfnew['over1count'] / dfnew['count']).round(1)
   n  m  count  over1count  late%
0  a  f      2           2  100.0
1  a  j      3           2   66.7
2  c  f      1           1  100.0

CodePudding user response:

You can use method chaining (a pipeline) to avoid named intermediate dataframes:

out = (df
   .groupby(['n', 'm'], as_index=False)
   # compute count and latecount
   .agg(**{'count': ('v', 'count'),
           'latecount': ('v', lambda v: len(v[v>1]))
          })
   # filter rows with latecount > 0
   .query('latecount > 0')
   # compute late%
   .assign(**{'late%': lambda d: d['latecount'].div(d['count']).mul(100).round(2)})
)

output:

   n  m  count  latecount   late%
0  a  f      2          2  100.00
1  a  j      3          2   66.67
3  c  f      1          1  100.00
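
If you prefer, the percentage can also be computed inside the same agg call rather than in a trailing assign. This is a minimal sketch of that variation (my addition, not part of the chained answer above; the question's df is restated so the snippet runs on its own) and it produces the same three rows:

import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})

out = (df
   .groupby(['n', 'm'], as_index=False)
   # count, latecount and late% all come from one named aggregation
   .agg(**{'count': ('v', 'count'),
           'latecount': ('v', lambda v: (v > 1).sum()),
           'late%': ('v', lambda v: round((v > 1).mean() * 100, 2))})
   # drop groups with no values over 1
   .query('latecount > 0')
)
print(out)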

CodePudding user response:

For better performance, you can create a helper boolean column `later` (marking `v` values over 1) before the groupby instead of filtering within each group; counting the `True` values is then just a `sum` aggregation:

dfnew = (df.assign(later = df['v'] > 1)
           .groupby(['n', 'm'], as_index=False)
           .agg(count=('v', 'count'),
                over1count=('later', 'sum'))
           .query('over1count > 0'))

dfnew['late%'] = (100 * dfnew['over1count'] / dfnew['count']).round(1)

print(dfnew)
   n  m  count  over1count  late%
0  a  f      2           2  100.0
1  a  j      3           2   66.7
3  c  f      1           1  100.0   
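
A small follow-up on this design choice (my addition, again restating the question's df so the snippet is self-contained): because `later` is boolean, its group mean is already the fraction of values over 1, so `late%` can be aggregated directly instead of being derived from the two counts, with the same output as above:

import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})

# mean of a boolean column = fraction of True values per group
dfnew = (df.assign(later=df['v'] > 1)
           .groupby(['n', 'm'], as_index=False)
           .agg(count=('v', 'count'),
                over1count=('later', 'sum'),
                **{'late%': ('later', 'mean')})
           .query('over1count > 0'))

dfnew['late%'] = (100 * dfnew['late%']).round(1)
print(dfnew)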