I want to group a dataframe and calculate, within each group, the percentage of rows whose value is over a certain threshold. I have achieved this with two groupbys into temporary dataframes (the per-group count of rows over the value, and the per-group count of all rows), then merging the two dataframes before calculating the percentage. I assume there is a more succinct method that doesn't require creating temporary dataframes.
For example:
import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})
- Group by 'n' and 'm' and count the rows in each group
dfcount = df.groupby(['n', 'm'])['v'].agg('count').reset_index().rename(columns={'v': 'count'})
- Filter values over 1, then groupby 'n' and 'm' and count
dfover1 = df[df['v'] > 1].groupby(['n','m']).count().reset_index().rename(columns={'v': 'over1count'})
- Merge the two new dataframes on the groupby columns 'n' and 'm', then calculate the percentage over 1 for each group.
dfnew = dfcount.merge(dfover1, on=['n', 'm'])
dfnew['late%'] = (100 * dfnew['over1count'] / dfnew['count']).round(1)
| | n | m | count | over1count | late% |
|---|---|---|---|---|---|
| 0 | a | f | 2 | 2 | 100.0 |
| 1 | a | j | 3 | 2 | 66.7 |
| 2 | c | f | 1 | 1 | 100.0 |
CodePudding user response:
You can use method chaining (pipeline) to avoid using named intermediates:
out = (df
       .groupby(['n', 'm'], as_index=False)
       # compute count and latecount
       .agg(**{'count': ('v', 'count'),
               'latecount': ('v', lambda v: len(v[v > 1]))
               })
       # filter rows with latecount > 0
       .query('latecount > 0')
       # compute late%
       .assign(**{'late%': lambda d: d['latecount'].div(d['count']).mul(100).round(2)})
       )
output:
   n  m  count  latecount   late%
0  a  f      2          2  100.00
1  a  j      3          2   66.67
3  c  f      1          1  100.00
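The .query('latecount > 0') step only reproduces the effect of the inner merge in the question, which drops groups that have no value over 1. A minimal sketch of the same chain without that filter, assuming you would rather keep those groups and report them as 0%; the name out_all is hypothetical and the threshold 1 is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})

# Same chain as above, minus the .query() filter (hypothetical variant):
# groups with no value over 1 stay in the result with latecount == 0.
out_all = (df
           .groupby(['n', 'm'], as_index=False)
           .agg(**{'count': ('v', 'count'),
                   'latecount': ('v', lambda v: len(v[v > 1]))})
           .assign(**{'late%': lambda d: d['latecount'].div(d['count']).mul(100).round(2)})
           )
print(out_all)
```

With the sample data this should keep all five (n, m) groups, with the b/j and c/j rows showing a late% of 0.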
CodePudding user response:
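You can create a helper boolean column, later, that flags the rows where v is over 1. This improves performance because the comparison runs once on the whole column instead of filtering inside each group. To count the True values, aggregate with sum: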
dfnew = (df.assign(later=df['v'] > 1)
           .groupby(['n', 'm'], as_index=False)
           .agg(count=('v', 'count'),
                over1count=('later', 'sum'))
           .query('over1count > 0'))
dfnew['late%'] = (100 * dfnew['over1count'] / dfnew['count']).round(1)
print(dfnew)
   n  m  count  over1count  late%
0  a  f      2           2  100.0
1  a  j      3           2   66.7
3  c  f      1           1  100.0
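Because the mean of a boolean column is the fraction of True values, the percentage can also come straight out of the aggregation. A minimal sketch under that assumption; the names pct and late_pct are hypothetical (late_pct is used instead of 'late%' so it works as a keyword argument), and unlike the code above it keeps the groups with 0%:

```python
import pandas as pd

df = pd.DataFrame({'n': ['a', 'b', 'c', 'a', 'c', 'a', 'a', 'c', 'a'],
                   'm': ['j', 'j', 'j', 'j', 'j', 'j', 'f', 'f', 'f'],
                   'v': [1, 1, 1, 2, 1, 2, 3, 6, 5]})

# Flag rows over the threshold once, then aggregate the flag three ways:
# 'count' of v = rows per group, 'sum' of flag = rows over 1,
# 'mean' of flag = fraction of rows over 1.
pct = (df.assign(later=df['v'] > 1)
         .groupby(['n', 'm'], as_index=False)
         .agg(count=('v', 'count'),
              over1count=('later', 'sum'),
              late_pct=('later', 'mean')))

# Convert the fraction to a percentage, rounded as in the answer above.
pct['late_pct'] = (100 * pct['late_pct']).round(1)
print(pct)
```

If only the groups with at least one value over 1 are wanted, the same .query('over1count > 0') filter can be appended to the chain.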