Home > Software engineering >  Pandas groupby and aggregate on some operation between columns
Pandas groupby and aggregate on some operation between columns

Time:05-03

I'm an R and data.tableuser trying to use more Python and pandas, and when I want to do the following in pandas, I get so frustrated because there does not seem to be an easy way:

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [0.2, 0.2, 0.2, 0.1], 'c': [10, 9, 10, 22], 'd': [2, 3, 3, 1]})

In R if I wanted to sum the product of b and c and divide on the sum of d, grouped on a, I would just do:

df[, sum(b*c)/sum(d), by=a]

, but this seems hard for me to do elegantly in pandas. I mean no offence, please don't hate me, it is probably just me not understanding pandas yet.

Solution thus far:

df[["b", "c"]].prod(axis=1).groupby(df["a"]).sum()/df.groupby("a").d.sum()

CodePudding user response:

I would have done the same as you did.

If you want you can also use groupby.apply:

df.groupby('a').apply(lambda g: (g['b']*g['c']).sum()/g['d'].sum())

output:

a
1    0.76
2    1.05
dtype: float64

CodePudding user response:

If performance is important, avoid groupby.apply - create helper column new, aggregate both and last divide both columns:

s = (df.assign(new = df[["b", "c"]].prod(axis=1))
       .groupby("a")[['new','d']]
       .sum()
       .eval('new / d'))
print (s)
a
1    0.76
2    1.05
dtype: float64
  • Related