I'm an R
and data.table
user trying to use more Python
and pandas
, and when I want to do the following in pandas
, I get so frustrated because there does not seem to be an easy way:
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [0.2, 0.2, 0.2, 0.1], 'c': [10, 9, 10, 22], 'd': [2, 3, 3, 1]})
In R
if I wanted to sum the product of b
and c
and divide on the sum of d
, grouped on a
, I would just do:
df[, sum(b*c)/sum(d), by=a]
, but this seems hard for me to do elegantly in pandas
. I mean no offence, please don't hate me, it is probably just me not understanding pandas
yet.
Solution thus far:
df[["b", "c"]].prod(axis=1).groupby(df["a"]).sum()/df.groupby("a").d.sum()
CodePudding user response:
I would have done the same as you did.
If you want you can also use groupby.apply
:
df.groupby('a').apply(lambda g: (g['b']*g['c']).sum()/g['d'].sum())
output:
a
1 0.76
2 1.05
dtype: float64
CodePudding user response:
If performance is important, avoid groupby.apply
- create helper column new
, aggregate both and last divide both columns:
s = (df.assign(new = df[["b", "c"]].prod(axis=1))
.groupby("a")[['new','d']]
.sum()
.eval('new / d'))
print (s)
a
1 0.76
2 1.05
dtype: float64