I have the following pandas dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [np.random.randint(50, 200) for _ in range(4)],
        "no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
        "reliability_of_data": [np.random.rand() for _ in range(4)],
    }
)
# The dataframe looks like this.
  bird_type  avg_speed  no_of_birds_observed  reliability_of_data
0    falcon         66                     3             0.553841
1     crane        159                     8             0.472359
2     crane        158                     7             0.493193
3    falcon        161                     7             0.585865
Now I would like to compute the weighted average of the avg_speed and reliability_of_data columns, using no_of_birds_observed as the weights. For that I have a simple function that calculates the weighted average:
def func(data, numbers):
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a * b
    ans = ans / sum(numbers)
    return ans
How can I apply func to both the avg_speed and reliability_of_data columns?
I expect the answer to be a dataframe like the following:
  bird_type  avg_speed  no_of_birds_observed  reliability_of_data
0    falcon      132.5                    10            0.5762578
# how: (66*3 + 161*7)/(3 + 7)    (3 + 7)    (0.553841*3 + 0.585865*7)/(3 + 7)
1     crane     158.53                    15            0.4820815
# how: (159*8 + 158*7)/(8 + 7)    (8 + 7)    (0.472359*8 + 0.493193*7)/(8 + 7)
I saw this question, but could not generalize the solution or understand it completely. I considered not asking, but according to this blog post by SO and this meta question, a question with a different example can be treated as a "borderline duplicate". An answer will benefit me, and others will probably find it useful too, so I finally decided to ask.
CodePudding user response:
If you want to aggregate with GroupBy.agg, pass no_of_birds_observed as the weights parameter of np.average, selecting the rows of each group with DataFrame.loc:
# for correct output, the index must be the default one (or at least have unique values)
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
         .agg(avg=('avg_speed', f),
              no_of_birds=('no_of_birds_observed', 'sum'),
              reliability_of_data=('reliability_of_data', f)))
print(df1)
  bird_type         avg  no_of_birds  reliability_of_data
0    falcon  132.500000           10             0.576258
1     crane  158.533333           15             0.482082
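If the goal is to reuse the original func rather than np.average, one option is GroupBy.apply, which hands each group to a function as a DataFrame. This is only a minimal sketch: it assumes the fixed values printed in the question instead of the random data, restores the missing "+" in func, and uses a hypothetical helper named weighted_summary that is not part of the question.

```python
import pandas as pd

# Fixed values copied from the table printed in the question (assumption:
# the random generators are replaced so the numbers are reproducible).
df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [66, 159, 158, 161],
        "no_of_birds_observed": [3, 8, 7, 7],
        "reliability_of_data": [0.553841, 0.472359, 0.493193, 0.585865],
    }
)


def func(data, numbers):
    # Weighted average from the question, with the "+" restored.
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a * b
    return ans / sum(numbers)


def weighted_summary(group):
    # Hypothetical helper: builds one output row per bird_type.
    weights = group["no_of_birds_observed"]
    return pd.Series(
        {
            "avg_speed": func(group["avg_speed"], weights),
            "no_of_birds_observed": weights.sum(),
            "reliability_of_data": func(group["reliability_of_data"], weights),
        }
    )


cols = ["avg_speed", "no_of_birds_observed", "reliability_of_data"]
result = (
    df.groupby("bird_type", sort=False)[cols]  # pass only the value columns
    .apply(weighted_summary)
    .reset_index()
)
print(result)
```

Because each row is built as a single Series, no_of_birds_observed comes back as a float; cast it with astype(int) if that matters. On larger frames the np.average version is likely to be faster, since func loops in Python.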
CodePudding user response:
Don't use a function with apply; rather, perform a classical aggregation:
cols = ['avg_speed', 'reliability_of_data']
# multiply relevant columns by no_of_birds_observed
# aggregate everything as sum
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
       .combine_first(df)
       .groupby('bird_type').sum()
      )
# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)
Output:
            avg_speed  no_of_birds_observed  reliability_of_data
bird_type
crane      158.533333                    15             0.482082
falcon     132.500000                    10             0.576258
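This output is indexed by bird_type and sorted alphabetically, whereas the expected frame in the question keeps bird_type as a regular column in order of first appearance. A minimal sketch of the same aggregation with sort=False and as_index=False passed to groupby, assuming the fixed values from the question's printed table rather than the random data:

```python
import pandas as pd

# Fixed values copied from the table printed in the question (assumption).
df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [66, 159, 158, 161],
        "no_of_birds_observed": [3, 8, 7, 7],
        "reliability_of_data": [0.553841, 0.472359, 0.493193, 0.585865],
    }
)

cols = ["avg_speed", "reliability_of_data"]

# Multiply the value columns by the weights; take the other columns from df.
weighted = df[cols].mul(df["no_of_birds_observed"], axis=0).combine_first(df)

# sort=False keeps first-appearance order; as_index=False keeps bird_type
# as a column instead of moving it into the index.
out = weighted.groupby("bird_type", sort=False, as_index=False).sum()

# Divide the weighted sums by the total weight of each group.
out[cols] = out[cols].div(out["no_of_birds_observed"], axis=0)

# Restore the column order of the original frame.
out = out[list(df.columns)]
print(out)
```

The arithmetic is unchanged from the snippet above; only the groupby arguments and the final column reordering differ.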