Applying custom functions to groupby objects

I have the following pandas dataframe.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [np.random.randint(50, 200) for _ in range(4)],
        "no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
        "reliability_of_data": [np.random.rand() for _ in range(4)],
    }
)

# The dataframe looks like this. 
  bird_type  avg_speed  no_of_birds_observed  reliability_of_data
0    falcon         66                     3             0.553841
1     crane        159                     8             0.472359
2     crane        158                     7             0.493193
3    falcon        161                     7             0.585865

Now, I would like to compute, for each bird_type, the weighted average of avg_speed and reliability_of_data, using no_of_birds_observed as the weights. For that I have a simple function that calculates the weighted average:

def func(data, numbers):
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a*b
    ans = ans / sum(numbers)
    return ans

How can I apply func to both the avg_speed and reliability_of_data columns, grouped by bird_type?

I expect the result to be a dataframe like the following:

    bird_type   avg_speed        no_of_birds_observed  reliability_of_data
0   falcon      132.5                 10                   0.5762578
# how       (66*3 + 161*7)/(3 + 7)    (3 + 7)     (0.553841*3 + 0.585865*7)/(3 + 7)
1   crane       158.53                15                   0.4820815
# how      (159*8 + 158*7)/(8 + 7)    (8 + 7)     (0.472359*8 + 0.493193*7)/(8 + 7)
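As a sanity check, the expected numbers can be reproduced by calling the weighted-average function directly on each group's values. This is only a sketch: the values are hard-coded from the printed table (the original frame is random), and func is restated here so the snippet runs on its own.

```python
def func(data, numbers):
    # weighted mean: sum(value * weight) / sum(weights)
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a * b
    return ans / sum(numbers)

# values copied from the table in the question (assumed, since the data is random)
falcon_speed = func([66, 161], [3, 7])
crane_speed = func([159, 158], [8, 7])
falcon_rel = func([0.553841, 0.585865], [3, 7])
crane_rel = func([0.472359, 0.493193], [8, 7])

print(falcon_speed, crane_speed, falcon_rel, crane_rel)
```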

I saw this question, but could not generalize the solution or understand it completely. I considered not asking, but according to this blog post by SO and this meta question, a question with a different example can be considered a "borderline duplicate". An answer will benefit me, and probably some others will also find it useful, so I finally decided to ask.

CodePudding user response：

If you want to aggregate with GroupBy.agg, use np.average and pass no_of_birds_observed as the weights parameter, selecting the rows of the current group with DataFrame.loc:

# for correct output the index must be the default one (or at least have unique values)
df = df.reset_index(drop=True)

# x is one column of one group; x.index selects the matching weights from df
f = lambda x: np.average(x, weights=df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
          .agg(avg=('avg_speed', f),
               no_of_birds=('no_of_birds_observed', 'sum'),
               reliability_of_data=('reliability_of_data', f)))
print(df1)
  bird_type         avg  no_of_birds  reliability_of_data
0    falcon  132.500000           10             0.576258
1     crane  158.533333           15             0.482082
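The reason this works is that each column reaches the aggregation function as a Series that still carries the original row labels, so df.loc[x.index, ...] picks the weights for exactly those rows. Below is a minimal reproducible sketch of the same idea. The data is hard-coded from the question's table (an assumption, since the original frame is random), and wavg is just an illustrative name for a named version of the lambda.

```python
import numpy as np
import pandas as pd

# values copied from the question's printed table (assumed)
df = pd.DataFrame({
    "bird_type": ["falcon", "crane", "crane", "falcon"],
    "avg_speed": [66, 159, 158, 161],
    "no_of_birds_observed": [3, 8, 7, 7],
    "reliability_of_data": [0.553841, 0.472359, 0.493193, 0.585865],
})

def wavg(x):
    # x is one column of one group; its index holds the original row labels
    weights = df.loc[x.index, "no_of_birds_observed"]
    return np.average(x, weights=weights)

df1 = df.groupby("bird_type", sort=False, as_index=False).agg(
    avg_speed=("avg_speed", wavg),
    no_of_birds_observed=("no_of_birds_observed", "sum"),
    reliability_of_data=("reliability_of_data", wavg),
)
print(df1)
```

Note that the lookup depends on unique index labels, which is why the answer resets the index first.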

CodePudding user response：

Don't use a function with apply; instead, perform a classical vectorized aggregation:

cols = ['avg_speed', 'reliability_of_data']

# multiply relevant columns by no_of_birds_observed
# aggregate everything as sum
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
       .combine_first(df)
       .groupby('bird_type').sum()
      )

# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)

Output:

            avg_speed  no_of_birds_observed  reliability_of_data
bird_type                                                       
crane      158.533333                    15             0.482082
falcon     132.500000                    10             0.576258
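This relies on the identity that a weighted mean is sum(value * weight) / sum(weight), so both sums can come from a single plain groupby sum. Here is a step-by-step sketch of the same idea that uses join instead of combine_first. That is a variation for readability, not the answer's exact code, and the data is again hard-coded from the question's table as an assumption.

```python
import pandas as pd

# values copied from the question's printed table (assumed)
df = pd.DataFrame({
    "bird_type": ["falcon", "crane", "crane", "falcon"],
    "avg_speed": [66, 159, 158, 161],
    "no_of_birds_observed": [3, 8, 7, 7],
    "reliability_of_data": [0.553841, 0.472359, 0.493193, 0.585865],
})

cols = ["avg_speed", "reliability_of_data"]

# step 1: value * weight for every row
weighted = df[cols].mul(df["no_of_birds_observed"], axis=0)

# step 2: per-group sums of the weighted values and of the weights
sums = (weighted.join(df[["bird_type", "no_of_birds_observed"]])
        .groupby("bird_type").sum())

# step 3: divide the weighted sums by the summed weights
out = sums[cols].div(sums["no_of_birds_observed"], axis=0)
out["no_of_birds_observed"] = sums["no_of_birds_observed"]
print(out)
```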