Home > Enterprise >  Pandas groupby and apply aggregate function across rows
Pandas groupby and apply aggregate function across rows

Time:02-22

I'm having difficulties applying customs functions to a groupby operation in pandas. Let's suppose that I have the following DataFrame to work with:

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "flag": ["A", "A", "B", "B"],
        "value1": [520, 250, 180, 360],
        "value2": [11, 5, 7, 2],
    }
)

print(df)

   id flag  value1  value2
0   1    A     520      11
1   1    A     250       5
2   2    B     180       7
3   2    B     360       2

I need to apply 4 aggregate functions to the above DataFrame grouped by id and flag. Specifically, for each id and flag:

  1. Calculate the mean of value1;
  2. Calculate the sum of value2;
  3. Calculate the mean of (value1 * value2) / 12;
  4. Calculate the sum of (value1 / value2).

I don't have any issues with the first two. This is what I did to calculate them:

df.groupby(["id", "flag"]).agg({"value1": ["mean"], "value2": ["sum"]})

        value1 value2
          mean    sum
id flag              
1  A     385.0     16
2  B     270.0      9

My problems are related to the last two aggregates. I search here for similar problems and I think I need to create two custom functions and apply them to the groupby object. Unfortunately, all my attempts failed and I wasn't able to work this out. Also, if possible, I want all results to be in a single DataFrame like below (hopefully, I've calculated the numbers correctly):

        value1 value2
          mean    sum     func1     func2
id flag              
1  A     385.0     16    290.42     97.27
2  B     270.0      9      82.5    205.71

CodePudding user response:

groupby().agg. only takes in values of one columns.

With custom functions involving several columns, I would do something like this:

groupby = df.groupby(['id','flag'])

out = pd.DataFrame({
    'value1': groupby['value1'].mean(),
    'value2': groupby['value2'].sum(),
    'value3': groupby.apply(lambda x: (x['value1'] * x['value2']).mean()/12),
    'value4': groupby.apply(lambda x: (x['value1']/x['value2']).sum())
})

Output:

         value1  value2      value3      value4
id flag                                        
1  A        385      16  290.416667   97.272727
2  B        270       9   82.500000  205.714286

CodePudding user response:

What I'd do in this case is store (value1 * value2) / 12 and (value1 / value2) in temporary columns, and then aggregate:

agg_funcs = {
    'value1': 'mean',
    'value2': 'sum',
    'func1': 'mean',
    'func2': 'sum',
}

agg = (
    df.assign(
        func1=lambda x: (x['value1'] * x['value2']) / 12,
        func2=lambda x: (x['value1'] / x['value2'])
    )
    .groupby(["id", "flag"])
    .agg(agg_funcs)
)

Output:

>>> agg
         value1  value2       func1       func2
id flag                                        
1  A      385.0      16  290.416667   97.272727
2  B      270.0       9   82.500000  205.714286

Obligatory one-liner:

df.assign(func1=lambda x: (x.value1 * x.value2) / 12, func2=lambda x: (x.value1 / x.value2)).groupby(["id","flag"]).agg({'value1': 'mean', 'value2': 'sum', 'func1': 'mean', 'func2': 'sum'})
  • Related