Home > other >  Pandas groupby.agg with condition
Pandas groupby.agg with condition

Time:01-22

I have a pandas data frame similar to this:

name sales profit profit_flag
Joe 200 100 True
Joe 300 150 False
Mark 200 100 True
Mark 300 150 True
Judy 300 150 False

The actual data frame has 100 columns.

The idea is: I want to group by name, and aggregate all the columns. However, certain columns depend on a flag. In this case, sales will be aggregated no matter what, but profit should be included in the aggregation only if profit_flag is True.

It should look like this if we use sum:

name sales profit
Joe 500 100
Judy 300 Nan
Mark 500 250

Is there anyway I can do this from one line using df.groupby('name').agg()?

Right now I'm using:

grouped = pd.DataFrame()
grouped['sales'] = df.groupby('name').sales.sum()
grouped['profit'] = df[df.profit_flag].groupby('name').profit.sum()

I'm getting the correct result, but since the actual data frame has many more columns, I wanted to know if I could somehow write something like this to avoid the clutter:

grouped = df.groupby('name').agg({
          'sales': 'sum',
          'profit' 'sum' #if profit_flag })

Is this even possible or should I just group 'flag dependent columns' in separate statemetns?

CodePudding user response:

You can mask the values prior to aggregation:

(df.assign(profit=lambda d: d['profit'].where(d['profit_flag']))
   .groupby('name', as_index=False)[['sales', 'profit']].sum(min_count=1)
)

Output:

   name  sales  profit
0   Joe    500   100.0
1  Judy    300     NaN
2  Mark    500   250.0
  • Related