I have a pandas data frame similar to this:
name | sales | profit | profit_flag |
---|---|---|---|
Joe | 200 | 100 | True |
Joe | 300 | 150 | False |
Mark | 200 | 100 | True |
Mark | 300 | 150 | True |
Judy | 300 | 150 | False |
The actual data frame has 100 columns.
The idea is: I want to group by name, and aggregate all the columns. However, certain columns depend on a flag. In this case, sales
will be aggregated no matter what, but profit
should be included in the aggregation only if profit_flag
is True.
It should look like this if we use sum
:
name | sales | profit |
---|---|---|
Joe | 500 | 100 |
Judy | 300 | Nan |
Mark | 500 | 250 |
Is there anyway I can do this from one line using df.groupby('name').agg()
?
Right now I'm using:
grouped = pd.DataFrame()
grouped['sales'] = df.groupby('name').sales.sum()
grouped['profit'] = df[df.profit_flag].groupby('name').profit.sum()
I'm getting the correct result, but since the actual data frame has many more columns, I wanted to know if I could somehow write something like this to avoid the clutter:
grouped = df.groupby('name').agg({
'sales': 'sum',
'profit' 'sum' #if profit_flag })
Is this even possible or should I just group 'flag dependent columns' in separate statemetns?
CodePudding user response:
You can mask the values prior to aggregation:
(df.assign(profit=lambda d: d['profit'].where(d['profit_flag']))
.groupby('name', as_index=False)[['sales', 'profit']].sum(min_count=1)
)
Output:
name sales profit
0 Joe 500 100.0
1 Judy 300 NaN
2 Mark 500 250.0