How to do a custom Group By?-CodePudding

My goal is to group a data frame DF by values of column Name and aggregate specific column as sum.

Current data frame

	Name	Val1	val2	val3
0	Test	NaN	5	NaN
1	Test	30	NaN	3
2	Test	30	NaN	3

Output excepted

	Name	Val1	val2	val3
0	Test	60	5	3

What I tried

DF.groupby(['Name'], as_index=False)[["Val1"]].sum()

returns

	Name	Val1
0	Test	60

Issue

I want to take val2 and val3 as unique values and then group them but I don't know how to do so.

Maybe introducing an intermediary DF

	Name	Val1	val2	val3
0	Test	NaN	5	3
1	Test	30	5	3
2	Test	30	5	3

so that following code can work:

DF.groupby(['Name','val2','val3'], as_index=False)[["Val1"]].sum()

Keep in mind that my data frame has several values for Name in it.

What is the best way to do ?

CodePudding user response：

If I understand correctly, there is only one unique non-missing value in each of the val2 and val3 columns per group. Otherwise your question does not make much sense, because you did not specify how to decide which value to take from these columns.

Given these constraints, you can use:

result = df.groupby('Name', as_index=False).agg({'Val1': 'sum', 'val2': 'first', 'val3': 'first'})

CodePudding user response：

To group by one or multiple columns while aggregating others you can use groupBy followed by aggregate (or its alias agg).

Example

Given input:

  Waiter  Revenue  Hours   Tables Gender
0  Alice     3000    3.0  Outside      f
1    Bob     2000    4.0   Inside      m
2   Alex     1000    2.0   Inside      d
3   Alex      500    0.5  Outside   None

Expected output:

        Revenue  Hours             Tables Gender
Waiter                                          
Alex       1500   1.25  {Inside, Outside}      d
Alice      3000   3.00          {Outside}      f
Bob        2000   4.00           {Inside}      m

Grouped by Waiter shows:

the sum of Revenue
the mean of worked Hours
the set of unique Tables served
the first (non undefined) value for Gender

How to group-by with specific aggregation

Code:

import pandas as pd

df = pd.DataFrame({'Waiter': ['Alice','Bob','Alex', 'Alex'], 'Revenue': [3000, 2000, 1000, 500], 'Hours': [3, 4, 2, 0.5], 'Tables': ['Outside', 'Inside', 'Inside', 'Outside'], 'Gender': ['f', 'm', 'd', None]})

df.groupby(['Waiter']).agg({'Revenue': 'sum', 'Hours': 'mean', 'Tables': lambda x: set(x), 'Gender': 'first'})

Explained:

groupby can be done be multiple of a single column, here only Waiter
agg (or aggregate) by a dict defining the aggregation for each column. The column is specified as key like Revenue with the aggregate function specified either as function name 'sum' (in quotes or as reference like list) or as lambda like lambda x: set(x)

Note: to get the list of Tables we could also define following value as aggregation function:

'unique' produces a list with unique values (see Series.unique)
set for a set as we did with the lambda
list for a list (which may contain duplicates)