Home > Back-end >  How to do a custom Group By?
How to do a custom Group By?

Time:04-30

My goal is to group a data frame DF by values of column Name and aggregate specific column as sum.

Current data frame

Name Val1 val2 val3
0 Test NaN 5 NaN
1 Test 30 NaN 3
2 Test 30 NaN 3

Output excepted

Name Val1 val2 val3
0 Test 60 5 3

What I tried

DF.groupby(['Name'], as_index=False)[["Val1"]].sum()

returns

Name Val1
0 Test 60

Issue

I want to take val2 and val3 as unique values and then group them but I don't know how to do so.

Maybe introducing an intermediary DF

Name Val1 val2 val3
0 Test NaN 5 3
1 Test 30 5 3
2 Test 30 5 3

so that following code can work:

DF.groupby(['Name','val2','val3'], as_index=False)[["Val1"]].sum()

Keep in mind that my data frame has several values for Name in it.

What is the best way to do ?

CodePudding user response:

If I understand correctly, there is only one unique non-missing value in each of the val2 and val3 columns per group. Otherwise your question does not make much sense, because you did not specify how to decide which value to take from these columns.

Given these constraints, you can use:

result = df.groupby('Name', as_index=False).agg({'Val1': 'sum', 'val2': 'first', 'val3': 'first'})

CodePudding user response:

To group by one or multiple columns while aggregating others you can use groupBy followed by aggregate (or its alias agg).

Example

Given input:

  Waiter  Revenue  Hours   Tables Gender
0  Alice     3000    3.0  Outside      f
1    Bob     2000    4.0   Inside      m
2   Alex     1000    2.0   Inside      d
3   Alex      500    0.5  Outside   None

Expected output:

        Revenue  Hours             Tables Gender
Waiter                                          
Alex       1500   1.25  {Inside, Outside}      d
Alice      3000   3.00          {Outside}      f
Bob        2000   4.00           {Inside}      m

Grouped by Waiter shows:

  • the sum of Revenue
  • the mean of worked Hours
  • the set of unique Tables served
  • the first (non undefined) value for Gender

How to group-by with specific aggregation

Code:

import pandas as pd

df = pd.DataFrame({'Waiter': ['Alice','Bob','Alex', 'Alex'], 'Revenue': [3000, 2000, 1000, 500], 'Hours': [3, 4, 2, 0.5], 'Tables': ['Outside', 'Inside', 'Inside', 'Outside'], 'Gender': ['f', 'm', 'd', None]})

df.groupby(['Waiter']).agg({'Revenue': 'sum', 'Hours': 'mean', 'Tables': lambda x: set(x), 'Gender': 'first'})

Explained:

  1. groupby can be done be multiple of a single column, here only Waiter
  2. agg (or aggregate) by a dict defining the aggregation for each column. The column is specified as key like Revenue with the aggregate function specified either as function name 'sum' (in quotes or as reference like list) or as lambda like lambda x: set(x)

Note: to get the list of Tables we could also define following value as aggregation function:

  • 'unique' produces a list with unique values (see Series.unique)
  • set for a set as we did with the lambda
  • list for a list (which may contain duplicates)

See also

  • Related