My goal is to group a data frame DF
by values of column Name
and aggregate specific column as sum.
Current data frame
Name | Val1 | val2 | val3 | |
---|---|---|---|---|
0 | Test | NaN | 5 | NaN |
1 | Test | 30 | NaN | 3 |
2 | Test | 30 | NaN | 3 |
Output excepted
Name | Val1 | val2 | val3 | |
---|---|---|---|---|
0 | Test | 60 | 5 | 3 |
What I tried
DF.groupby(['Name'], as_index=False)[["Val1"]].sum()
returns
Name | Val1 | |
---|---|---|
0 | Test | 60 |
Issue
I want to take val2
and val3
as unique values and then group them but I don't know how to do so.
Maybe introducing an intermediary DF
Name | Val1 | val2 | val3 | |
---|---|---|---|---|
0 | Test | NaN | 5 | 3 |
1 | Test | 30 | 5 | 3 |
2 | Test | 30 | 5 | 3 |
so that following code can work:
DF.groupby(['Name','val2','val3'], as_index=False)[["Val1"]].sum()
Keep in mind that my data frame has several values for Name
in it.
What is the best way to do ?
CodePudding user response:
If I understand correctly, there is only one unique non-missing value in each of the val2 and val3 columns per group. Otherwise your question does not make much sense, because you did not specify how to decide which value to take from these columns.
Given these constraints, you can use:
result = df.groupby('Name', as_index=False).agg({'Val1': 'sum', 'val2': 'first', 'val3': 'first'})
CodePudding user response:
To group by one or multiple columns while aggregating others you can use groupBy
followed by aggregate
(or its alias agg
).
Example
Given input:
Waiter Revenue Hours Tables Gender
0 Alice 3000 3.0 Outside f
1 Bob 2000 4.0 Inside m
2 Alex 1000 2.0 Inside d
3 Alex 500 0.5 Outside None
Expected output:
Revenue Hours Tables Gender
Waiter
Alex 1500 1.25 {Inside, Outside} d
Alice 3000 3.00 {Outside} f
Bob 2000 4.00 {Inside} m
Grouped by Waiter
shows:
- the sum of
Revenue
- the mean of worked
Hours
- the set of unique
Tables
served - the first (non undefined) value for
Gender
How to group-by with specific aggregation
Code:
import pandas as pd
df = pd.DataFrame({'Waiter': ['Alice','Bob','Alex', 'Alex'], 'Revenue': [3000, 2000, 1000, 500], 'Hours': [3, 4, 2, 0.5], 'Tables': ['Outside', 'Inside', 'Inside', 'Outside'], 'Gender': ['f', 'm', 'd', None]})
df.groupby(['Waiter']).agg({'Revenue': 'sum', 'Hours': 'mean', 'Tables': lambda x: set(x), 'Gender': 'first'})
Explained:
groupby
can be done be multiple of a single column, here onlyWaiter
agg
(oraggregate
) by a dict defining the aggregation for each column. The column is specified as key likeRevenue
with the aggregate function specified either as function name'sum'
(in quotes or as reference likelist
) or as lambda likelambda x: set(x)
Note: to get the list of Tables
we could also define following value as aggregation function:
'unique'
produces a list with unique values (seeSeries.unique
)set
for a set as we did with the lambdalist
for a list (which may contain duplicates)