Assuming I'm dealing with this dataframe:
ID | Qualified | Year | Amount A | Amount B |
---|---|---|---|---|
1 | No | 2020 | 0 | 150 |
1 | No | 2019 | 0 | 100 |
1 | Yes | 2019 | 10 | 15 |
1 | No | 2018 | 0 | 100 |
1 | Yes | 2018 | 10 | 150 |
2 | Yes | 2020 | 0 | 200 |
2 | No | 2017 | 0 | 100 |
... | ... | ... | ... |
My desired output should be like this:
ID | Qualified | Year | Amount A | Amount B |
---|---|---|---|---|
1 | No | 2020 | 0 | 150 |
1 | Partial | 2019 | 10 | 115 |
1 | Partial | 2018 | 10 | 250 |
2 | Yes | 2020 | 0 | 200 |
2 | No | 2017 | 0 | 100 |
... | ... | ... | ... |
As you can see, Qualified column creates new merged values (Yes & No -> Partial, amount A B ) from a condition: a year in an ID includes both Yes and No in Qualified column.
Don't know how to approach it. Anyone could provide any methodology?
CodePudding user response:
You can use the function agg()
and groupby()
to perform this operation.
agg()
allows you to use not only common aggregation functions (such as sum
, mean
, etc.) but also custom defined functions.
I would do as follows:
def agg_qualify(x):
values = x.unique()
if len(x)>1:
return 'Partial'
return values[0]
df.groupby(['ID', 'Year']).agg({
'Qualified': lambda x: agg_qualify(x),
'Amount A': 'sum',
'Amount B': 'sum',
}).reset_index()
Output:
ID Year Qualified Amount A Amount B
0 1 2018 Partial 10 250.0
1 1 2019 Partial 10 115.0
2 1 2020 No 0 150.0
3 2 2020 Yes 0 200.0