I have a dataframe given as below:
ID YEAR NPS
500 2020 0
500 2021 0
500 2022 0
501 2020 32
501 2021 52
501 2022 99
503 2021 1
503 2022 4
504 2020 45
504 2021 55
504 2022 50
I have to calculate year over year difference as given below:
ID YEAR NPS nps_gain_yoy
500 2020 0 0
500 2021 0 0
500 2022 0 0
501 2020 32 0
501 2021 52 20
501 2022 99 47
503 2021 1 0
503 2022 4 3
504 2020 45 0
504 2021 55 10
504 2022 50 -5
In above output for starting year 2020 or first occurance of Id nps_gain_yoy needs to be zero then for 2021 nps_gain_yoy is difference between nps of 2021 and 2020 i.e 52-32 = 20 as shown in output for ID 501 for year 2021 and so on. After this I need to pick the maximum difference or maximum nps_gain_yoy for each ID as given in below output:
ID YEAR NPS NPS_gain_yoy
501 2022 0 0
501 2022 99 47
503 2022 4 3
504 2021 55 10
Here 47 is the maximum nps gain for ID 501 in year 2022 similarly 3 for ID 503 and 4 for Id 504.
Please help I need it urgently.
CodePudding user response:
If years are consecutive per ID
first use DataFrameGroupBy.diff
:
df = df.sort_values(['ID','YEAR'])
df['nps_gain_yoy'] = df.groupby('ID')['NPS'].diff().fillna(0)
print (df)
ID YEAR NPS nps_gain_yoy
0 500 2020 0 0.0
1 500 2021 0 0.0
2 500 2022 0 0.0
3 501 2020 32 0.0
4 501 2021 52 20.0
5 501 2022 99 47.0
6 503 2021 1 0.0
7 503 2022 4 3.0
8 504 2020 45 0.0
9 504 2021 55 10.0
10 504 2022 50 -5.0
And then DataFrameGroupBy.idxmax
with DataFrame.loc
:
df1 = df.loc[df.iloc[::-1].groupby('ID')['nps_gain_yoy'].idxmax()]
#alternative solution
#df1 = df.sort_values(['ID','nps_gain_yoy']).drop_duplicates('ID', keep='last')
print (df1)
ID YEAR NPS nps_gain_yoy
2 500 2022 0 0.0
5 501 2022 99 47.0
7 503 2022 4 3.0
9 504 2021 55 10.0