Home > Mobile >  Year over Year difference and selecting maximum row in pandas
Year over Year difference and selecting maximum row in pandas

Time:08-09

I have a dataframe given as below:

ID YEAR     NPS
500 2020     0
500 2021     0
500 2022     0
501 2020    32      
501 2021    52      
501 2022    99      
503 2021    1       
503 2022    4       
504 2020    45      
504 2021    55      
504 2022    50      

I have to calculate year over year difference as given below:

ID      YEAR    NPS  nps_gain_yoy
500     2020     0      0
500     2021     0      0
500     2022     0      0
501     2020    32      0
501     2021    52      20
501     2022    99      47
503     2021    1       0
503     2022    4       3
504     2020    45      0
504     2021    55      10
504     2022    50      -5

In above output for starting year 2020 or first occurance of Id nps_gain_yoy needs to be zero then for 2021 nps_gain_yoy is difference between nps of 2021 and 2020 i.e 52-32 = 20 as shown in output for ID 501 for year 2021 and so on. After this I need to pick the maximum difference or maximum nps_gain_yoy for each ID as given in below output:

ID  YEAR    NPS NPS_gain_yoy
501 2022     0    0
501 2022    99   47
503 2022    4    3
504 2021    55   10

Here 47 is the maximum nps gain for ID 501 in year 2022 similarly 3 for ID 503 and 4 for Id 504.

Please help I need it urgently.

CodePudding user response:

If years are consecutive per ID first use DataFrameGroupBy.diff:

df = df.sort_values(['ID','YEAR'])

df['nps_gain_yoy'] = df.groupby('ID')['NPS'].diff().fillna(0)
print (df)
     ID  YEAR  NPS  nps_gain_yoy
0   500  2020    0           0.0
1   500  2021    0           0.0
2   500  2022    0           0.0
3   501  2020   32           0.0
4   501  2021   52          20.0
5   501  2022   99          47.0
6   503  2021    1           0.0
7   503  2022    4           3.0
8   504  2020   45           0.0
9   504  2021   55          10.0
10  504  2022   50          -5.0

And then DataFrameGroupBy.idxmax with DataFrame.loc:

df1 = df.loc[df.iloc[::-1].groupby('ID')['nps_gain_yoy'].idxmax()]
#alternative solution
#df1 = df.sort_values(['ID','nps_gain_yoy']).drop_duplicates('ID', keep='last')
print (df1)
    ID  YEAR  NPS  nps_gain_yoy
2  500  2022    0           0.0
5  501  2022   99          47.0
7  503  2022    4           3.0
9  504  2021   55          10.0
  • Related