Updating a column in a dataframe with latest value from the latest year-CodePudding

Lets say I have a dataframe:

df = 
|ID  | year | value    |
|----|------|----------|
|123 | 2011 | Mango    |
|232 | 2010 | Pineapple|
|123 | 2022 | Orange   |
|232 | 2021 | Apple    |
|221 | 2021 | Banana   |

I want to update the dataframe value with the latest years value. I am expecting a final df as:

|ID  | year | value    |
|----|------|----------|
|123 | 2011 | Orange   |
|232 | 2010 | Apple    |
|123 | 2022 | Orange   |
|232 | 2021 | Apple    |
|221 | 2021 | Banana   |

Basically we want to update the values with the latest year's values. So in this case, id - 123 is appearing twice in the same df. They both have different values "Mango" in 2011 and "Orange" in 2022. We wish to have a new df created with same columns and same repetitions but with latest year's values.

I need this to be done without using any loops as the originial df is extremely huge and using any loop is taking huge time to run

CodePudding user response：

You need to use 'Rank' & 'Merge' as below, gives required output

df = pd.DataFrame({'ID':[123,232,123,232,221],'Year':[2011,2010,2022,2021,2021],'Value':['Mango','Pineapple','Orange','Apple','Banana']})
df['ID_Year_Rank'] = df.groupby(['ID'])['Year'].rank(method='first', ascending=False)
df

This will add a rank == 1 to each row where year is latest in every ID

After this simple merge with itself based on filtered values give required result

pd.merge(df[['ID','Year']], df[df['ID_Year_Rank']==1][['ID','Value']], left_on='ID', right_on = 'ID')

CodePudding user response：

Try this. Use the indices of each ID's most recent year to index value column with it using loc[] accessor.

# indices of last years of each ID
indx = df.groupby('ID')['year'].transform('idxmax')
# assign values corresponding to the last years back to value
df['value'] = df.loc[indx, 'value'].tolist()
df