Returning most recent row with certain values in Pandas-CodePudding

I have a dataframe sorted by ID and in descending order of Date in Pandas that looks like

ID  Date       A  Salary
1   2022-12-01 2  100
1   2022-11-11 3  200
1   2022-10-25 1  150
1   2022-05-17 4  160
2   2022-12-01 2  170
2   2022-11-19 1  220
2   2022-10-10 1  160
3   2022-11-11 3  350
3   2022-09-11 1  200
3   2022-08-19 3  160
3   2022-07-20 3  190
3   2022-05-11 3  200

I would like to add a new column Salary_argmin_recent_A that outputs the most recent Salary row of a specific ID such that A=1, so the desired output looks like

ID  Date       A  Salary Salary_argmin_recent_A
1   2022-12-01 2  100    150 (most recent salary such that A=1 is 2022-10-25)
1   2022-11-11 3  200    150 (most recent salary such that A=1 is 2022-10-25)
1   2022-10-25 1  150    NaN (no rows before with A=1 for ID 1)
1   2022-05-17 4  160    NaN (no rows before with A=1 for ID 1)
2   2022-12-01 2  170    220
2   2022-11-19 1  220    160
2   2022-10-10 1  160    NaN
3   2022-11-11 3  350    200
3   2022-09-11 1  200    NaN
3   2022-08-19 3  160    NaN
3   2022-07-20 3  190    NaN
3   2022-05-11 3  200    NaN

Thanks in advance.

CodePudding user response：

s1 = df['Salary'].where(df['A'].eq(1)).groupby(df['ID']).bfill()
s2 = df.groupby(['ID', 'A'])['Salary'].shift(-1)
out = df.assign(Salary_argmin_recent_A=s1.mask(df['A'].eq(1), s2))

out

    ID  Date        A   Salary  Salary_argmin_recent_A
0   1   2022-12-01  2   100     150.0
1   1   2022-11-11  3   200     150.0
2   1   2022-10-25  1   150     NaN
3   1   2022-05-17  4   160     NaN
4   2   2022-12-01  2   170     220.0
5   2   2022-11-19  1   220     160.0
6   2   2022-10-10  1   160     NaN
7   3   2022-11-11  3   350     200.0
8   3   2022-09-11  1   200     NaN
9   3   2022-08-19  3   160     NaN
10  3   2022-07-20  3   190     NaN
11  3   2022-05-11  3   200     NaN

CodePudding user response：

Given you're looking for the values below, the first thing that comes to mind is to iterate through rows. Usually iteration should be escaped and there might be a more elegant solution to this, but at least it works.

import pandas as pd
df = pd.read_clipboard()

new_col = []
for index, row in df.iterrows():
    df_below = df.iloc[index 1:]
    match = df_below[(df_below.A == 1) & (df_below.ID == row.ID)].Salary
    if match.any():
        new_col.append(match.iloc[0])
    else:
        new_col.append(None)
df['Salary_argmin_recent_A'] = new_col
print(df)

Output:

    ID        Date  A  Salary  Salary_argmin_recent_A
0    1  2022-12-01  2     100                   150.0
1    1  2022-11-11  3     200                   150.0
2    1  2022-10-25  1     150                     NaN
3    1  2022-05-17  4     160                     NaN
4    2  2022-12-01  2     170                   220.0
5    2  2022-11-19  1     220                   160.0
6    2  2022-10-10  1     160                     NaN
7    3  2022-11-11  3     350                   200.0
8    3  2022-09-11  1     200                     NaN
9    3  2022-08-19  3     160                     NaN
10   3  2022-07-20  3     190                     NaN
11   3  2022-05-11  3     200                     NaN

Here I iterate through all the rows, on each iteration I take the index value and store a cut of DataFrame in df_below which represents all the rows below current row.
I than filter values by A and ID, results may contain nothing, one or more values. So I check if there are some results and if true, I take the first value, the most recent one.