Home > Net >  Returning most recent row with certain values in Pandas
Returning most recent row with certain values in Pandas

Time:12-29

I have a dataframe sorted by ID and in descending order of Date in Pandas that looks like

ID  Date       A  Salary
1   2022-12-01 2  100
1   2022-11-11 3  200
1   2022-10-25 1  150
1   2022-05-17 4  160
2   2022-12-01 2  170
2   2022-11-19 1  220
2   2022-10-10 1  160
3   2022-11-11 3  350
3   2022-09-11 1  200
3   2022-08-19 3  160
3   2022-07-20 3  190
3   2022-05-11 3  200

I would like to add a new column Salary_argmin_recent_A that outputs the most recent Salary row of a specific ID such that A=1, so the desired output looks like

ID  Date       A  Salary Salary_argmin_recent_A
1   2022-12-01 2  100    150 (most recent salary such that A=1 is 2022-10-25)
1   2022-11-11 3  200    150 (most recent salary such that A=1 is 2022-10-25)
1   2022-10-25 1  150    NaN (no rows before with A=1 for ID 1)
1   2022-05-17 4  160    NaN (no rows before with A=1 for ID 1)
2   2022-12-01 2  170    220
2   2022-11-19 1  220    160
2   2022-10-10 1  160    NaN
3   2022-11-11 3  350    200
3   2022-09-11 1  200    NaN
3   2022-08-19 3  160    NaN
3   2022-07-20 3  190    NaN
3   2022-05-11 3  200    NaN

Thanks in advance.

CodePudding user response:

s1 = df['Salary'].where(df['A'].eq(1)).groupby(df['ID']).bfill()
s2 = df.groupby(['ID', 'A'])['Salary'].shift(-1)
out = df.assign(Salary_argmin_recent_A=s1.mask(df['A'].eq(1), s2))

out

    ID  Date        A   Salary  Salary_argmin_recent_A
0   1   2022-12-01  2   100     150.0
1   1   2022-11-11  3   200     150.0
2   1   2022-10-25  1   150     NaN
3   1   2022-05-17  4   160     NaN
4   2   2022-12-01  2   170     220.0
5   2   2022-11-19  1   220     160.0
6   2   2022-10-10  1   160     NaN
7   3   2022-11-11  3   350     200.0
8   3   2022-09-11  1   200     NaN
9   3   2022-08-19  3   160     NaN
10  3   2022-07-20  3   190     NaN
11  3   2022-05-11  3   200     NaN

CodePudding user response:

Given you're looking for the values below, the first thing that comes to mind is to iterate through rows. Usually iteration should be escaped and there might be a more elegant solution to this, but at least it works.

import pandas as pd
df = pd.read_clipboard()

new_col = []
for index, row in df.iterrows():
    df_below = df.iloc[index 1:]
    match = df_below[(df_below.A == 1) & (df_below.ID == row.ID)].Salary
    if match.any():
        new_col.append(match.iloc[0])
    else:
        new_col.append(None)
df['Salary_argmin_recent_A'] = new_col
print(df)
Output:
    ID        Date  A  Salary  Salary_argmin_recent_A
0    1  2022-12-01  2     100                   150.0
1    1  2022-11-11  3     200                   150.0
2    1  2022-10-25  1     150                     NaN
3    1  2022-05-17  4     160                     NaN
4    2  2022-12-01  2     170                   220.0
5    2  2022-11-19  1     220                   160.0
6    2  2022-10-10  1     160                     NaN
7    3  2022-11-11  3     350                   200.0
8    3  2022-09-11  1     200                     NaN
9    3  2022-08-19  3     160                     NaN
10   3  2022-07-20  3     190                     NaN
11   3  2022-05-11  3     200                     NaN

Here I iterate through all the rows, on each iteration I take the index value and store a cut of DataFrame in df_below which represents all the rows below current row.
I than filter values by A and ID, results may contain nothing, one or more values. So I check if there are some results and if true, I take the first value, the most recent one.

  • Related