I have a dataframe sorted by ID
and in descending order of Date in Pandas that looks like
ID Date A Salary
1 2022-12-01 2 100
1 2022-11-11 3 200
1 2022-10-25 1 150
1 2022-05-17 4 160
2 2022-12-01 2 170
2 2022-11-19 1 220
2 2022-10-10 1 160
3 2022-11-11 3 350
3 2022-09-11 1 200
3 2022-08-19 3 160
3 2022-07-20 3 190
3 2022-05-11 3 200
I would like to add a new column Salary_argmin_recent_A
that outputs the most recent Salary row of a specific ID such that A=1, so the desired output looks like
ID Date A Salary Salary_argmin_recent_A
1 2022-12-01 2 100 150 (most recent salary such that A=1 is 2022-10-25)
1 2022-11-11 3 200 150 (most recent salary such that A=1 is 2022-10-25)
1 2022-10-25 1 150 NaN (no rows before with A=1 for ID 1)
1 2022-05-17 4 160 NaN (no rows before with A=1 for ID 1)
2 2022-12-01 2 170 220
2 2022-11-19 1 220 160
2 2022-10-10 1 160 NaN
3 2022-11-11 3 350 200
3 2022-09-11 1 200 NaN
3 2022-08-19 3 160 NaN
3 2022-07-20 3 190 NaN
3 2022-05-11 3 200 NaN
Thanks in advance.
CodePudding user response:
s1 = df['Salary'].where(df['A'].eq(1)).groupby(df['ID']).bfill()
s2 = df.groupby(['ID', 'A'])['Salary'].shift(-1)
out = df.assign(Salary_argmin_recent_A=s1.mask(df['A'].eq(1), s2))
out
ID Date A Salary Salary_argmin_recent_A
0 1 2022-12-01 2 100 150.0
1 1 2022-11-11 3 200 150.0
2 1 2022-10-25 1 150 NaN
3 1 2022-05-17 4 160 NaN
4 2 2022-12-01 2 170 220.0
5 2 2022-11-19 1 220 160.0
6 2 2022-10-10 1 160 NaN
7 3 2022-11-11 3 350 200.0
8 3 2022-09-11 1 200 NaN
9 3 2022-08-19 3 160 NaN
10 3 2022-07-20 3 190 NaN
11 3 2022-05-11 3 200 NaN
CodePudding user response:
Given you're looking for the values below, the first thing that comes to mind is to iterate through rows. Usually iteration should be escaped and there might be a more elegant solution to this, but at least it works.
import pandas as pd
df = pd.read_clipboard()
new_col = []
for index, row in df.iterrows():
df_below = df.iloc[index 1:]
match = df_below[(df_below.A == 1) & (df_below.ID == row.ID)].Salary
if match.any():
new_col.append(match.iloc[0])
else:
new_col.append(None)
df['Salary_argmin_recent_A'] = new_col
print(df)
Output:
ID Date A Salary Salary_argmin_recent_A
0 1 2022-12-01 2 100 150.0
1 1 2022-11-11 3 200 150.0
2 1 2022-10-25 1 150 NaN
3 1 2022-05-17 4 160 NaN
4 2 2022-12-01 2 170 220.0
5 2 2022-11-19 1 220 160.0
6 2 2022-10-10 1 160 NaN
7 3 2022-11-11 3 350 200.0
8 3 2022-09-11 1 200 NaN
9 3 2022-08-19 3 160 NaN
10 3 2022-07-20 3 190 NaN
11 3 2022-05-11 3 200 NaN
Here I iterate through all the rows, on each iteration I take the index value and store a cut of DataFrame in df_below
which represents all the rows below current row.
I than filter values by A
and ID
, results may contain nothing, one or more values. So I check if there are some results and if true, I take the first value, the most recent one.