Home > Software design >  Keep the row immediately after a date (Pandas)
Keep the row immediately after a date (Pandas)

Time:06-02

I have a DataFrame which looks like this:

ID Record Date Cutoff Date Value
A 2022-01-01 2022-01-31 1
A 2022-01-15 2022-01-31 2
A 2022-01-28 2022-01-31 3
B 2022-01-15 2022-01-31 1
B 2022-01-28 2022-01-31 2
B 2022-01-31 2022-01-31 3
C 2022-01-01 2022-01-31 1
C 2022-01-27 2022-01-31 2
C 2022-02-01 2022-01-31 3
C 2022-02-03 2022-01-31 4
D 2022-06-03 2022-01-31 1

For each ID, I want to keep only 1 row, based on the following criteria:

  1. If there is a row with Record Date == Cutoff Date, keep this row.
  2. If all Record Date is before Cutoff Date, keep the row which is closet to Cutoff Date.
  3. If there exist any Record Date after Cutoff Date, keep the earliest row after Cutoff Date.
  4. If there is only 1 row for an ID, keep this row.

Note that Cutoff Date is constant for each ID. (I set all Cutoff Date identical to make it simplier).

The desired output:

ID Record Date Cutoff Date Value
A 2022-01-28 2022-01-31 3
B 2022-01-31 2022-01-31 3
C 2022-02-01 2022-01-31 3
D 2022-06-03 2022-01-31 1

CodePudding user response:

IIUC, you can craft a sorter DataFrame based on the timedeltas and use it to select the top index:

s = df['Record Date'].sub(df['Cutoff Date'])

sorter = pd.concat([-s.clip(upper='0'), s.abs()], axis=1).sort_values(by=[0,1])

out = df.loc[sorter.groupby(df['ID']).head(1).sort_index().index]

output:

   ID Record Date Cutoff Date  Value
2   A  2022-01-28  2022-01-31      3
5   B  2022-01-31  2022-01-31      3
8   C  2022-02-01  2022-01-31      3
10  D  2022-06-03  2022-01-31      1

intermediate sorter, 0 enables the conditions 1 and 3, 1 the condition 2, and by default the 4 will be met because of head(1):

         0        1
5   0 days   0 days
8   0 days   1 days
9   0 days   3 days
10  0 days 123 days
2   3 days   3 days
4   3 days   3 days
7   4 days   4 days
1  16 days  16 days
3  16 days  16 days
0  30 days  30 days
6  30 days  30 days

CodePudding user response:

You can use:

out = (df.loc[df['Record Date'].sub(df['Cutoff Date']).abs()
                               .groupby(df['ID']).idxmin()])
print(out)

# Output
   ID Record Date Cutoff Date  Value
2   A  2022-01-28  2022-01-31      3
5   B  2022-01-31  2022-01-31      3
8   C  2022-02-01  2022-01-31      3
10  D  2022-06-03  2022-01-31      1
  • Related