I have a DataFrame which looks like this:
ID | Record Date | Cutoff Date | Value |
---|---|---|---|
A | 2022-01-01 | 2022-01-31 | 1 |
A | 2022-01-15 | 2022-01-31 | 2 |
A | 2022-01-28 | 2022-01-31 | 3 |
B | 2022-01-15 | 2022-01-31 | 1 |
B | 2022-01-28 | 2022-01-31 | 2 |
B | 2022-01-31 | 2022-01-31 | 3 |
C | 2022-01-01 | 2022-01-31 | 1 |
C | 2022-01-27 | 2022-01-31 | 2 |
C | 2022-02-01 | 2022-01-31 | 3 |
C | 2022-02-03 | 2022-01-31 | 4 |
D | 2022-06-03 | 2022-01-31 | 1 |
For each ID
, I want to keep only 1 row, based on the following criteria:
- If there is a row with
Record Date
==Cutoff Date
, keep this row. - If all
Record Date
is beforeCutoff Date
, keep the row which is closet toCutoff Date
. - If there exist any
Record Date
afterCutoff Date
, keep the earliest row afterCutoff Date
. - If there is only 1 row for an
ID
, keep this row.
Note that Cutoff Date
is constant for each ID
. (I set all Cutoff Date
identical to make it simplier).
The desired output:
ID | Record Date | Cutoff Date | Value |
---|---|---|---|
A | 2022-01-28 | 2022-01-31 | 3 |
B | 2022-01-31 | 2022-01-31 | 3 |
C | 2022-02-01 | 2022-01-31 | 3 |
D | 2022-06-03 | 2022-01-31 | 1 |
CodePudding user response:
IIUC, you can craft a sorter DataFrame based on the timedeltas and use it to select the top index:
s = df['Record Date'].sub(df['Cutoff Date'])
sorter = pd.concat([-s.clip(upper='0'), s.abs()], axis=1).sort_values(by=[0,1])
out = df.loc[sorter.groupby(df['ID']).head(1).sort_index().index]
output:
ID Record Date Cutoff Date Value
2 A 2022-01-28 2022-01-31 3
5 B 2022-01-31 2022-01-31 3
8 C 2022-02-01 2022-01-31 3
10 D 2022-06-03 2022-01-31 1
intermediate sorter, 0
enables the conditions 1 and 3, 1
the condition 2, and by default the 4 will be met because of head(1)
:
0 1
5 0 days 0 days
8 0 days 1 days
9 0 days 3 days
10 0 days 123 days
2 3 days 3 days
4 3 days 3 days
7 4 days 4 days
1 16 days 16 days
3 16 days 16 days
0 30 days 30 days
6 30 days 30 days
CodePudding user response:
You can use:
out = (df.loc[df['Record Date'].sub(df['Cutoff Date']).abs()
.groupby(df['ID']).idxmin()])
print(out)
# Output
ID Record Date Cutoff Date Value
2 A 2022-01-28 2022-01-31 3
5 B 2022-01-31 2022-01-31 3
8 C 2022-02-01 2022-01-31 3
10 D 2022-06-03 2022-01-31 1