I'm newbie in python. I have a data case like this:
I would like to know the closest date to the given date, but it should not exceed the given date. Here's the data:
Date_1 | Date_2 | Date_3 | Given_Date |
---|---|---|---|
23/12/2022 12:56:49 | 27/12/2022 11:22:32 | 30/12/2022 12:32:53 | 26/12/2022 13:22:39 |
03/12/2022 11:23:46 | 05/12/2022 13:42:12 | 07/12/2022 11:11:21 | 08/12/2022 13:15:20 |
15/11/2022 09:07:56 | 16/11/2022 10:37:51 | 18/11/2022 21:13:22 | 16/11/2022 13:21:23 |
And I would like to categorize them like this: \n \n
Date_1 | Date_2 | Date_3 | Given_Date | Closest_Date |
---|---|---|---|---|
23/12/2022 12:56:49 | 27/12/2022 11:22:32 | 30/12/2022 12:32:53 | 26/12/2022 13:22:39 | Date_1 |
03/12/2022 11:23:46 | 05/12/2022 13:42:12 | 07/12/2022 11:11:21 | 08/12/2022 13:15:20 | Date_3 |
15/11/2022 09:07:56 | 16/11/2022 10:37:51 | 18/11/2022 21:13:22 | 16/11/2022 13:21:23 | Date_2 |
How could I categorize them based on the given table using python? Thank you in advance guys
CodePudding user response:
You can convert to_datetime
, get the diff to Closest_Date, filter the dates, and get the idxmin
:
df2 = df.apply(pd.to_datetime, dayfirst=True)
df['Closest_Date'] = (df2
.filter(regex='^Date')
.rsub(df2['Given_Date'], axis=0)
.pipe(lambda d: d.mask(d.lt('0')))
.idxmin(axis=1)
)
NB. saving the conversion to datetime in a new dataframe is only required if the original strings need to be maintained.
output:
Date_1 Date_2 Date_3 \
0 23/12/2022 12:56:49 27/12/2022 11:22:32 30/12/2022 12:32:53
1 03/12/2022 11:23:46 05/12/2022 13:42:12 07/12/2022 11:11:21
2 15/11/2022 09:07:56 16/11/2022 10:37:51 18/11/2022 21:13:22
Given_Date Closest_Date
0 26/12/2022 13:22:39 Date_1
1 08/12/2022 13:15:20 Date_3
2 16/11/2022 13:21:23 Date_2
CodePudding user response:
Another approach using pandas.merge_asof
, which will have the advantage of giving a NaN if no match is found:
df2 = df.apply(pd.to_datetime, dayfirst=True)
s = (pd.merge_asof(df2['Given_Date'].reset_index().sort_values(by='Given_Date'),
df2.filter(regex='^Date').stack().sort_values()
.reset_index(name='Given_Date'),
left_by='index', right_by='level_0', on='Given_Date'
)
.set_index('level_0')['level_1']
)
df['Closest_Date'] = s
output:
Date_1 Date_2 Date_3 \
0 23/12/2022 12:56:49 27/12/2022 11:22:32 30/12/2022 12:32:53
1 03/12/2022 11:23:46 05/12/2022 13:42:12 07/12/2022 11:11:21
2 15/11/2022 09:07:56 16/11/2022 10:37:51 18/11/2022 21:13:22
Given_Date Closest_Date
0 26/12/2022 13:22:39 Date_1
1 08/12/2022 13:15:20 Date_3
2 16/11/2022 13:21:23 Date_2