How to Find The Closest Date in Rows Using Python?-CodePudding

I'm newbie in python. I have a data case like this:

I would like to know the closest date to the given date, but it should not exceed the given date. Here's the data:

Date_1	Date_2	Date_3	Given_Date
23/12/2022 12:56:49	27/12/2022 11:22:32	30/12/2022 12:32:53	26/12/2022 13:22:39
03/12/2022 11:23:46	05/12/2022 13:42:12	07/12/2022 11:11:21	08/12/2022 13:15:20
15/11/2022 09:07:56	16/11/2022 10:37:51	18/11/2022 21:13:22	16/11/2022 13:21:23

And I would like to categorize them like this: \n \n

Date_1	Date_2	Date_3	Given_Date	Closest_Date
23/12/2022 12:56:49	27/12/2022 11:22:32	30/12/2022 12:32:53	26/12/2022 13:22:39	Date_1
03/12/2022 11:23:46	05/12/2022 13:42:12	07/12/2022 11:11:21	08/12/2022 13:15:20	Date_3
15/11/2022 09:07:56	16/11/2022 10:37:51	18/11/2022 21:13:22	16/11/2022 13:21:23	Date_2

How could I categorize them based on the given table using python? Thank you in advance guys

CodePudding user response：

You can convert to_datetime, get the diff to Closest_Date, filter the dates, and get the idxmin:

df2 = df.apply(pd.to_datetime, dayfirst=True)

df['Closest_Date'] = (df2
                      .filter(regex='^Date')
                      .rsub(df2['Given_Date'], axis=0)
                      .pipe(lambda d: d.mask(d.lt('0')))
                      .idxmin(axis=1)
                      )

NB. saving the conversion to datetime in a new dataframe is only required if the original strings need to be maintained.

output:

                Date_1               Date_2               Date_3  \
0  23/12/2022 12:56:49  27/12/2022 11:22:32  30/12/2022 12:32:53   
1  03/12/2022 11:23:46  05/12/2022 13:42:12  07/12/2022 11:11:21   
2  15/11/2022 09:07:56  16/11/2022 10:37:51  18/11/2022 21:13:22   

            Given_Date Closest_Date  
0  26/12/2022 13:22:39       Date_1  
1  08/12/2022 13:15:20       Date_3  
2  16/11/2022 13:21:23       Date_2

CodePudding user response：

Another approach using pandas.merge_asof, which will have the advantage of giving a NaN if no match is found:

df2 = df.apply(pd.to_datetime, dayfirst=True)

s = (pd.merge_asof(df2['Given_Date'].reset_index().sort_values(by='Given_Date'),
                   df2.filter(regex='^Date').stack().sort_values()
                      .reset_index(name='Given_Date'),
                   left_by='index', right_by='level_0', on='Given_Date'
                   )
       .set_index('level_0')['level_1']
    )

df['Closest_Date'] = s

output:

                Date_1               Date_2               Date_3  \
0  23/12/2022 12:56:49  27/12/2022 11:22:32  30/12/2022 12:32:53   
1  03/12/2022 11:23:46  05/12/2022 13:42:12  07/12/2022 11:11:21   
2  15/11/2022 09:07:56  16/11/2022 10:37:51  18/11/2022 21:13:22   

            Given_Date Closest_Date  
0  26/12/2022 13:22:39       Date_1  
1  08/12/2022 13:15:20       Date_3  
2  16/11/2022 13:21:23       Date_2