I am attempting to perform an inner merge of two large dataframes having columns 'ID' and 'Date'. A sample of each is shown below:
df1
ID Date
0 RHD78 2022-08-05
1 RHD78 2022-08-06
2 RHD78 2022-08-09
3 RHD78 2022-08-11
4 RHD78 2022-08-12
5 RHD78 2022-08-14
6 RHD78 2022-08-15
7 RHD78 2022-08-19
8 BDW56 2022-03-15
9 BDW56 2022-03-16
10 BDW56 2022-03-17
11 BDW56 2022-03-22
12 BDW56 2022-03-23
13 BDW56 2022-03-27
14 BDW56 2022-03-29
15 BDW56 2022-03-30
df2
ID Date
0 RHD78 2022-08-12
1 BDW56 2022-03-23
If I use the code df_result = pd.merge(df1, df2, how='inner', on='Date')
then I get the two intersecting datapoints. However, I am struggling to introduce a timedelta so that the resulting dataframe also includes the data from 4 days before and after the intersecting dates, like so:
df_desired
ID Date
0 RHD78 2022-08-09
1 RHD78 2022-08-11
2 RHD78 2022-08-12
3 RHD78 2022-08-14
4 RHD78 2022-08-15
5 BDW56 2022-03-22
6 BDW56 2022-03-23
7 BDW56 2022-03-27
I tried looking into the merge_asof() function, but my understanding is that it gets only the values closest to the date, not those within a particular date range. I am learning pandas and Python, so I would appreciate it if someone could help me solve this issue and provide a simplified explanation of merge_asof().
CodePudding user response:
Using df as your first dataframe and df2 as the second, I followed the same procedure as in this answer: cross merge the two together, then filter after the merge has occurred. A cross merge is just a blanket merge that pairs every row of one dataframe with every row of the other. This might not be practical if your two dataframes are immensely large.
merge = df2.merge(df, how='cross')
merge['timedelta'] = pd.to_datetime(merge['Date_x']) - \
    pd.to_datetime(merge['Date_y'])
# keep pairs whose dates are at most 4 days apart and whose IDs match
# (the ID check guards against cross-ID pairs that happen to be close in time)
merge_filt = merge.loc[(merge['timedelta'].dt.days.abs() <= 4)
                       & (merge['ID_x'] == merge['ID_y'])]
print(merge_filt)
Out[43]:
ID_x Date_x ID_y Date_y timedelta
2 RHD78 2022-08-12 RHD78 2022-08-09 3 days
3 RHD78 2022-08-12 RHD78 2022-08-11 1 days
4 RHD78 2022-08-12 RHD78 2022-08-12 0 days
5 RHD78 2022-08-12 RHD78 2022-08-14 -2 days
6 RHD78 2022-08-12 RHD78 2022-08-15 -3 days
27 BDW56 2022-03-23 BDW56 2022-03-22 1 days
28 BDW56 2022-03-23 BDW56 2022-03-23 0 days
29 BDW56 2022-03-23 BDW56 2022-03-27 -4 days
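Put together end to end, a self-contained sketch of this cross-merge-and-filter approach (sample frames rebuilt from the question) looks like this:

```python
import pandas as pd

# sample data rebuilt from the question
df1 = pd.DataFrame({
    "ID": ["RHD78"] * 8 + ["BDW56"] * 8,
    "Date": pd.to_datetime([
        "2022-08-05", "2022-08-06", "2022-08-09", "2022-08-11",
        "2022-08-12", "2022-08-14", "2022-08-15", "2022-08-19",
        "2022-03-15", "2022-03-16", "2022-03-17", "2022-03-22",
        "2022-03-23", "2022-03-27", "2022-03-29", "2022-03-30",
    ]),
})
df2 = pd.DataFrame({
    "ID": ["RHD78", "BDW56"],
    "Date": pd.to_datetime(["2022-08-12", "2022-03-23"]),
})

# cross merge: every df2 row paired with every df1 row (2 x 16 = 32 pairs)
pairs = df2.merge(df1, how="cross", suffixes=("_x", "_y"))

# keep pairs with matching IDs whose dates are at most 4 days apart
delta = (pairs["Date_x"] - pairs["Date_y"]).dt.days.abs()
result = (pairs.loc[(pairs["ID_x"] == pairs["ID_y"]) & (delta <= 4),
                    ["ID_y", "Date_y"]]
          .rename(columns={"ID_y": "ID", "Date_y": "Date"}))
print(result)   # the 8 rows of df_desired
```

Note that the 32-pair intermediate grows to len(df1) * len(df2) rows on real data, which is the memory concern mentioned above.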
CodePudding user response:
Supposing your Date column is a datetime object, we could do something like this:
d = pd.to_timedelta(4, 'days')
df2['Date'] = df2['Date'].map(lambda x: pd.date_range(x - d, x + d))
df1.merge(df2.set_index('ID').explode('Date').reset_index())
>>>
ID Date
0 RHD78 2022-08-09
1 RHD78 2022-08-11
2 RHD78 2022-08-12
3 RHD78 2022-08-14
4 RHD78 2022-08-15
5 BDW56 2022-03-22
6 BDW56 2022-03-23
7 BDW56 2022-03-27
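To make the intermediate step visible, here is a self-contained sketch of just the expansion (df2 rebuilt from the question's sample data): date_range turns each df2 date into a 9-day window, and explode flattens each window into one row per date.

```python
import pandas as pd

df2 = pd.DataFrame({
    "ID": ["RHD78", "BDW56"],
    "Date": pd.to_datetime(["2022-08-12", "2022-03-23"]),
})

d = pd.to_timedelta(4, "days")
# each Date cell now holds a DatetimeIndex of 9 dates: x - 4 days .. x + 4 days
df2["Date"] = df2["Date"].map(lambda x: pd.date_range(x - d, x + d))

# explode gives one row per (ID, candidate date) pair
expanded = df2.explode("Date").reset_index(drop=True)
print(expanded)   # 18 rows: 9 candidate dates per ID
```

An inner merge of df1 against this expanded frame (on both ID and Date, the shared columns) then keeps exactly the df1 rows that fall inside a window.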
CodePudding user response:
For this use case, you can avoid a cartesian join or building date ranges; an efficient approach is to do an inner merge and filter after:
df2 = df2.assign(start = df2.Date - pd.Timedelta(days=4),
                 end = df2.Date + pd.Timedelta(days=4))
(df
 .merge(df2.drop(columns='Date'), on='ID')
 .loc[lambda d: d.Date.between(d.start, d.end, inclusive='both'), df.columns])
ID Date
2 RHD78 2022-08-09
3 RHD78 2022-08-11
4 RHD78 2022-08-12
5 RHD78 2022-08-14
6 RHD78 2022-08-15
11 BDW56 2022-03-22
12 BDW56 2022-03-23
13 BDW56 2022-03-27
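As for the merge_asof() part of the question: you are right that merge_asof matches each left row to at most one right row, but its tolerance parameter limits how far away that nearest match may be, so combined with direction='nearest' it can also express "within 4 days". A hedged sketch, assuming the sample data from the question; the near column is a helper introduced here only to mark which rows found a match:

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["RHD78"] * 8 + ["BDW56"] * 8,
    "Date": pd.to_datetime([
        "2022-08-05", "2022-08-06", "2022-08-09", "2022-08-11",
        "2022-08-12", "2022-08-14", "2022-08-15", "2022-08-19",
        "2022-03-15", "2022-03-16", "2022-03-17", "2022-03-22",
        "2022-03-23", "2022-03-27", "2022-03-29", "2022-03-30",
    ]),
})
df2 = pd.DataFrame({
    "ID": ["RHD78", "BDW56"],
    "Date": pd.to_datetime(["2022-08-12", "2022-03-23"]),
})

# both frames must be sorted by the "on" key for merge_asof
left = df1.sort_values("Date")
right = df2.assign(near=df2["Date"]).sort_values("Date")

matched = pd.merge_asof(
    left, right,
    on="Date", by="ID",             # match within the same ID only
    direction="nearest",            # nearest df2 date, earlier or later
    tolerance=pd.Timedelta(days=4), # ...but no more than 4 days away
)

# rows whose "near" is NaT had no df2 date within 4 days
result = matched.dropna(subset=["near"])[["ID", "Date"]]
print(result)
```

One caveat: if two df2 dates for the same ID both fell within 4 days of a df1 row, merge_asof would keep only the nearer one, whereas the window-based approaches above would report both pairings.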