I am attempting to perform an inner merge of two large dataframes having columns 'ID' and 'Date'. A sample of each is shown below:
df1
ID Date
0 RHD78 2022-08-05
1 RHD78 2022-08-06
2 RHD78 2022-08-09
3 RHD78 2022-08-11
4 RHD78 2022-08-12
5 RHD78 2022-08-14
6 RHD78 2022-08-15
7 RHD78 2022-08-19
8 BDW56 2022-03-15
9 BDW56 2022-03-16
10 BDW56 2022-03-17
11 BDW56 2022-03-22
12 BDW56 2022-03-23
13 BDW56 2022-03-27
14 BDW56 2022-03-29
15 BDW56 2022-03-30
df2
ID Date
0 RHD78 2022-08-12
1 BDW56 2022-03-23
If I use the code df_result = pd.merge(df1, df2, how='inner', on='Date')
then I get the two intersecting datapoints. However, I am struggling to introduce a timedelta so that the resulting dataframe also includes the data from 4 days before and after the intersecting dates, like so:
df_desired
ID Date
0 RHD78 2022-08-09
1 RHD78 2022-08-11
2 RHD78 2022-08-12
3 RHD78 2022-08-14
4 RHD78 2022-08-15
5 BDW56 2022-03-22
6 BDW56 2022-03-23
7 BDW56 2022-03-27
I tried looking into the merge_asof() function, but my understanding is that it gets only the values closest to the date, not those within a particular date range. I am learning pandas and Python, so I would appreciate it if someone could help me solve this issue and provide a simplified explanation of merge_asof().
CodePudding user response:
Using df as your first dataframe and df2 as the second, I followed the same procedure as in this answer: cross merge the two together, then filter after the merge has occurred. A cross merge is just a blanket merge that pairs every row of one dataframe with every row of the other. This might not be practical if your two dataframes are immensely large.
merge = df2.merge(df, how='cross')
merge['timedelta'] = pd.to_datetime(merge['Date_x']) - \
    pd.to_datetime(merge['Date_y'])
# keep pairs whose dates are at most 4 days apart and whose IDs match
# (the ID check guards against cross-ID pairs that happen to be close in time)
merge_filt = merge.loc[(merge['timedelta'].dt.days.abs() <= 4)
                       & (merge['ID_x'] == merge['ID_y'])]
print(merge_filt)
Out[43]:
ID_x Date_x ID_y Date_y timedelta
2 RHD78 2022-08-12 RHD78 2022-08-09 3 days
3 RHD78 2022-08-12 RHD78 2022-08-11 1 days
4 RHD78 2022-08-12 RHD78 2022-08-12 0 days
5 RHD78 2022-08-12 RHD78 2022-08-14 -2 days
6 RHD78 2022-08-12 RHD78 2022-08-15 -3 days
27 BDW56 2022-03-23 BDW56 2022-03-22 1 days
28 BDW56 2022-03-23 BDW56 2022-03-23 0 days
29 BDW56 2022-03-23 BDW56 2022-03-27 -4 days
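Put together end to end, a self-contained sketch of this cross-merge-and-filter approach (sample frames rebuilt from the question) looks like this:

```python
import pandas as pd

# sample data rebuilt from the question
df1 = pd.DataFrame({
    "ID": ["RHD78"] * 8 + ["BDW56"] * 8,
    "Date": pd.to_datetime([
        "2022-08-05", "2022-08-06", "2022-08-09", "2022-08-11",
        "2022-08-12", "2022-08-14", "2022-08-15", "2022-08-19",
        "2022-03-15", "2022-03-16", "2022-03-17", "2022-03-22",
        "2022-03-23", "2022-03-27", "2022-03-29", "2022-03-30",
    ]),
})
df2 = pd.DataFrame({
    "ID": ["RHD78", "BDW56"],
    "Date": pd.to_datetime(["2022-08-12", "2022-03-23"]),
})

# cross merge: every df2 row paired with every df1 row (2 x 16 = 32 pairs)
pairs = df2.merge(df1, how="cross", suffixes=("_x", "_y"))

# keep pairs with matching IDs whose dates are at most 4 days apart
delta = (pairs["Date_x"] - pairs["Date_y"]).dt.days.abs()
result = (pairs.loc[(pairs["ID_x"] == pairs["ID_y"]) & (delta <= 4),
                    ["ID_y", "Date_y"]]
          .rename(columns={"ID_y": "ID", "Date_y": "Date"}))
print(result)   # the 8 rows of df_desired
```

Note that the 32-pair intermediate grows to len(df1) * len(df2) rows on real data, which is the memory concern mentioned above.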
CodePudding user response:
Supposing your Date column is a datetime object, we could do something like this:
d = pd.to_timedelta(4, 'days')
df2['Date'] = df2['Date'].map(lambda x: pd.date_range(x - d, x + d))
df1.merge(df2.set_index('ID').explode('Date').reset_index())
>>>
ID Date
0 RHD78 2022-08-09
1 RHD78 2022-08-11
2 RHD78 2022-08-12
3 RHD78 2022-08-14
4 RHD78 2022-08-15
5 BDW56 2022-03-22
6 BDW56 2022-03-23
7 BDW56 2022-03-27
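To make the intermediate step visible, here is a self-contained sketch of just the expansion (df2 rebuilt from the question's sample data): date_range turns each df2 date into a 9-day window, and explode flattens each window into one row per date.

```python
import pandas as pd

df2 = pd.DataFrame({
    "ID": ["RHD78", "BDW56"],
    "Date": pd.to_datetime(["2022-08-12", "2022-03-23"]),
})

d = pd.to_timedelta(4, "days")
# each Date cell now holds a DatetimeIndex of 9 dates: x - 4 days .. x + 4 days
df2["Date"] = df2["Date"].map(lambda x: pd.date_range(x - d, x + d))

# explode gives one row per (ID, candidate date) pair
expanded = df2.explode("Date").reset_index(drop=True)
print(expanded)   # 18 rows: 9 candidate dates per ID
```

An inner merge of df1 against this expanded frame (on both ID and Date, the shared columns) then keeps exactly the df1 rows that fall inside a window.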
CodePudding user response:
For this use case, you can avoid a cartesian join or building date ranges; an efficient approach is to do an inner merge and filter after:
df2 = df2.assign(start = df2.Date - pd.Timedelta(days=4),
                 end = df2.Date + pd.Timedelta(days=4))
(df
 .merge(df2.drop(columns='Date'), on='ID')
 .loc[lambda d: d.Date.between(d.start, d.end, inclusive='both'), df.columns])
ID Date
2 RHD78 2022-08-09
3 RHD78 2022-08-11
4 RHD78 2022-08-12
5 RHD78 2022-08-14
6 RHD78 2022-08-15
11 BDW56 2022-03-22
12 BDW56 2022-03-23
13 BDW56 2022-03-27
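As for the merge_asof() part of the question: you are right that merge_asof matches each left row to at most one right row, but its tolerance parameter limits how far away that nearest match may be, so combined with direction='nearest' it can also express "within 4 days". A hedged sketch, assuming the sample data from the question; the near column is a helper introduced here only to mark which rows found a match:

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["RHD78"] * 8 + ["BDW56"] * 8,
    "Date": pd.to_datetime([
        "2022-08-05", "2022-08-06", "2022-08-09", "2022-08-11",
        "2022-08-12", "2022-08-14", "2022-08-15", "2022-08-19",
        "2022-03-15", "2022-03-16", "2022-03-17", "2022-03-22",
        "2022-03-23", "2022-03-27", "2022-03-29", "2022-03-30",
    ]),
})
df2 = pd.DataFrame({
    "ID": ["RHD78", "BDW56"],
    "Date": pd.to_datetime(["2022-08-12", "2022-03-23"]),
})

# both frames must be sorted by the "on" key for merge_asof
left = df1.sort_values("Date")
right = df2.assign(near=df2["Date"]).sort_values("Date")

matched = pd.merge_asof(
    left, right,
    on="Date", by="ID",             # match within the same ID only
    direction="nearest",            # nearest df2 date, earlier or later
    tolerance=pd.Timedelta(days=4), # ...but no more than 4 days away
)

# rows whose "near" is NaT had no df2 date within 4 days
result = matched.dropna(subset=["near"])[["ID", "Date"]]
print(result)
```

One caveat: if two df2 dates for the same ID both fell within 4 days of a df1 row, merge_asof would keep only the nearer one, whereas the window-based approaches above would report both pairings.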