I am trying to merge two dataframes by nearest time point.
I tried merge_asof
. However, the issue is two entries have the same time point, then
merge_asof
will only merge one entry into the final dataframe.
Here is an example
df_A:
time ticker price
09:10:13 VOD 110.96
df_B:
time ticker volume
09:10:12 VOD 35412
09:10:12 VOD 12343
I tried
pd.merge(df_A, df_B, how = 'left', on = ['time'], by=['ticker'], direction='backward')
And it only returns
time ticker price volume
09:10:13 VOD 110.96 12343
As you can see one entry/row is deleted here. I want to use merge_asof and keep all entries under the same nearest time point.
Ideal output:
time ticker price volume
09:10:12 VOD 110.96 35412
09:10:12 VOD 110.96 12343
Any advise ? Thank you!
CodePudding user response:
You need to perform a double merge
, first with the merge_asof
, then a classical merge
. The trick is to identify the duplicates in the asof merged right DataFrame:
df_B2 = df_B.assign(idx=df_B.groupby('time').ngroup())
out = (pd
.merge_asof(df_A, df_B2, on='time', by='ticker', direction='backward')
.merge(df_B2, on=['ticker', 'idx'])
#.drop(columns='idx')
)
NB. I used datetime
columns for the example
output:
time_x ticker price volume_x idx time_y volume_y
0 2022-06-27 09:10:13 VOD 110.96 12343 0 2022-06-27 09:10:12 35412
1 2022-06-27 09:10:13 VOD 110.96 12343 0 2022-06-27 09:10:12 12343
used input:
df_A = pd.DataFrame({'time': [Timestamp('2022-06-27 09:10:13')],
'ticker': ['VOD'], 'price': [110.96]})
df_B = pd.DataFrame({'time': [Timestamp('2022-06-27 09:10:12'), Timestamp('2022-06-27 09:10:12')],
'ticker': ['VOD', 'VOD'],
'volume': [35412, 12343]})