I have a couple of decent-sized dataframes that look like:
df_B
id start_time end_time side cost
1234 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 BUY 100
1564 2021-01-01 16:05:00.100000 2021-01-01 16:10:00.100000 BUY 111
7535 2021-01-01 16:40:00.100000 2021-01-01 16:55:00.100000 BUY 124
9999 2021-01-01 16:44:00.100000 2021-01-01 16:45:00.100000 BUY 128
df_S
id start_time end_time side cost
5366 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 SELL 100
4533 2021-01-01 16:05:00.100000 2021-01-01 16:08:00.100000 SELL 105
4532 2021-01-01 16:20:00.100000 2021-01-01 16:50:00.100000 SELL 122
5827 2021-01-01 16:30:00.100000 2021-01-01 16:35:00.100000 SELL 123
I would like to create a new dataframe such that, for each id in df_B, the matching df_S row is appended if df_S.cost <= df_B.cost and df_S.start_time <= df_B.end_time; otherwise the df_S columns are left empty.
E.g., desired output:
id start_time end_time side cost id_S start_time_S end_time_S side_S cost_S
1234 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 BUY 100 5366 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 SELL 100
1564 2021-01-01 16:05:00.100000 2021-01-01 16:10:00.100000 BUY 111 4533 2021-01-01 16:05:00.100000 2021-01-01 16:08:00.100000 SELL 105
7535 2021-01-01 16:40:00.100000 2021-01-01 16:55:00.100000 BUY 124
9999 2021-01-01 16:44:00.100000 2021-01-01 16:45:00.100000 BUY 128
Could you please advise how I can write this efficiently for a large dataframe?
CodePudding user response:
You can create a boolean mask and then merge the two dataframes on their index. This compares the dataframes row by row, so both must share the same index.
If you don't have a default index, or you are using id as the index, just do df.reset_index() on each dataframe first.
Create the mask (each comparison needs its own parentheses, because & binds more tightly than <=):
mask = (df_S.cost <= df_B.cost) & (df_S.start_time <= df_B.end_time)
Then merge the masked dataframes on the index:
pd.merge(df_B[mask], df_S[mask], suffixes=('', '_S'), how='left', left_index=True, right_index=True)
Unfortunately, this method drops every row for which the condition is false.
The result should be:
id start_time end_time side cost id_S start_time_S end_time_S side_S cost_S
1234 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 BUY 100 5366 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 SELL 100
1564 2021-01-01 16:05:00.100000 2021-01-01 16:10:00.100000 BUY 111 4533 2021-01-01 16:05:00.100000 2021-01-01 16:08:00.100000 SELL 105
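If the unmatched df_B rows should be kept, as in the desired output of the question, one option is to mask only df_S and left-join it onto df_B. The following is a minimal sketch, assuming both dataframes share the same default index and that the comparison is meant row by row; the helper names make_frame, t and rowwise_match are made up for this illustration.

```python
import pandas as pd

COLS = ["id", "start_time", "end_time", "side", "cost"]

def t(hhmm):
    # Hypothetical helper: expand "16:00" into the full timestamp used in the question.
    return pd.Timestamp(f"2021-01-01 {hhmm}:00.100000")

def make_frame(rows):
    # Build a dataframe with the question's columns.
    return pd.DataFrame(rows, columns=COLS)

df_B = make_frame([
    (1234, t("16:00"), t("16:02"), "BUY", 100),
    (1564, t("16:05"), t("16:10"), "BUY", 111),
    (7535, t("16:40"), t("16:55"), "BUY", 124),
    (9999, t("16:44"), t("16:45"), "BUY", 128),
])
df_S = make_frame([
    (5366, t("16:00"), t("16:02"), "SELL", 100),
    (4533, t("16:05"), t("16:08"), "SELL", 105),
    (4532, t("16:20"), t("16:50"), "SELL", 122),
    (5827, t("16:30"), t("16:35"), "SELL", 123),
])

def rowwise_match(df_B, df_S):
    # Row-by-row comparison: row i of df_S is tested against row i of df_B.
    mask = (df_S["cost"] <= df_B["cost"]) & (df_S["start_time"] <= df_B["end_time"])
    # Keep only the df_S rows that pass, then left-join them onto df_B by index,
    # so df_B rows without a passing partner get NaN/NaT in the *_S columns.
    return df_B.join(df_S[mask].add_suffix("_S"))

result = rowwise_match(df_B, df_S)
print(result)
```

Note that this only ever compares rows at the same position; it does not search all of df_S for each id in df_B.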
CodePudding user response:
Looking at the match of the record with id 1564, I believe you want to create a 1:1 matching, given that other records satisfy the matching conditions as well.
The closest function I believe you are looking for is pd.merge_asof.
However, it only allows for merging based on one condition. It is not implemented for merging based on two conditions. And this is for good reason, I believe.
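To illustrate the single-condition case, here is a minimal sketch of pd.merge_asof applied to the time condition only: for each df_B row it picks the latest df_S row whose start_time is at or before the df_B end_time. The cost condition is not part of the merge; it is only checked afterwards in a hypothetical cost_ok column. The helper t is made up for this example.

```python
import pandas as pd

COLS = ["id", "start_time", "end_time", "side", "cost"]

def t(hhmm):
    # Hypothetical helper: expand "16:00" into the full timestamp used in the question.
    return pd.Timestamp(f"2021-01-01 {hhmm}:00.100000")

df_B = pd.DataFrame([
    (1234, t("16:00"), t("16:02"), "BUY", 100),
    (1564, t("16:05"), t("16:10"), "BUY", 111),
    (7535, t("16:40"), t("16:55"), "BUY", 124),
    (9999, t("16:44"), t("16:45"), "BUY", 128),
], columns=COLS)
df_S = pd.DataFrame([
    (5366, t("16:00"), t("16:02"), "SELL", 100),
    (4533, t("16:05"), t("16:08"), "SELL", 105),
    (4532, t("16:20"), t("16:50"), "SELL", 122),
    (5827, t("16:30"), t("16:35"), "SELL", 123),
], columns=COLS)

# merge_asof requires both inputs to be sorted by their merge keys.
result = pd.merge_asof(
    df_B.sort_values("end_time"),
    df_S.sort_values("start_time"),
    left_on="end_time",       # df_B key
    right_on="start_time",    # df_S key
    direction="backward",     # latest df_S.start_time <= df_B.end_time
    suffixes=("", "_S"),
)

# The second condition has to be checked separately after the merge.
result["cost_ok"] = result["cost_S"] <= result["cost"]
print(result[["id", "end_time", "id_S", "start_time_S", "cost", "cost_S", "cost_ok"]])
```

Because the candidate is chosen on time alone, a row flagged with cost_ok == False would not fall back to an earlier df_S row that does satisfy the cost condition, which is exactly the limitation described above.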
For example, why does your expected result match 4533 to 1564, but not 5366 instead, given that both satisfy the merge condition? In this case, you may argue that 1564's values for start_time and cost are both closer to the respective values of 4533 compared to 5366, i.e., 1564 is a "better match".
But, in general, what rule would you want to impose in case the matching condition is satisfied for two candidate records, say A and B, where, for example, record A is a "better match" with respect to condition 1 and record B with respect to condition 2? You would have to provide more information/rules to write an algorithm.
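To make the need for a rule concrete, here is a minimal sketch that picks one df_S record per df_B record under an assumed tie-break: prefer the latest start_time_S, then the highest cost_S. This rule is purely hypothetical and not something the question specifies. The sketch uses a cross join (how="cross", available in pandas 1.2 and later), which builds every pair and is therefore only practical for moderately sized inputs. The names t and match_one are made up for the example.

```python
import pandas as pd

COLS = ["id", "start_time", "end_time", "side", "cost"]

def t(hhmm):
    # Hypothetical helper: expand "16:00" into the full timestamp used in the question.
    return pd.Timestamp(f"2021-01-01 {hhmm}:00.100000")

df_B = pd.DataFrame([
    (1234, t("16:00"), t("16:02"), "BUY", 100),
    (1564, t("16:05"), t("16:10"), "BUY", 111),
    (7535, t("16:40"), t("16:55"), "BUY", 124),
    (9999, t("16:44"), t("16:45"), "BUY", 128),
], columns=COLS)
df_S = pd.DataFrame([
    (5366, t("16:00"), t("16:02"), "SELL", 100),
    (4533, t("16:05"), t("16:08"), "SELL", 105),
    (4532, t("16:20"), t("16:50"), "SELL", 122),
    (5827, t("16:30"), t("16:35"), "SELL", 123),
], columns=COLS)

def match_one(df_B, df_S):
    # Every (df_B row, df_S row) pair; df_S columns get the "_S" suffix.
    pairs = df_B.merge(df_S, how="cross", suffixes=("", "_S"))
    # Keep only the pairs that satisfy both conditions from the question.
    ok = (pairs["cost_S"] <= pairs["cost"]) & (pairs["start_time_S"] <= pairs["end_time"])
    # Assumed tie-break: latest start_time_S first, then highest cost_S.
    best = (
        pairs[ok]
        .sort_values(["start_time_S", "cost_S"])
        .drop_duplicates("id", keep="last")
    )
    s_cols = [c for c in best.columns if c.endswith("_S")]
    # Left merge so df_B rows without any candidate keep empty *_S columns.
    return df_B.merge(best[["id"] + s_cols], on="id", how="left")

result = match_one(df_B, df_S)
print(result[["id", "cost", "id_S", "start_time_S", "cost_S"]])
```

With this rule a df_S record can be assigned to more than one df_B record; enforcing a strict 1:1 assignment would need yet another rule.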
If you want 1:m matching and you have a decent data size, consider SQLite; I find it to be more appropriate for such problems.
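A 1:m match along these lines could be sketched with Python's built-in sqlite3 module and an in-memory database. This is only an illustration under a few assumptions: the timestamps are stored as fixed-format text, so string comparison agrees with chronological order, and the table names df_B and df_S simply mirror the dataframe names. The helper t is made up for the example.

```python
import sqlite3
import pandas as pd

COLS = ["id", "start_time", "end_time", "side", "cost"]

def t(hhmm):
    # Hypothetical helper: fixed-format timestamp text, so that text
    # comparison in SQLite matches chronological order.
    return f"2021-01-01 {hhmm}:00.100000"

df_B = pd.DataFrame([
    (1234, t("16:00"), t("16:02"), "BUY", 100),
    (1564, t("16:05"), t("16:10"), "BUY", 111),
    (7535, t("16:40"), t("16:55"), "BUY", 124),
    (9999, t("16:44"), t("16:45"), "BUY", 128),
], columns=COLS)
df_S = pd.DataFrame([
    (5366, t("16:00"), t("16:02"), "SELL", 100),
    (4533, t("16:05"), t("16:08"), "SELL", 105),
    (4532, t("16:20"), t("16:50"), "SELL", 122),
    (5827, t("16:30"), t("16:35"), "SELL", 123),
], columns=COLS)

con = sqlite3.connect(":memory:")
df_B.to_sql("df_B", con, index=False)
df_S.to_sql("df_S", con, index=False)

# LEFT JOIN keeps df_B rows with no candidate; both conditions go in ON.
query = """
SELECT b.id, b.start_time, b.end_time, b.side, b.cost,
       s.id AS id_S, s.start_time AS start_time_S, s.end_time AS end_time_S,
       s.side AS side_S, s.cost AS cost_S
FROM df_B AS b
LEFT JOIN df_S AS s
  ON s.cost <= b.cost AND s.start_time <= b.end_time
ORDER BY b.id, s.id
"""
result = pd.read_sql_query(query, con)
con.close()
print(result[["id", "id_S", "cost", "cost_S"]])
```

Each df_B record appears once per df_S record that satisfies both conditions, so the output has more rows than df_B whenever several candidates qualify.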