Creating a new df joining two data frames (iterating each row) with multiple conditions

I have a couple of decent-sized dataframes that look like this:

df_B

id     start_time                   end_time                     side      cost 
1234   2021-01-01 16:00:00.100000   2021-01-01 16:02:00.100000   BUY       100
1564   2021-01-01 16:05:00.100000   2021-01-01 16:10:00.100000   BUY       111
7535   2021-01-01 16:40:00.100000   2021-01-01 16:55:00.100000   BUY       124
9999   2021-01-01 16:44:00.100000   2021-01-01 16:45:00.100000   BUY       128

df_S

id     start_time                   end_time                     side      cost 
5366   2021-01-01 16:00:00.100000   2021-01-01 16:02:00.100000   SELL       100
4533   2021-01-01 16:05:00.100000   2021-01-01 16:08:00.100000   SELL       105
4532   2021-01-01 16:20:00.100000   2021-01-01 16:50:00.100000   SELL       122
5827   2021-01-01 16:30:00.100000   2021-01-01 16:35:00.100000   SELL       123

I would like to create a new dataframe such that, for each id in df_B, the columns of a df_S row are appended if df_S.cost <= df_B.cost & df_S.start_time <= df_B.end_time

E.g., the desired output:

id     start_time                   end_time                     side      cost  id_S   start_time_S             end_time_S             side_S      cost_S 
1234   2021-01-01 16:00:00.100000   2021-01-01 16:02:00.100000   BUY       100   5366   2021-01-01 16:00:00.100000   2021-01-01 16:02:00.100000   SELL       100
1564   2021-01-01 16:05:00.100000   2021-01-01 16:10:00.100000   BUY       111   4533   2021-01-01 16:05:00.100000   2021-01-01 16:08:00.100000   SELL       105
7535   2021-01-01 16:40:00.100000   2021-01-01 16:55:00.100000   BUY       124
9999   2021-01-01 16:44:00.100000   2021-01-01 16:45:00.100000   BUY       128

Could you please advise how I can write this efficiently for a large dataframe?

CodePudding user response:

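You can create a boolean mask and then merge on the index. If you don't have a usable index, or you are using id as the index, just call df.reset_index() on both frames so that they share the same positional index. Note that the mask compares the frames row by row, so row i of df_S is only tested against row i of df_B.

Create the mask (each comparison needs its own parentheses, because & binds more tightly than <=):

mask = (df_S.cost <= df_B.cost) & (df_S.start_time <= df_B.end_time)

Then merge with the mask applied:

pd.merge(df_B[mask], df_S[mask], suffixes=('', '_S'), how='left', left_index=True, right_index=True)

Unfortunately, this method drops every row for which the condition is false, from both dataframes.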

The result should be:

id     start_time                   end_time                     side      cost  id_S   start_time_S             end_time_S             side_S      cost_S
1234   2021-01-01 16:00:00.100000   2021-01-01 16:02:00.100000   BUY       100   5366   2021-01-01 16:00:00.100000   2021-01-01 16:02:00.100000   SELL       100
1564   2021-01-01 16:05:00.100000   2021-01-01 16:10:00.100000   BUY       111   4533   2021-01-01 16:05:00.100000   2021-01-01 16:08:00.100000   SELL       105
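
If dropping the unmatched rows is a problem, one option is to test every df_B row against every df_S row instead of comparing by position. The sketch below is my own assumption rather than part of the method above: it uses a cross join (how="cross", available from pandas 1.2), filters on the two conditions, and left-merges the matches back onto df_B. The ts helper and the extra df_B row with id 1 are hypothetical, added only to show an unmatched row being kept.

```python
import pandas as pd

def ts(times):
    # Hypothetical helper: build the question's timestamps from "HH:MM" strings.
    return pd.to_datetime(["2021-01-01 " + t + ":00.100000" for t in times])

df_B = pd.DataFrame({
    "id": [1234, 1564, 7535, 9999, 1],  # id 1 is a made-up row with no match
    "start_time": ts(["16:00", "16:05", "16:40", "16:44", "16:50"]),
    "end_time": ts(["16:02", "16:10", "16:55", "16:45", "16:52"]),
    "side": "BUY",
    "cost": [100, 111, 124, 128, 50],
})
df_S = pd.DataFrame({
    "id": [5366, 4533, 4532, 5827],
    "start_time": ts(["16:00", "16:05", "16:20", "16:30"]),
    "end_time": ts(["16:02", "16:08", "16:50", "16:35"]),
    "side": "SELL",
    "cost": [100, 105, 122, 123],
})

# Pair every df_B row with every df_S row; df_S columns get the "_S" suffix.
pairs = df_B.merge(df_S, how="cross", suffixes=("", "_S"))

# Keep only the pairs that satisfy both conditions.
ok = (pairs["cost_S"] <= pairs["cost"]) & (pairs["start_time_S"] <= pairs["end_time"])
s_cols = [c for c in pairs.columns if c.endswith("_S")]
matched = pairs.loc[ok, ["id"] + s_cols]

# Left-merge back so df_B rows without any match survive with NaN/NaT.
result = df_B.merge(matched, on="id", how="left")
print(result[["id", "cost", "id_S", "cost_S"]])
```

This gives a 1:m result (one output row per qualifying pair). The cross join materialises len(df_B) * len(df_S) rows before filtering, so for large frames it may have to be done in chunks.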

CodePudding user response:

Looking at the match for the record with id 1564, I believe you want a 1:1 matching, since other records satisfy the matching conditions as well.

The closest function I believe you are looking for is pd.merge_asof.

However, it only allows merging on one condition, i.e. a nearest match on a single ordered key. It is not implemented for merging on two conditions, and this is for good reason, I believe.
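
As a rough illustration of that limitation, here is a sketch that uses pd.merge_asof for the time condition only and then checks the cost condition separately. The ts helper and the cost_ok column are names I made up for the example; the data is copied from the question.

```python
import pandas as pd

def ts(times):
    # Hypothetical helper: build the question's timestamps from "HH:MM" strings.
    return pd.to_datetime(["2021-01-01 " + t + ":00.100000" for t in times])

df_B = pd.DataFrame({
    "id": [1234, 1564, 7535, 9999],
    "start_time": ts(["16:00", "16:05", "16:40", "16:44"]),
    "end_time": ts(["16:02", "16:10", "16:55", "16:45"]),
    "side": "BUY",
    "cost": [100, 111, 124, 128],
})
df_S = pd.DataFrame({
    "id": [5366, 4533, 4532, 5827],
    "start_time": ts(["16:00", "16:05", "16:20", "16:30"]),
    "end_time": ts(["16:02", "16:08", "16:50", "16:35"]),
    "side": "SELL",
    "cost": [100, 105, 122, 123],
})

# merge_asof requires both frames to be sorted by their key column.
left = df_B.sort_values("end_time")
right = df_S.add_suffix("_S").sort_values("start_time_S")

# For each df_B row, take the last df_S row with start_time_S <= end_time.
out = pd.merge_asof(left, right, left_on="end_time", right_on="start_time_S",
                    direction="backward")

# The cost condition is not part of the as-of match, so check it afterwards.
out["cost_ok"] = out["cost_S"] <= out["cost"]
print(out[["id", "id_S", "cost", "cost_S", "cost_ok"]])
```

With this data every as-of match happens to satisfy the cost condition as well. In general, a row with cost_ok == False would need some fallback rule, which merge_asof does not provide.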

For example, why does your expected result match 4533 to 1564, but not 5366 instead, given that both satisfy the merge condition? In this case, you may argue that 1564's values for start_time and cost are both closer to the respective values of 4533 compared to 5366, i.e., 1564 is a "better match".

But, in general, what rule would you want to impose in case the matching condition is satisfied for two candidate records, say A and B, where, for example, record A is a "better match" with respect to condition 1 and record B with respect to 2? You would have to provide more information/rules to write an algorithm.
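
To make this concrete, here is a sketch with one possible rule. The rule is purely an assumption and not something the question specifies: among all qualifying df_S rows, prefer the latest start_time, and break remaining ties by the highest cost. The ts helper is hypothetical; the data is copied from the question.

```python
import pandas as pd

def ts(times):
    # Hypothetical helper: build the question's timestamps from "HH:MM" strings.
    return pd.to_datetime(["2021-01-01 " + t + ":00.100000" for t in times])

df_B = pd.DataFrame({
    "id": [1234, 1564, 7535, 9999],
    "start_time": ts(["16:00", "16:05", "16:40", "16:44"]),
    "end_time": ts(["16:02", "16:10", "16:55", "16:45"]),
    "side": "BUY",
    "cost": [100, 111, 124, 128],
})
df_S = pd.DataFrame({
    "id": [5366, 4533, 4532, 5827],
    "start_time": ts(["16:00", "16:05", "16:20", "16:30"]),
    "end_time": ts(["16:02", "16:08", "16:50", "16:35"]),
    "side": "SELL",
    "cost": [100, 105, 122, 123],
})

# All (buy, sell) pairs that satisfy both conditions.
pairs = df_B.merge(df_S, how="cross", suffixes=("", "_S"))
ok = (pairs["cost_S"] <= pairs["cost"]) & (pairs["start_time_S"] <= pairs["end_time"])
candidates = pairs[ok]

# Assumed tie-break: latest start_time_S first, then highest cost_S.
best = (candidates
        .sort_values(["id", "start_time_S", "cost_S"], ascending=[True, False, False])
        .drop_duplicates("id"))  # keep the top-ranked candidate per buy id
print(best[["id", "cost", "id_S", "start_time_S", "cost_S"]])
```

Under this rule 1564 is matched to 4533, as in the expected output, but 7535 and 9999 are both matched to 5827. So even an explicit ranking does not give a true 1:1 matching unless used sell records are also excluded, which is yet another rule to specify.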

If you want 1:m matching and you have a decent data size, consider SQLite; I find it more appropriate for such problems.
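
A minimal sketch of the 1:m variant with Python's built-in sqlite3 module follows. The table layout and the choice to store timestamps as text are my assumptions; the rows are copied from the question.

```python
import sqlite3

buys = [
    (1234, "2021-01-01 16:00:00.100000", "2021-01-01 16:02:00.100000", "BUY", 100),
    (1564, "2021-01-01 16:05:00.100000", "2021-01-01 16:10:00.100000", "BUY", 111),
    (7535, "2021-01-01 16:40:00.100000", "2021-01-01 16:55:00.100000", "BUY", 124),
    (9999, "2021-01-01 16:44:00.100000", "2021-01-01 16:45:00.100000", "BUY", 128),
]
sells = [
    (5366, "2021-01-01 16:00:00.100000", "2021-01-01 16:02:00.100000", "SELL", 100),
    (4533, "2021-01-01 16:05:00.100000", "2021-01-01 16:08:00.100000", "SELL", 105),
    (4532, "2021-01-01 16:20:00.100000", "2021-01-01 16:50:00.100000", "SELL", 122),
    (5827, "2021-01-01 16:30:00.100000", "2021-01-01 16:35:00.100000", "SELL", 123),
]

con = sqlite3.connect(":memory:")
for name in ("df_B", "df_S"):
    con.execute(
        f"CREATE TABLE {name} "
        "(id INTEGER, start_time TEXT, end_time TEXT, side TEXT, cost INTEGER)"
    )
con.executemany("INSERT INTO df_B VALUES (?, ?, ?, ?, ?)", buys)
con.executemany("INSERT INTO df_S VALUES (?, ?, ?, ?, ?)", sells)

# Timestamps in one fixed ISO-like text format compare correctly as strings.
# LEFT JOIN keeps buys that have no qualifying sell (their s.* columns are NULL).
query = """
SELECT b.id, b.cost, s.id AS id_S, s.cost AS cost_S
FROM df_B AS b
LEFT JOIN df_S AS s
  ON s.cost <= b.cost AND s.start_time <= b.end_time
ORDER BY b.id, s.id
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
con.close()
```

If the data already lives in pandas, DataFrame.to_sql and pd.read_sql_query can move it in and out of the same connection instead of the manual inserts.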