Joining dataframes based on nearest date and returning multiple rows as result-CodePudding

I have contracts, that have valid_from and valid_to dates and orders with order_date. The goal here is to match order to matching contract based on the dates. The problem is, there can be multiple valid contracts per order, so we need to have both of them in a result dataframe. Also, orders with order_date outside contract's validity dates, should not have any contracts attached to id. Sample DFs:

import pandas as pd
df1 = pd.DataFrame({
    'ORDER_ID': [1, 2, 3],
    'ORDER_DATE': [pd.to_datetime('2022-05-02'), pd.to_datetime('2022-05-06'), pd.to_datetime('2022-05-20')]
})
df2 = pd.DataFrame({
    'CONTRACT_ID': [1, 2],
    'CONTRACT_VALID_FROM': [pd.to_datetime('2022-05-01'), pd.to_datetime('2022-05-05')],
    'CONTRACT_VALID_TO': [pd.to_datetime('2022-05-10'), pd.to_datetime('2022-05-15')]
})

Desirable output:

ORDER_ID     ORDER_DATE    CONTRACT_ID  CONTRACT_VALID_FROM   CONTRACT_VALID_TO
1            2022-05-02    1            2022-05-01            2022-05-10
2            2022-05-06    1            2022-05-01            2022-05-10
2            2022-05-06    2            2022-05-05            2022-05-15
3            2022-05-20    NaN          NaN                   NaN

Source dataframes contain around 20M and 1M respectively.

CodePudding user response：

Easier solve with pyjanitor - conda install -c conda-forge pyjanitor

Use janitor's conditional_join -

df1.conditional_join(
        df2, 
        ('ORDER_DATE', 'CONTRACT_VALID_FROM', '>='), 
        ('ORDER_DATE', 'CONTRACT_VALID_TO', '<='),
        how='left'
    )

Output

   ORDER_ID ORDER_DATE  CONTRACT_ID CONTRACT_VALID_FROM CONTRACT_VALID_TO
0         1 2022-05-02          1.0          2022-05-01        2022-05-10
1         2 2022-05-06          1.0          2022-05-01        2022-05-10
2         2 2022-05-06          2.0          2022-05-05        2022-05-15
3         3 2022-05-20          NaN                 NaT               NaT