I have contracts, that have valid_from and valid_to dates and orders with order_date. The goal here is to match order to matching contract based on the dates. The problem is, there can be multiple valid contracts per order, so we need to have both of them in a result dataframe. Also, orders with order_date outside contract's validity dates, should not have any contracts attached to id. Sample DFs:
import pandas as pd
df1 = pd.DataFrame({
'ORDER_ID': [1, 2, 3],
'ORDER_DATE': [pd.to_datetime('2022-05-02'), pd.to_datetime('2022-05-06'), pd.to_datetime('2022-05-20')]
})
df2 = pd.DataFrame({
'CONTRACT_ID': [1, 2],
'CONTRACT_VALID_FROM': [pd.to_datetime('2022-05-01'), pd.to_datetime('2022-05-05')],
'CONTRACT_VALID_TO': [pd.to_datetime('2022-05-10'), pd.to_datetime('2022-05-15')]
})
Desirable output:
ORDER_ID ORDER_DATE CONTRACT_ID CONTRACT_VALID_FROM CONTRACT_VALID_TO
1 2022-05-02 1 2022-05-01 2022-05-10
2 2022-05-06 1 2022-05-01 2022-05-10
2 2022-05-06 2 2022-05-05 2022-05-15
3 2022-05-20 NaN NaN NaN
Source dataframes contain around 20M and 1M respectively.
CodePudding user response:
Easier solve with pyjanitor
- conda install -c conda-forge pyjanitor
Use janitor's conditional_join
-
df1.conditional_join(
df2,
('ORDER_DATE', 'CONTRACT_VALID_FROM', '>='),
('ORDER_DATE', 'CONTRACT_VALID_TO', '<='),
how='left'
)
Output
ORDER_ID ORDER_DATE CONTRACT_ID CONTRACT_VALID_FROM CONTRACT_VALID_TO
0 1 2022-05-02 1.0 2022-05-01 2022-05-10
1 2 2022-05-06 1.0 2022-05-01 2022-05-10
2 2 2022-05-06 2.0 2022-05-05 2022-05-15
3 3 2022-05-20 NaN NaT NaT