How to join two dataframes when only some dates in one dataframe is present between two other dates-CodePudding

I have two dataframes:

df_A = pd.DataFrame({'start_date':['2021-04-01 00:00:00','2021-05-01 00:00:00', '2021-05-02 00:00:00'],'end_date':['2021-04-01 00:11:00','2021-05-01 00:06:00', '2021-05-03 00:00:00'], 'eventname':['birthday', 'wedding', 'birthday'] })
df_B = pd.DataFrame({'event_date':['2021-04-01 00:06:00','2021-05-01 00:03:00', '2021-05-04 00:00:00'],'price':[100,200,500]})
df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)

df_A

    start_date  end_date    eventname
0   2021-04-01  2021-04-01 00:11:00 birthday
1   2021-05-01  2021-05-01 00:06:00 wedding
2   2021-05-02  2021-05-03 00:00:00 birthday

df_B

event_date  price
0   2021-04-01 00:06:00 100
1   2021-05-01 00:03:00 200
2   2021-05-04 00:00:00 500

I want to join them such that event_date column lies in between the start_date and end_date column of df_A. This is my code:

df_A.index = pd.IntervalIndex.from_arrays(df_A['start_date'],df_A['end_date'],closed='both')
df_B['start_date'] = df_B['event_date'].apply(lambda x : df_A.iloc[df_A.index.get_loc(x)]['start_date'])
df_B

However, I get a keyerror as some of values on event_date column of df_B do not lie between any of the start_date and end_date of columns in df_A.

THis is my expected output:

event_date  price   start_date
0   2021-04-01 00:06:00 100 2021-04-01
1   2021-05-01 00:03:00 200 2021-05-01
2   2021-05-04 00:00:00 500 NaN

I tried using intersection but that works only if the values are exactly equal. How do I do this?

CodePudding user response：

If your start_date and end_date do not overlap, create an interval index and merge your two dataframes:

bins = pd.IntervalIndex.from_arrays(df_A['start_date'], 
                                    df_A['end_date'],
                                    closed='both')

out = df_B.assign(interval=pd.cut(df_B['event_date'], bins)) \
          .merge(df_A.assign(interval=bins), on='interval', how='left')

print(out[['event_date', 'price', 'start_date']])

# Output:
           event_date  price start_date
0 2021-04-01 00:06:00    100 2021-04-01
1 2021-05-01 00:03:00    200 2021-05-01
2 2021-05-04 00:00:00    500        NaT

CodePudding user response：

One option is via the conditional_join from pyjanitor :

# pip install pyjanitor
import pandas as pd
import janitor

(df_A.conditional_join(
       df_B,  
       # column from left frame, column from right frame, comparision operator
       ('start_date', 'event_date', '<='), 
       ('end_date', 'event_date', '>='), 
       how = 'right')
   .filter(['event_date', 'price', 'start_date'])
)

           event_date  price start_date
0 2021-04-01 00:06:00    100 2021-04-01
1 2021-05-01 00:03:00    200 2021-05-01
2 2021-05-04 00:00:00    500        NaT