I have two dataframes:
df_A = pd.DataFrame({'start_date':['2021-04-01 00:00:00','2021-05-01 00:00:00', '2021-05-02 00:00:00'],'end_date':['2021-04-01 00:11:00','2021-05-01 00:06:00', '2021-05-03 00:00:00'], 'eventname':['birthday', 'wedding', 'birthday'] })
df_B = pd.DataFrame({'event_date':['2021-04-01 00:06:00','2021-05-01 00:03:00', '2021-05-04 00:00:00'],'price':[100,200,500]})
df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)
df_A
start_date end_date eventname
0 2021-04-01 2021-04-01 00:11:00 birthday
1 2021-05-01 2021-05-01 00:06:00 wedding
2 2021-05-02 2021-05-03 00:00:00 birthday
df_B
event_date price
0 2021-04-01 00:06:00 100
1 2021-05-01 00:03:00 200
2 2021-05-04 00:00:00 500
I want to join them such that event_date
column lies in between the start_date
and end_date
column of df_A. This is my code:
df_A.index = pd.IntervalIndex.from_arrays(df_A['start_date'],df_A['end_date'],closed='both')
df_B['start_date'] = df_B['event_date'].apply(lambda x : df_A.iloc[df_A.index.get_loc(x)]['start_date'])
df_B
However, I get a keyerror as some of values on event_date
column of df_B do not lie between any of the start_date
and end_date
of columns in df_A.
THis is my expected output:
event_date price start_date
0 2021-04-01 00:06:00 100 2021-04-01
1 2021-05-01 00:03:00 200 2021-05-01
2 2021-05-04 00:00:00 500 NaN
I tried using intersection but that works only if the values are exactly equal. How do I do this?
CodePudding user response:
If your start_date
and end_date
do not overlap, create an interval index and merge your two dataframes:
bins = pd.IntervalIndex.from_arrays(df_A['start_date'],
df_A['end_date'],
closed='both')
out = df_B.assign(interval=pd.cut(df_B['event_date'], bins)) \
.merge(df_A.assign(interval=bins), on='interval', how='left')
print(out[['event_date', 'price', 'start_date']])
# Output:
event_date price start_date
0 2021-04-01 00:06:00 100 2021-04-01
1 2021-05-01 00:03:00 200 2021-05-01
2 2021-05-04 00:00:00 500 NaT
CodePudding user response:
One option is via the conditional_join from pyjanitor :
# pip install pyjanitor
import pandas as pd
import janitor
(df_A.conditional_join(
df_B,
# column from left frame, column from right frame, comparision operator
('start_date', 'event_date', '<='),
('end_date', 'event_date', '>='),
how = 'right')
.filter(['event_date', 'price', 'start_date'])
)
event_date price start_date
0 2021-04-01 00:06:00 100 2021-04-01
1 2021-05-01 00:03:00 200 2021-05-01
2 2021-05-04 00:00:00 500 NaT