I have a datafram df_dates
as follows. Note that the number of columns is variable at any given time.
df_dates = pd.DataFrame({'col1': ['2021-01-11', '2021-03-11', '2021-03-11','2021-05-11'],
'col2': ['2021-02-11', '2021-05-11', '2021-02-11','2021-02-11'],
'col3': ['2021-03-11', '2021-06-11', '2021-09-11','2021-06-11'],
'col4': ['2021-04-11', '2021-01-11', '2021-01-11','2021-01-11']
}).apply(pd.to_datetime)
print(df_dates.to_string())
>>
col1 col2 col3 col4
0 2021-01-11 2021-02-11 2021-03-11 2021-04-11
1 2021-03-11 2021-05-11 2021-06-11 2021-01-11
2 2021-03-11 2021-02-11 2021-09-11 2021-01-11
3 2021-05-11 2021-02-11 2021-06-11 2021-01-11
I then have a single column df_incidents
, with the same number of rows as df_dates
, with the row indexes between the two dataframes being the same.
df_incidents = pd.DataFrame({'incident': ['2021-03-29', '2021-03-16', '2021-03-14','2021-03-02']}).apply(pd.to_datetime)
print(df_incidents.to_string())
>>
incident
0 2021-03-29
1 2021-03-16
2 2021-03-14
3 2021-03-02
For each row, I would like the last date from df_dates
that is right before the incident, as well as the first date right after the incident. My expected output for df_incidents
would be:
incident last_date_before_incidennt first_date_after_incident
0 2021-03-29 2021-03-11 2021-04-11
1 2021-03-16 2021-03-11 2021-05-11
2 2021-03-14 2021-03-11 2021-09-11
3 2021-03-02 2021-02-11 2021-05-11
How can I achieve this efficiently? Thank you.
CodePudding user response:
One option is to stack df_dates
, get the greater than and less than dataframes with pd.merge_asof, and finally merge the two dataframes:
temp = df_dates.stack().droplevel(1)
Get before and after dates:
before = pd.merge_asof(df_incidents.sort_values('incident').reset_index(),
temp.rename('last_date_before_incident').sort_values().reset_index(),
by='index',
left_on='incident',
right_on='last_date_before_incident',
direction='backward')
after = pd.merge_asof(df_incidents.sort_values('incident').reset_index(),
temp.rename('last_date_after_incident').sort_values().reset_index(),
by='index',
left_on='incident',
right_on='last_date_after_incident',
direction='forward')
Merge both dataframes to get final output:
(pd.merge(before,
after,
on=['index', 'incident'])
.sort_values('index', ignore_index = True)
.drop(columns='index')
)
incident last_date_before_incident last_date_after_incident
0 2021-03-29 2021-03-11 2021-04-11
1 2021-03-16 2021-03-11 2021-05-11
2 2021-03-14 2021-03-11 2021-09-11
3 2021-03-02 2021-02-11 2021-05-11
CodePudding user response:
Without loops, you can compute the difference between incident
columns from df_incidents
and all columns from df_dates
. Next, find the min and the max for each rows (before
and after
variables). You obtain a list of tuples that matches to the stacked version of your df_dates
dataframe:
diff = df_dates.sub(df_incidents['incident'].values, axis=1)
before = pd.Index(diff.mask(diff >= pd.Timedelta(0)).idxmax(axis=1).reset_index())
after = pd.Index(diff.mask(diff < pd.Timedelta(0)).idxmin(axis=1).reset_index())
df_incidents['last_date_before_incident'] = df_dates.stack().loc[before].values
df_incidents['last_date_after_incident'] = df_dates.stack().loc[after].values
Output:
>>> df_incidents
incident last_date_before_incident last_date_after_incident
0 2021-03-29 2021-03-11 2021-04-11
1 2021-03-16 2021-03-11 2021-05-11
2 2021-03-14 2021-03-11 2021-09-11
3 2021-03-02 2021-02-11 2021-05-11
>>> diff
col1 col2 col3 col4
0 -77 days -33 days -3 days 40 days
1 -18 days 56 days 89 days -50 days
2 -18 days -33 days 181 days -50 days
3 43 days -33 days 89 days -50 days
>>> before
Index([(0, 'col3'), (1, 'col1'), (2, 'col1'), (3, 'col2')], dtype='object')
>>> after
Index([(0, 'col4'), (1, 'col2'), (2, 'col3'), (3, 'col1')], dtype='object')