Get the last date before an incident, as well as the first date after the incident (Python)-CodePudding

I have a datafram df_dates as follows. Note that the number of columns is variable at any given time.

df_dates = pd.DataFrame({'col1': ['2021-01-11', '2021-03-11', '2021-03-11','2021-05-11'],
               'col2': ['2021-02-11', '2021-05-11', '2021-02-11','2021-02-11'],
               'col3': ['2021-03-11', '2021-06-11', '2021-09-11','2021-06-11'],
               'col4': ['2021-04-11', '2021-01-11', '2021-01-11','2021-01-11']
               }).apply(pd.to_datetime)

print(df_dates.to_string())

>>
        col1       col2       col3       col4
0 2021-01-11 2021-02-11 2021-03-11 2021-04-11
1 2021-03-11 2021-05-11 2021-06-11 2021-01-11
2 2021-03-11 2021-02-11 2021-09-11 2021-01-11
3 2021-05-11 2021-02-11 2021-06-11 2021-01-11

I then have a single column df_incidents, with the same number of rows as df_dates, with the row indexes between the two dataframes being the same.

df_incidents = pd.DataFrame({'incident': ['2021-03-29', '2021-03-16', '2021-03-14','2021-03-02']}).apply(pd.to_datetime)

print(df_incidents.to_string())

>>
    incident
0 2021-03-29
1 2021-03-16
2 2021-03-14
3 2021-03-02

For each row, I would like the last date from df_dates that is right before the incident, as well as the first date right after the incident. My expected output for df_incidents would be:

    incident last_date_before_incidennt first_date_after_incident
0 2021-03-29                 2021-03-11                2021-04-11
1 2021-03-16                 2021-03-11                2021-05-11
2 2021-03-14                 2021-03-11                2021-09-11
3 2021-03-02                 2021-02-11                2021-05-11

How can I achieve this efficiently? Thank you.

CodePudding user response：

One option is to stack df_dates, get the greater than and less than dataframes with pd.merge_asof, and finally merge the two dataframes:

temp = df_dates.stack().droplevel(1)

Get before and after dates:

before = pd.merge_asof(df_incidents.sort_values('incident').reset_index(), 
                       temp.rename('last_date_before_incident').sort_values().reset_index(), 
                       by='index', 
                       left_on='incident', 
                       right_on='last_date_before_incident',
                       direction='backward')

after = pd.merge_asof(df_incidents.sort_values('incident').reset_index(), 
                       temp.rename('last_date_after_incident').sort_values().reset_index(), 
                       by='index', 
                       left_on='incident', 
                       right_on='last_date_after_incident',
                       direction='forward')

Merge both dataframes to get final output:

(pd.merge(before, 
         after, 
         on=['index', 'incident'])
 .sort_values('index', ignore_index = True)
 .drop(columns='index')
)

    incident last_date_before_incident last_date_after_incident
0 2021-03-29                2021-03-11               2021-04-11
1 2021-03-16                2021-03-11               2021-05-11
2 2021-03-14                2021-03-11               2021-09-11
3 2021-03-02                2021-02-11               2021-05-11

CodePudding user response：

Without loops, you can compute the difference between incident columns from df_incidents and all columns from df_dates. Next, find the min and the max for each rows (before and after variables). You obtain a list of tuples that matches to the stacked version of your df_dates dataframe:

diff = df_dates.sub(df_incidents['incident'].values, axis=1)

before = pd.Index(diff.mask(diff >= pd.Timedelta(0)).idxmax(axis=1).reset_index())
after = pd.Index(diff.mask(diff < pd.Timedelta(0)).idxmin(axis=1).reset_index())

df_incidents['last_date_before_incident'] = df_dates.stack().loc[before].values
df_incidents['last_date_after_incident'] = df_dates.stack().loc[after].values

Output:

>>> df_incidents
    incident last_date_before_incident last_date_after_incident
0 2021-03-29                2021-03-11               2021-04-11
1 2021-03-16                2021-03-11               2021-05-11
2 2021-03-14                2021-03-11               2021-09-11
3 2021-03-02                2021-02-11               2021-05-11

>>> diff
      col1     col2     col3     col4
0 -77 days -33 days  -3 days  40 days
1 -18 days  56 days  89 days -50 days
2 -18 days -33 days 181 days -50 days
3  43 days -33 days  89 days -50 days

>>> before
Index([(0, 'col3'), (1, 'col1'), (2, 'col1'), (3, 'col2')], dtype='object')

>>> after
Index([(0, 'col4'), (1, 'col2'), (2, 'col3'), (3, 'col1')], dtype='object')