Home > Back-end >  Get the last date before an incident, as well as the first date after the incident (Python)
Get the last date before an incident, as well as the first date after the incident (Python)

Time:12-31

I have a datafram df_dates as follows. Note that the number of columns is variable at any given time.

df_dates = pd.DataFrame({'col1': ['2021-01-11', '2021-03-11', '2021-03-11','2021-05-11'],
               'col2': ['2021-02-11', '2021-05-11', '2021-02-11','2021-02-11'],
               'col3': ['2021-03-11', '2021-06-11', '2021-09-11','2021-06-11'],
               'col4': ['2021-04-11', '2021-01-11', '2021-01-11','2021-01-11']
               }).apply(pd.to_datetime)

print(df_dates.to_string())

>>
        col1       col2       col3       col4
0 2021-01-11 2021-02-11 2021-03-11 2021-04-11
1 2021-03-11 2021-05-11 2021-06-11 2021-01-11
2 2021-03-11 2021-02-11 2021-09-11 2021-01-11
3 2021-05-11 2021-02-11 2021-06-11 2021-01-11

I then have a single column df_incidents, with the same number of rows as df_dates, with the row indexes between the two dataframes being the same.

df_incidents = pd.DataFrame({'incident': ['2021-03-29', '2021-03-16', '2021-03-14','2021-03-02']}).apply(pd.to_datetime)

print(df_incidents.to_string())

>>
    incident
0 2021-03-29
1 2021-03-16
2 2021-03-14
3 2021-03-02

For each row, I would like the last date from df_dates that is right before the incident, as well as the first date right after the incident. My expected output for df_incidents would be:

    incident last_date_before_incidennt first_date_after_incident
0 2021-03-29                 2021-03-11                2021-04-11
1 2021-03-16                 2021-03-11                2021-05-11
2 2021-03-14                 2021-03-11                2021-09-11
3 2021-03-02                 2021-02-11                2021-05-11

How can I achieve this efficiently? Thank you.

CodePudding user response:

One option is to stack df_dates, get the greater than and less than dataframes with pd.merge_asof, and finally merge the two dataframes:

temp = df_dates.stack().droplevel(1)

Get before and after dates:

before = pd.merge_asof(df_incidents.sort_values('incident').reset_index(), 
                       temp.rename('last_date_before_incident').sort_values().reset_index(), 
                       by='index', 
                       left_on='incident', 
                       right_on='last_date_before_incident',
                       direction='backward')

after = pd.merge_asof(df_incidents.sort_values('incident').reset_index(), 
                       temp.rename('last_date_after_incident').sort_values().reset_index(), 
                       by='index', 
                       left_on='incident', 
                       right_on='last_date_after_incident',
                       direction='forward')

Merge both dataframes to get final output:

(pd.merge(before, 
         after, 
         on=['index', 'incident'])
 .sort_values('index', ignore_index = True)
 .drop(columns='index')
)

    incident last_date_before_incident last_date_after_incident
0 2021-03-29                2021-03-11               2021-04-11
1 2021-03-16                2021-03-11               2021-05-11
2 2021-03-14                2021-03-11               2021-09-11
3 2021-03-02                2021-02-11               2021-05-11

CodePudding user response:

Without loops, you can compute the difference between incident columns from df_incidents and all columns from df_dates. Next, find the min and the max for each rows (before and after variables). You obtain a list of tuples that matches to the stacked version of your df_dates dataframe:

diff = df_dates.sub(df_incidents['incident'].values, axis=1)

before = pd.Index(diff.mask(diff >= pd.Timedelta(0)).idxmax(axis=1).reset_index())
after = pd.Index(diff.mask(diff < pd.Timedelta(0)).idxmin(axis=1).reset_index())

df_incidents['last_date_before_incident'] = df_dates.stack().loc[before].values
df_incidents['last_date_after_incident'] = df_dates.stack().loc[after].values

Output:

>>> df_incidents
    incident last_date_before_incident last_date_after_incident
0 2021-03-29                2021-03-11               2021-04-11
1 2021-03-16                2021-03-11               2021-05-11
2 2021-03-14                2021-03-11               2021-09-11
3 2021-03-02                2021-02-11               2021-05-11

>>> diff
      col1     col2     col3     col4
0 -77 days -33 days  -3 days  40 days
1 -18 days  56 days  89 days -50 days
2 -18 days -33 days 181 days -50 days
3  43 days -33 days  89 days -50 days

>>> before
Index([(0, 'col3'), (1, 'col1'), (2, 'col1'), (3, 'col2')], dtype='object')

>>> after
Index([(0, 'col4'), (1, 'col2'), (2, 'col3'), (3, 'col1')], dtype='object')
  • Related