Find the closest date from a list to a given date that is not after the given date-CodePudding

I have a dataframe for weekly training sessions and a data frame for evaluations submitted by attendees at those training sessions.

Each dataframe has a date column - for sessions, it is the date the session occurred. For evaluations, it is the date the evaluation was submitted. Attendees can be expected to attend multiple sessions and will therefore have submitted multiple evaluations.

I need to tie each evaluation back to a specific session. They may have submitted an evaluation on the same day as a session, in which case the match is easy. But they are able to submit an evaluation on any day up to the next training session.

For each date in the evaluation df, I need to return the session date that is closest to the evaluation date but not after the evaluation date.

example session dates: 2/3/22, 2/10/22, 2/17/22

example evaluation dates with desired output: 2/3/22 (should match 2/3/22), 2/4/22 (should match 2/3/22), 2/11/22 (should match 2/10/22)

CodePudding user response：

Here's a way to do it.

In the sessions dataframe, set date column to be the index:

sessions = sessions.set_index('date')

Sort sessions by index (that is, by date):

sessions = sessions.loc[sessions.index.sort_values()]

Add a session_evaluated column to evaluations which will contain the date of the session that the evaluation applies to. We calculate this by first calling sessions.index.get_indexer() on the date column of evaluations with the method argument set to 'pad' so we "round down" on non-matching dates, and then doing a lookup on these integer index values in the sessions index (which contains the session dates):

evaluations['session_evaluated'] = pd.Series([sessions.index.to_list()[i] 
    for i in sessions.index.get_indexer(evaluations['date'], method='pad')])

Here's what it looks like all put together with sample inputs:

import pandas as pd
sessions = pd.DataFrame({
    'date' : ['2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01', '2022-01-01'],
    'topic' : ['Easy 1', 'Easy 2', 'Intermediate', 'Advanced', 'Intro']
})
evaluations = pd.DataFrame({
    'date' : [
        '2022-01-05', '2022-01-10', '2022-01-15', '2022-01-20', '2022-01-25', 
        '2022-02-01', '2022-02-05', '2022-02-28',
        '2022-03-01', '2022-03-15', '2022-03-31',
        '2022-04-01', '2022-04-15'
    ],
    'rating' : [9,8,7,8,9,5,4,3,10,10,10,2,4]
})
sessions['date'] = pd.to_datetime(sessions['date'])
evaluations['date'] = pd.to_datetime(evaluations['date'])
sessions = sessions.set_index('date')
sessions = sessions.loc[sessions.index.sort_values()]
print(sessions)
print(evaluations)
evaluations['session_evaluated'] = pd.Series([sessions.index.to_list()[i]
    for i in sessions.index.get_indexer(evaluations['date'], method='pad')])
print(evaluations)

Results:

                   topic
date
2022-01-01         Intro
2022-02-01        Easy 1
2022-03-01        Easy 2
2022-04-01  Intermediate
2022-05-01      Advanced
         date  rating
0  2022-01-05       9
1  2022-01-10       8
2  2022-01-15       7
3  2022-01-20       8
4  2022-01-25       9
5  2022-02-01       5
6  2022-02-05       4
7  2022-02-28       3
8  2022-03-01      10
9  2022-03-15      10
10 2022-03-31      10
11 2022-04-01       2
12 2022-04-15       4
         date  rating session_evaluated
0  2022-01-05       9        2022-01-01
1  2022-01-10       8        2022-01-01
2  2022-01-15       7        2022-01-01
3  2022-01-20       8        2022-01-01
4  2022-01-25       9        2022-01-01
5  2022-02-01       5        2022-02-01
6  2022-02-05       4        2022-02-01
7  2022-02-28       3        2022-02-01
8  2022-03-01      10        2022-03-01
9  2022-03-15      10        2022-03-01
10 2022-03-31      10        2022-03-01
11 2022-04-01       2        2022-04-01
12 2022-04-15       4        2022-04-01

UPDATED:

Here's another way to do it using the merge_asof() function. It doesn't require the date column to be the index (though it does require that both dataframe arguments be sorted by date):

sessions['date'] = pd.to_datetime(sessions['date'])
evaluations['date'] = pd.to_datetime(evaluations['date'])
evaluations = pd.merge_asof(
    evaluations.sort_values(by=['date']), 
    sessions.sort_values(by=['date'])['date'].to_frame().assign(session_evaluated=sessions['date']), 
    on='date')
print(evaluations)

Output:

         date  rating session_evaluated
0  2022-01-05       9        2022-01-01
1  2022-01-10       8        2022-01-01
2  2022-01-15       7        2022-01-01
3  2022-01-20       8        2022-01-01
4  2022-01-25       9        2022-01-01
5  2022-02-01       5        2022-02-01
6  2022-02-05       4        2022-02-01
7  2022-02-28       3        2022-02-01
8  2022-03-01      10        2022-03-01
9  2022-03-15      10        2022-03-01
10 2022-03-31      10        2022-03-01
11 2022-04-01       2        2022-04-01
12 2022-04-15       4        2022-04-01

UPDATE #2: The call to assign() in the above code can also be written using **kwargs syntax, in case we want to use a column name with spaces or that otherwise is not a valid python identifier (instead of session_evaluated). For example:

evaluations = pd.merge_asof(
    evaluations.sort_values(by=['date']), 
    sessions.sort_values(by=['date'])['date'].to_frame()
        .assign(**{'Evaluated Session (Date)' : lambda x: sessions['date']}), 
    on='date')

Output:

         date  rating Evaluated Session (Date)
0  2022-01-05       9               2022-01-01
1  2022-01-10       8               2022-01-01
2  2022-01-15       7               2022-01-01
3  2022-01-20       8               2022-01-01
4  2022-01-25       9               2022-01-01
5  2022-02-01       5               2022-02-01
6  2022-02-05       4               2022-02-01
7  2022-02-28       3               2022-02-01
8  2022-03-01      10               2022-03-01
9  2022-03-15      10               2022-03-01
10 2022-03-31      10               2022-03-01
11 2022-04-01       2               2022-04-01
12 2022-04-15       4               2022-04-01