I have two dataframes.

Dataframe allocatedTimes collects the allocated time (for a given activity) for a person identified by an ID:

ID: id of the person
startTime: allocated slot start datetime
endTime: allocated slot end datetime
weekday: calculated corresponding weekday

Dataframe events collects the events associated with each person/ID:

ID: id of the person
eventStartTime: event start datetime
eventEndTime: event end datetime
weekday: calculated corresponding weekday
What I want: for each ID, count the number of events that fall within each allocated time slot.
The problem: each allocated time slot is recurring, i.e. each person sets aside a few hours on given days of the week, and the allocatedTimes dataframe stores every occurrence. Say personA has allocated five hours every Monday from 8am to 1pm and set it as recurring for a year. The allocatedTimes dataframe would then contain every Monday datetime from Oct 2022 to Oct 2023, like so:
ID startTime endTime weekday
personA 2022-10-24 08:00:00 2022-10-24 13:00:00 Monday
personA 2022-10-31 08:00:00 2022-10-31 13:00:00 Monday
personA 2022-11-07 08:00:00 2022-11-07 13:00:00 Monday
personA ... ... ...
and in the events dataframe we can find events like:
ID eventStartTime eventEndTime weekday
personA 2022-10-24 08:15:00 2022-10-24 08:30:00 Monday
personA 2022-10-24 09:40:00 2022-10-24 10:30:00 Monday
personA 2022-10-31 09:15:00 2022-10-31 09:30:00 Monday
personA 2022-10-31 10:15:00 2022-10-31 10:30:00 Monday
personA 2022-10-31 11:15:00 2022-10-31 11:30:00 Monday
personA 2022-11-07 11:00:00 2022-11-07 12:00:00 Monday
personA 2022-11-07 12:15:00 2022-11-07 12:40:00 Monday
personA ... ... ...
The result I want is something like this:
ID startTime endTime weekday nrEvents
personA 2022-10-24 08:00:00 2022-10-24 13:00:00 Monday 2
personA 2022-10-31 08:00:00 2022-10-31 13:00:00 Monday 3
personA 2022-11-07 08:00:00 2022-11-07 13:00:00 Monday 2
personB ... ... ... ...
My very straightforward attempt:
import pandas as pd

rows = []
for id_ in allocatedTimes.ID.unique():  # visit each person once (also avoids shadowing the builtin id)
    res = allocatedTimes[allocatedTimes.ID == id_]
    for index, row in res.iterrows():
        # this person's events fully contained in this allocated slot
        mask = ((events.ID == id_)
                & (events.eventStartTime >= row.startTime)
                & (events.eventEndTime <= row.endTime))
        rows.append({
            'ID': id_,
            'startTime': row.startTime,
            'endTime': row.endTime,
            'weekday': row.weekday,
            'nrEvents': len(events.loc[mask]),
        })
df = pd.DataFrame(rows)
But this takes a long time to compute even on small subsets of those dataframes, so I'm looking for a more efficient way to produce the same result. Any help or suggestion is appreciated.
CodePudding user response:
First, create a start_date column in both DataFrames, then join them on ID and start_date.
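For example, a minimal sketch assuming the time columns are already datetime64 (if they are strings, convert them with pd.to_datetime first):

# calendar date of each slot/event, used as the join key alongside ID
allocatedTimes['start_date'] = allocatedTimes['startTime'].dt.date
events['start_date'] = events['eventStartTime'].dt.date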
res_df = pd.merge(allocatedTimes, events, how='left', on=['ID', 'start_date'])
Now filter that DataFrame to keep only the rows where the event lies entirely inside the allocated time window.
alloc_events = res_df.loc[(res_df.startTime <= res_df.eventStartTime) & (res_df.endTime >= res_df.eventEndTime)]
The rest is now a groupby on ID, startTime and endTime.
final_df = alloc_events.groupby(['ID', 'startTime', 'endTime'], as_index=False).size()
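With as_index=False, .size() returns a regular DataFrame whose count column is named size; to match the wanted output it can be renamed (nrEvents is the column name from the question):

final_df = final_df.rename(columns={'size': 'nrEvents'})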
I know it's not iterative, but it produces the same end state. Apologies if iterating was an essential part of the question. Also, if it only matters whether an event started inside the allocated time window (rather than being fully contained in it), only the filtering step needs to be adjusted. Instead of size, some kind of sum aggregation or count might also work, but size should be the easiest.
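Putting it all together, a minimal runnable sketch (the sample data and the nrEvents rename are my own assumptions for illustration; weekday is included in the groupby keys so it survives into the result):

import pandas as pd

# toy data mirroring the question, for illustration only
allocatedTimes = pd.DataFrame({
    'ID': ['personA'] * 2,
    'startTime': pd.to_datetime(['2022-10-24 08:00', '2022-10-31 08:00']),
    'endTime': pd.to_datetime(['2022-10-24 13:00', '2022-10-31 13:00']),
    'weekday': ['Monday', 'Monday'],
})
events = pd.DataFrame({
    'ID': ['personA'] * 3,
    'eventStartTime': pd.to_datetime(['2022-10-24 08:15', '2022-10-24 09:40', '2022-10-31 09:15']),
    'eventEndTime': pd.to_datetime(['2022-10-24 08:30', '2022-10-24 10:30', '2022-10-31 09:30']),
    'weekday': ['Monday', 'Monday', 'Monday'],
})

# 1. join key: the calendar date of each slot/event
allocatedTimes['start_date'] = allocatedTimes['startTime'].dt.date
events['start_date'] = events['eventStartTime'].dt.date

# 2. one row per (slot, candidate event) pair; drop the events' weekday
#    beforehand to avoid duplicate-column suffixes after the merge
res_df = allocatedTimes.merge(events.drop(columns='weekday'),
                              how='left', on=['ID', 'start_date'])

# 3. keep only events fully contained in their slot; note that slots with
#    no matching events have NaT event times and are dropped here
alloc_events = res_df.loc[(res_df.startTime <= res_df.eventStartTime)
                          & (res_df.endTime >= res_df.eventEndTime)]

# 4. count events per slot; including weekday in the keys keeps it in the output
final_df = (alloc_events
            .groupby(['ID', 'startTime', 'endTime', 'weekday'], as_index=False)
            .size()
            .rename(columns={'size': 'nrEvents'}))
print(final_df)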