I have two dataframes.

Dataframe allocatedTimes collects the allocated time (for a given activity) for a person identified by an ID:

ID: id of the person
startTime: allocated slot start datetime
endTime: allocated slot end datetime
weekday: calculated corresponding weekday

Dataframe events collects the events associated with each person/ID:

ID: id of the person
eventStartTime: event start datetime
eventEndTime: event end datetime
weekday: calculated corresponding weekday
What I want: for each ID, count the number of events that fall within each allocated time slot.
The problem: each allocated time slot is recurring, i.e. each person sets aside a few hours on given days of the week, and the allocatedTimes dataframe stores every occurrence. Say personA has allocated five hours every Monday from 8am to 1pm and set it as recurring for a year. The allocatedTimes dataframe would then contain every Monday datetime from Oct 2022 to Oct 2023, like so:
ID startTime endTime weekday
personA 2022-10-24 08:00:00 2022-10-24 13:00:00 Monday
personA 2022-10-31 08:00:00 2022-10-31 13:00:00 Monday
personA 2022-11-07 08:00:00 2022-11-07 13:00:00 Monday
personA ... ... ...
and in the events dataframe we can find events like:
ID eventStartTime eventEndTime weekday
personA 2022-10-24 08:15:00 2022-10-24 08:30:00 Monday
personA 2022-10-24 09:40:00 2022-10-24 10:30:00 Monday
personA 2022-10-31 09:15:00 2022-10-31 09:30:00 Monday
personA 2022-10-31 10:15:00 2022-10-31 10:30:00 Monday
personA 2022-10-31 11:15:00 2022-10-31 11:30:00 Monday
personA 2022-11-07 11:00:00 2022-11-07 12:00:00 Monday
personA 2022-11-07 12:15:00 2022-11-07 12:40:00 Monday
personA ... ... ...
The result I want is something like this:
ID startTime endTime weekday nrEvents
personA 2022-10-24 08:00:00 2022-10-24 13:00:00 Monday 2
personA 2022-10-31 08:00:00 2022-10-31 13:00:00 Monday 3
personA 2022-11-07 08:00:00 2022-11-07 13:00:00 Monday 2
personB ... ... ... ...
My very straightforward attempt:
import pandas as pd

rows = []
for id_ in allocatedTimes.ID.unique():  # visit each person once (also avoids shadowing the builtin id)
    res = allocatedTimes[allocatedTimes.ID == id_]
    for index, row in res.iterrows():
        # this person's events fully contained in this allocated slot
        mask = ((events.ID == id_)
                & (events.eventStartTime >= row.startTime)
                & (events.eventEndTime <= row.endTime))
        rows.append({
            'ID': id_,
            'startTime': row.startTime,
            'endTime': row.endTime,
            'weekday': row.weekday,
            'nrEvents': len(events.loc[mask]),
        })
df = pd.DataFrame(rows)
But this takes a long time to compute even on small subsets of those dataframes, so I'm looking for a more efficient way to produce the same result. Any help or suggestion is appreciated.
CodePudding user response:
First, create a start_date column in both DataFrames, then join them on ID and start_date.
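For example, a minimal sketch assuming the time columns are already datetime64 (if they are strings, convert them with pd.to_datetime first):

# calendar date of each slot/event, used as the join key alongside ID
allocatedTimes['start_date'] = allocatedTimes['startTime'].dt.date
events['start_date'] = events['eventStartTime'].dt.date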
res_df = pd.merge(allocatedTimes, events, how='left', on=['ID', 'start_date'])
Now filter that DataFrame to keep only the rows where the event lies entirely inside the allocated time window.
alloc_events = res_df.loc[(res_df.startTime <= res_df.eventStartTime) & (res_df.endTime >= res_df.eventEndTime)]
The rest is now a groupby on ID, startTime and endTime.
final_df = alloc_events.groupby(['ID', 'startTime', 'endTime'], as_index=False).size()
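With as_index=False, .size() returns a regular DataFrame whose count column is named size; to match the wanted output it can be renamed (nrEvents is the column name from the question):

final_df = final_df.rename(columns={'size': 'nrEvents'})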
I know it's not iterative, but it produces the same end state. Apologies if iterating was an essential part of the question. Also, if it only matters whether an event started inside the allocated time window (rather than being fully contained in it), only the filtering step needs to be adjusted. Instead of size, some kind of sum aggregation or count might also work, but size should be the easiest.
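Putting it all together, a minimal runnable sketch (the sample data and the nrEvents rename are my own assumptions for illustration; weekday is included in the groupby keys so it survives into the result):

import pandas as pd

# toy data mirroring the question, for illustration only
allocatedTimes = pd.DataFrame({
    'ID': ['personA'] * 2,
    'startTime': pd.to_datetime(['2022-10-24 08:00', '2022-10-31 08:00']),
    'endTime': pd.to_datetime(['2022-10-24 13:00', '2022-10-31 13:00']),
    'weekday': ['Monday', 'Monday'],
})
events = pd.DataFrame({
    'ID': ['personA'] * 3,
    'eventStartTime': pd.to_datetime(['2022-10-24 08:15', '2022-10-24 09:40', '2022-10-31 09:15']),
    'eventEndTime': pd.to_datetime(['2022-10-24 08:30', '2022-10-24 10:30', '2022-10-31 09:30']),
    'weekday': ['Monday', 'Monday', 'Monday'],
})

# 1. join key: the calendar date of each slot/event
allocatedTimes['start_date'] = allocatedTimes['startTime'].dt.date
events['start_date'] = events['eventStartTime'].dt.date

# 2. one row per (slot, candidate event) pair; drop the events' weekday
#    beforehand to avoid duplicate-column suffixes after the merge
res_df = allocatedTimes.merge(events.drop(columns='weekday'),
                              how='left', on=['ID', 'start_date'])

# 3. keep only events fully contained in their slot; note that slots with
#    no matching events have NaT event times and are dropped here
alloc_events = res_df.loc[(res_df.startTime <= res_df.eventStartTime)
                          & (res_df.endTime >= res_df.eventEndTime)]

# 4. count events per slot; including weekday in the keys keeps it in the output
final_df = (alloc_events
            .groupby(['ID', 'startTime', 'endTime', 'weekday'], as_index=False)
            .size()
            .rename(columns={'size': 'nrEvents'}))
print(final_df)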