I have the following dataset:
date                 event  next_event           duration_Minutes
2021-09-09 22:30:00  1      2021-09-09 23:00:00  30
2021-09-09 23:00:00  2      2021-09-09 23:10:00  10
2021-09-09 23:10:00  1      2021-09-09 23:50:00  40
2021-09-09 23:50:00  4      2021-09-10 00:50:00  60
2021-09-10 00:50:00  4      2021-09-12 00:50:00  2880
The main problem is that I would like to split multi-day events into separate events, one per day. For example, I would like to have the event duration from 2021-09-09 23:50:00 until 2021-09-10 00:00:00, and then the duration from 2021-09-10 00:00:00 to 2021-09-10 00:50:00, and so on. This would be useful because afterwards I need to group the events by day and calculate the duration of each event per day, so I would like to fix the situations in which the day changes between events.
I would like to obtain something like this:
date                 event  next_event           duration_Minutes
2021-09-09 22:30:00  1      2021-09-09 23:00:00  30
2021-09-09 23:00:00  2      2021-09-09 23:10:00  10
2021-09-09 23:10:00  1      2021-09-09 23:50:00  40
2021-09-09 23:50:00  4      2021-09-10 00:00:00  10
2021-09-10 00:00:00  4      2021-09-10 00:50:00  50
2021-09-10 00:50:00  4      2021-09-11 00:00:00  1390
2021-09-11 00:00:00  4      2021-09-12 00:00:00  1440
2021-09-12 00:00:00  4      2021-09-12 00:50:00  50
It should be able to handle situations in which there is no event for an entire day or more, as in the example.
My current solution is:
import numpy as np
import pandas as pd

first_record_hour_ts = df.index.floor('H')[0]
last_record_hour_ts = df.index.floor('H')[-1]
# Create a series from the first to the last date containing NaN
df_to_join = pd.Series(np.nan, index=pd.date_range(first_record_hour_ts, last_record_hour_ts, freq='H'))
df_to_join = pd.DataFrame(df_to_join)
# Concatenate with current status dataframe
df = pd.concat([df, df_to_join[~df_to_join.index.isin(df.index)]]).sort_index()
# Forward fill the NaNs
df.ffill(inplace=True)
# Shift the index values (not the index itself, which has no freq) to get the next timestamp
df['next_event'] = df.index.to_series().shift(-1)
# Calculate the delta between the 2 statuses
df['duration'] = df['next_event'] - df.index
# Convert into minutes
df['duration_Minutes'] = df['duration'].dt.total_seconds() // 60
This doesn't exactly solve the problem, but I think it may achieve my goal, which is to be able to group by event and by day at the end.
CodePudding user response:
OK, the code below looks a bit long, and there's certainly a better, more efficient, or shorter way of doing this, but I think it's reasonably simple to follow.
split_datetime_span_by_day below takes two dates: start_date and end_date. In your case, these would be date and next_event from your source data.
The function then checks whether that time period (start -> end) spans over midnight. If it doesn't, it returns the start date, the end date, and the time period in seconds. If it does span over midnight, it creates a new segment (start -> midnight), and then calls itself again (i.e. recurses), and the process continues until the time period does not span over midnight.
Just a note: the returned segment list is made up of tuples of (start, end, nmb_seconds). I'm returning the number of seconds, not the number of minutes as in your question, because I didn't know how you wanted to round the seconds (up, down, etc.). That's left as an exercise for the reader :-)
from datetime import datetime, timedelta


def split_datetime_span_by_day(start_date, end_date, split_segments=None):
    assert start_date < end_date  # sanity check

    # when is the next midnight after start_date?
    # adapted from https://ispycode.com/Blog/python/2016-07/Get-Midnight-Today
    start_next_midnight = datetime.combine(start_date, datetime.min.time()) + timedelta(days=1)

    if split_segments is None:
        split_segments = []

    if end_date < start_next_midnight:
        # end date is before next midnight, no split necessary
        return split_segments + [(
            start_date,
            end_date,
            (end_date - start_date).total_seconds()
        )]

    # otherwise, split at next midnight...
    split_segments += [(
        start_date,
        start_next_midnight,
        (start_next_midnight - start_date).total_seconds()
    )]

    if (end_date - start_next_midnight).total_seconds() > 0:
        # ...and recurse to get next segment
        return split_datetime_span_by_day(
            start_date=start_next_midnight,
            end_date=end_date,
            split_segments=split_segments
        )
    else:
        # case where start_next_midnight == end_date i.e. end_date is midnight
        # don't split & create a 0 second segment
        return split_segments

# test case:
start_date = datetime.strptime('2021-09-12 00:00:00', '%Y-%m-%d %H:%M:%S')
end_date = datetime.strptime('2021-09-14 01:00:00', '%Y-%m-%d %H:%M:%S')
print(split_datetime_span_by_day(start_date=start_date, end_date=end_date))
# returned values:
# [
# (datetime.datetime(2021, 9, 12, 0, 0), datetime.datetime(2021, 9, 13, 0, 0), 86400.0),
# (datetime.datetime(2021, 9, 13, 0, 0), datetime.datetime(2021, 9, 14, 0, 0), 86400.0),
# (datetime.datetime(2021, 9, 14, 0, 0), datetime.datetime(2021, 9, 14, 1, 0), 3600.0)
# ]
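
To tie this back to the question, here is a rough sketch of how the same midnight-splitting idea could be applied to the sample rows and then summed per day. It is only an illustration under a few assumptions: split_by_day is a hypothetical iterative variant of the function above (not the function itself), the rows are typed in by hand from the question rather than read from a DataFrame, and seconds are rounded down to whole minutes.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (date, event, next_event) rows copied from the question's sample data.
rows = [
    ("2021-09-09 22:30:00", 1, "2021-09-09 23:00:00"),
    ("2021-09-09 23:00:00", 2, "2021-09-09 23:10:00"),
    ("2021-09-09 23:10:00", 1, "2021-09-09 23:50:00"),
    ("2021-09-09 23:50:00", 4, "2021-09-10 00:50:00"),
    ("2021-09-10 00:50:00", 4, "2021-09-12 00:50:00"),
]


def split_by_day(start, end):
    """Hypothetical iterative variant: cut the span start -> end at every midnight."""
    segments = []
    while start < end:
        next_midnight = datetime.combine(start.date(), datetime.min.time()) + timedelta(days=1)
        segment_end = min(end, next_midnight)
        segments.append((start, segment_end))
        start = segment_end
    return segments


split_rows = []                      # (date, event, next_event, duration_Minutes)
minutes_per_day = defaultdict(int)   # (day, event) -> total minutes on that day
for date_str, event, next_str in rows:
    start = datetime.strptime(date_str, FMT)
    end = datetime.strptime(next_str, FMT)
    for seg_start, seg_end in split_by_day(start, end):
        # Assumption: round down to whole minutes.
        minutes = int((seg_end - seg_start).total_seconds() // 60)
        split_rows.append((seg_start, event, seg_end, minutes))
        minutes_per_day[(seg_start.date(), event)] += minutes

for row in split_rows:
    print(*row)
for (day, event), minutes in sorted(minutes_per_day.items()):
    print(day, event, minutes)
```

If the data lives in a DataFrame, the rows could be fed in with df.itertuples() instead of the hard-coded list, and the split rows turned back into a DataFrame for a groupby. The loop also avoids deep recursion for events that span many days.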