I have a table like this, which contains the start time and end time of some process.
start_time | end_time |
---|---|
2019-07-01 11:25:00 | 2019-07-01 11:40:00 |
2019-07-01 21:40:00 | 2019-07-01 22:10:00 |
2019-07-03 22:00:00 | 2019-07-04 22:00:00 |
For each hour between the start_time and end_time, I would like to get a count of the minutes that belong to that hour. In other words, I would like to know how many minutes the process was running in each hour, with each hour labeled by the time it ends (end_hour).
For example, the first row would return something like this, since 15 minutes passed until end time 12:00.
end_hour | total_minutes |
---|---|
2019-07-01 12:00:00 | 15 |
Similarly, for the second row, the output would be
end_hour | total_minutes |
---|---|
2019-07-01 22:00:00 | 20 |
2019-07-01 23:00:00 | 10 |
For the final row, the output would be
end_hour | total_minutes |
---|---|
2019-07-03 23:00:00 | 60 |
2019-07-04 00:00:00 | 60 |
2019-07-04 01:00:00 | 60 |
... | ... |
2019-07-04 22:00:00 | 60 |
How do I achieve something like this in Python?
CodePudding user response:
You can use the built-in pandas function to_datetime to convert the strings to datetimes and then subtract start from end. Note that this gives the total duration of each row, not the breakdown per hour:
import pandas as pd
df = pd.DataFrame([['2019-07-01 11:25:00','2019-07-01 11:40:00'], ['2019-07-01 21:40:00', '2019-07-01 22:10:00'], ['2019-07-03 22:00:00', '2019-07-04 22:00:00']], columns=['start_time', 'end_time'])
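# .dt.total_seconds() / 60 gives the duration in minutes as a float;
# astype('timedelta64[m]') is rejected by pandas 2.x (only s, ms, us, ns are supported).
df['total_minutes'] = (pd.to_datetime(df['end_time']) - pd.to_datetime(df['start_time'])).dt.total_seconds() / 60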
>>> df
            start_time             end_time  total_minutes
0  2019-07-01 11:25:00  2019-07-01 11:40:00           15.0
1  2019-07-01 21:40:00  2019-07-01 22:10:00           30.0
2  2019-07-03 22:00:00  2019-07-04 22:00:00         1440.0
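The total_minutes column above is the duration of each whole row, so splitting it by hour needs one more step. A minimal sketch of that step, using only the standard library's datetime module and treating each interval as half-open, [start, end). The helper name split_by_hour is hypothetical and not part of pandas:

```python
from datetime import datetime, timedelta

def split_by_hour(start, end):
    """Return (end_hour, minutes) pairs for the half-open interval [start, end)."""
    parts = []
    cur = start
    while cur < end:
        # First full hour strictly after `cur`.
        next_hour = cur.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        # Stop at the hour boundary or at the end of the interval, whichever is first.
        stop = min(next_hour, end)
        parts.append((next_hour, int((stop - cur).total_seconds() // 60)))
        cur = stop
    return parts

fmt = "%Y-%m-%d %H:%M:%S"
second_row = split_by_hour(datetime.strptime("2019-07-01 21:40:00", fmt),
                           datetime.strptime("2019-07-01 22:10:00", fmt))
for end_hour, minutes in second_row:
    print(end_hour, minutes)
# 2019-07-01 22:00:00 20
# 2019-07-01 23:00:00 10
```

Calling the helper once per row and collecting the pairs gives one small table per row, as in the question.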
CodePudding user response:
The durations have minute precision, so let's up-sample to that frequency and count the minutes per hour that fall within one of the start_time - end_time intervals.
import pandas as pd
df = pd.DataFrame(
    {"start_time": ["2019-07-01 11:25:00", "2019-07-01 21:40:00", "2019-07-03 22:00:00"],
     "end_time": ["2019-07-01 11:40:00", "2019-07-01 22:10:00", "2019-07-04 22:00:00"]}
)
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df['minutes'] = (df['end_time'] - df['start_time']).dt.total_seconds()/60
# create an IntervalIndex which we can set as the axis (needed for re-indexing).
# subtract one minute from end_time so that the minute of the termination is excluded.
iv_idx = pd.IntervalIndex.from_arrays(df['start_time'],
                                      df['end_time'] - pd.Timedelta(minutes=1),
                                      closed='both')
# create a new index with the extended frequency:
new_idx = pd.date_range(df['start_time'].min(), df['end_time'].max(), freq='min')
# set the new index to get the extended frequency;
# all minutes will have the value of the whole interval
result = df['minutes'].set_axis(iv_idx).reindex(new_idx)
# we can now calculate the duration per hour by resampling and summing the
# boolean representation of the duration (1/0):
# (the 'H' alias is deprecated in favor of 'h' as of pandas 2.2)
result = result.fillna(0).astype(int).astype(bool).resample('H').sum()
result.index.name = 'start_hour'
Now you have the results anchored to start_hour (you can easily change to end hour by shifting the index by one hour):
print(result.loc["2019-07-01 11:00:00":"2019-07-01 12:00:00"])
# start_hour
# 2019-07-01 11:00:00 15
# 2019-07-01 12:00:00 0
# Freq: H, Name: minutes, dtype: int64
print(result.loc["2019-07-01 20:00:00":"2019-07-01 23:00:00"])
# start_hour
# 2019-07-01 20:00:00 0
# 2019-07-01 21:00:00 20
# 2019-07-01 22:00:00 10
# 2019-07-01 23:00:00 0
# Freq: H, Name: minutes, dtype: int64
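The shift to end_hour mentioned above could look roughly like this. It is a sketch under assumptions: the small series below is a hand-built stand-in for result (only the two non-zero hours of the second row), and the names by_end_hour and total_minutes are chosen here to match the question's tables.

```python
import pandas as pd

# Stand-in for the `result` series above: minutes per hour, labeled by the hour's start.
result = pd.Series(
    [0, 20, 10, 0],
    index=pd.DatetimeIndex(
        ["2019-07-01 20:00:00", "2019-07-01 21:00:00",
         "2019-07-01 22:00:00", "2019-07-01 23:00:00"],
        name="start_hour",
    ),
    name="minutes",
)

# Shift every label forward by one hour so that it marks the end of the hour.
by_end_hour = result.copy()
by_end_hour.index = (by_end_hour.index + pd.Timedelta(hours=1)).rename("end_hour")

# Drop hours in which nothing ran, to match the tables in the question.
by_end_hour = by_end_hour[by_end_hour > 0].rename("total_minutes")
print(by_end_hour)
```

Keep in mind that result is a single timeline for all rows together. If two rows ran in the same hour, their minutes would land in the same bucket, and overlapping intervals would likely make the reindex step fail, since looking up a timestamp in an IntervalIndex with overlapping intervals is ambiguous.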