How to get minutes per hour for date range (Python)?

I have a table like this, which contains the start time and end time of some process.

start_time	end_time
2019-07-01 11:25:00	2019-07-01 11:40:00
2019-07-01 21:40:00	2019-07-01 22:10:00
2019-07-03 22:00:00	2019-07-04 22:00:00

I would like to get, for each hour between start_time and end_time, the number of minutes that fall within that hour. In other words, I want to know how many minutes the process was running during the hour that ends at each end_hour.

For example, the first row would return something like this, since the process ran for 15 minutes during the hour ending at 12:00.

end_hour	total_minutes
2019-07-01 12:00:00	15

Similarly, for the second row, the output would be

end_hour	total_minutes
2019-07-01 22:00:00	20
2019-07-01 23:00:00	10

For the final row, the output would be

end_hour	total_minutes
2019-07-03 23:00:00	60
2019-07-04 00:00:00	60
2019-07-04 01:00:00	60
...	...
2019-07-04 22:00:00	60

How do I achieve something like this in Python?

CodePudding user response：

You can use the built-in pandas function to_datetime to convert the strings to datetimes, and then subtract start_time from end_time to get the total duration of each row in minutes:

import pandas as pd
df = pd.DataFrame([['2019-07-01 11:25:00','2019-07-01 11:40:00'], ['2019-07-01 21:40:00', '2019-07-01 22:10:00'], ['2019-07-03 22:00:00', '2019-07-04 22:00:00']], columns=['start_time', 'end_time'])
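# .dt.total_seconds() / 60 is used here because recent pandas versions (2.0+)
# reject .astype('timedelta64[m]'); only 's', 'ms', 'us' and 'ns' are supported.
df['total_minutes'] = (pd.to_datetime(df['end_time']) - pd.to_datetime(df['start_time'])).dt.total_seconds() / 60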
>>> df
            start_time             end_time  total_minutes
0  2019-07-01 11:25:00  2019-07-01 11:40:00           15.0
1  2019-07-01 21:40:00  2019-07-01 22:10:00           30.0
2  2019-07-03 22:00:00  2019-07-04 22:00:00         1440.0
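
The totals above cover each row as a whole, not the per-hour breakdown the question asks for. As a rough sketch of one way to split a single interval at the hour boundaries, using only the standard library: the helper name minutes_per_hour is made up for this illustration, and it assumes start and end are datetime objects with end after start.

```python
from datetime import datetime, timedelta

def minutes_per_hour(start, end):
    """Yield (end_hour, minutes) for each clock hour the interval touches.

    Hypothetical helper for illustration; assumes start < end.
    """
    # Start of the clock hour that contains `start`.
    hour_start = start.replace(minute=0, second=0, microsecond=0)
    while hour_start < end:
        hour_end = hour_start + timedelta(hours=1)
        # Overlap of [start, end) with [hour_start, hour_end).
        overlap = min(end, hour_end) - max(start, hour_start)
        yield hour_end, int(overlap.total_seconds() // 60)
        hour_start = hour_end

fmt = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2019-07-01 21:40:00", fmt)
end = datetime.strptime("2019-07-01 22:10:00", fmt)
for end_hour, minutes in minutes_per_hour(start, end):
    print(end_hour, minutes)
# 2019-07-01 22:00:00 20
# 2019-07-01 23:00:00 10
```

Applied to the third row, the same loop produces 24 rows of 60 minutes each, ending at 2019-07-04 22:00:00.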

CodePudding user response：

The durations have minute precision, so let's up-sample to that frequency and count the minutes per hour that fall within one of the start_time - end_time intervals.

import pandas as pd

df = pd.DataFrame(
       {"start_time": ["2019-07-01 11:25:00", "2019-07-01 21:40:00", "2019-07-03 22:00:00"],
        "end_time":   ["2019-07-01 11:40:00", "2019-07-01 22:10:00", "2019-07-04 22:00:00"]}
       )

df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df['minutes'] = (df['end_time'] - df['start_time']).dt.total_seconds()/60

# create an IntervalIndex which we can set as the axis (needed for re-indexing).
# subtract one minute from end_time so that the minute of the termination is excluded.
iv_idx = pd.IntervalIndex.from_arrays(df['start_time'],
                                      df['end_time']-pd.Timedelta(minutes=1),
                                      closed='both')

# create a new index with the extended frequency:
new_idx = pd.date_range(df['start_time'].min(), df['end_time'].max(), freq='min')

# set the new index to get the extended frequency;
# all minutes will have the value of the whole interval
result = df['minutes'].set_axis(iv_idx).reindex(new_idx)

# we can now calculate the duration per hour by resampling and summing the
# boolean representation of the duration (1/0).
# note: pandas 2.2+ deprecates the 'H' alias in favour of lowercase 'h'.
result = result.fillna(0).astype(int).astype(bool).resample('H').sum()
result.index.name = 'start_hour'

Now you have the results anchored to start_hour (you can easily change to end hour by shifting the index by one hour):

print(result.loc["2019-07-01 11:00:00":"2019-07-01 12:00:00"])
# start_hour
# 2019-07-01 11:00:00    15
# 2019-07-01 12:00:00     0
# Freq: H, Name: minutes, dtype: int64

print(result.loc["2019-07-01 20:00:00":"2019-07-01 23:00:00"])
# start_hour
# 2019-07-01 20:00:00     0
# 2019-07-01 21:00:00    20
# 2019-07-01 22:00:00    10
# 2019-07-01 23:00:00     0
# Freq: H, Name: minutes, dtype: int64
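
To get the end_hour layout from the question, the index can be shifted forward by one hour and the empty hours dropped. The sketch below is only an illustration: it uses a small hand-built stand-in for result (the values printed above for the second row), and the names by_end_hour and total_minutes are chosen here, not taken from the answer.

```python
import pandas as pd

# Stand-in for `result`: minutes per hour, labelled by the start of the hour.
result = pd.Series(
    [0, 20, 10, 0],
    index=pd.DatetimeIndex(
        ["2019-07-01 20:00:00", "2019-07-01 21:00:00",
         "2019-07-01 22:00:00", "2019-07-01 23:00:00"],
        name="start_hour",
    ),
    name="minutes",
)

# Keep only the hours in which the process actually ran.
by_end_hour = result[result > 0].copy()

# Relabel each row with the end of its hour instead of the start.
by_end_hour.index = by_end_hour.index + pd.Timedelta(hours=1)
by_end_hour.index.name = "end_hour"
by_end_hour = by_end_hour.rename("total_minutes")

print(by_end_hour)
```

This leaves two rows, 20 minutes for the hour ending at 2019-07-01 22:00:00 and 10 minutes for the hour ending at 2019-07-01 23:00:00, which matches the expected output for the second row of the question.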