Group and count by time interval

My data looks like this

name     first_appear      last_appear
Bob      12:10:00          12:33:49
Jacob    12:31:00          13:29:12

How can I group it by time range and count like this

  Interval        Count
10:00 - 11:00       0
11:00 - 12:00       0
12:00 - 13:00       2
13:00 - 14:00       1
14:00 - 15:00       0

Explanation:

Bob appeared from 12:10 to 12:33 so add 1 to count 12:00 - 13:00

Jacob appeared from 12:31 to 13:29 so add 1 to count 12:00 - 13:00 and to count 13:00 - 14:00

I'm using Python 3.9, the input is a Pandas dataframe, but can be restructured if needed first_appear and last_appear are datetime.time objects

Thanks

CodePudding user response：

Ok so I may have a solution for you, it's a little brute force, since I don't know the constraints of your question nor do I have a minimally reproducible example, I used a nested for loop (I know), but it may be possible to use a pandas dataframe.apply method here. Anyways:

Your Frames:

import pandas as pd
from datetime import datetime, time

Data = pd.DataFrame(
{
 "name":["Bob", "Jacob"],
 "first_appear":[datetime.strptime("12:10:00", '%H:%M:%S').time(), datetime.strptime('12:31:00', '%H:%M:%S').time()],
 "last_appear":[datetime.strptime('12:33:49', '%H:%M:%S').time(), datetime.strptime('13:29:12', '%H:%M:%S').time()]
 }
)

#Assuming you want the names as the index, and not as a column 
Data = Data.set_index("name", drop=True)

IntFrame = pd.DataFrame(
    {"Interval":[time(i) for i in range(10, 16)],
     "Count":0}
    )

And the nested loop:

for index1, index2 in zip(IntFrame.index.values[:-1], IntFrame.index.values[1:]):
for i in Data.index.values:
    First = Data.loc[i, "first_appear"]
    Last = Data.loc[i, "last_appear"]
    if (First >= IntFrame.loc[index1, "Interval"] and First < IntFrame.loc[index2, "Interval"]):
        IntFrame.loc[index1, "Count"]  = 1
        if Last >= IntFrame.loc[index2, "Interval"]:
            #gets the last position where 'Last' is greater than a time interval
            Index3 = max(IntFrame.where(Last >= IntFrame["Interval"]).dropna().index.values)
            IntFrame.loc[index2:Index3, "Count"]  = 1

Which produces:

   Interval  Count
0  10:00:00      0
1  11:00:00      0
2  12:00:00      2
3  13:00:00      1
4  14:00:00      0
5  15:00:00      0

I urge you not to treat your intervals like a 2-tuple, since its implied on a time scale. It's bad form to do that when working with data frames.

CodePudding user response：

I'd recommend that whenever you work with pandas that you use their date/time objects, i.e. Timestamp and Timedelta. Vectorised calculations will be much faster with these, as opposed to those from the datetime module, and it opens the door for a whole lot of extra functionality.

If you are willing to work with Timedelta (or Timestamp would work too) then you can use this solution using staircase, which is designed to be a part of the pandas ecosystem for handling step functions.

setup

import pandas as pd
from datetime import time

data in datetime.time format:

df = pd.DataFrame({
    "name":["Bob", "Jacob"],
    "first_appear":[time.fromisoformat("12:10:00"), time.fromisoformat('12:31:00')],
    "last_appear":[time.fromisoformat('12:33:49'), time.fromisoformat('13:29:12')],
})

Convert to pandas.Timedelta

df["first_appear"] = pd.to_timedelta(df["first_appear"].map(str))
df["last_appear"] = pd.to_timedelta(df["last_appear"].map(str))

solution

Your data is essentially a step function of the count value. When someone first appears the step function increases in value. When they last appear the step function decreases in value.

We'll round the "first_appear" times down to the start of the hour, so that the value of the step function at each hourly point will give us the total number of people who were active at any point in that hour:

df["first_appear"] = df["first_appear"].dt.floor("H")

We can then create a step function using the staircase.Stairs object, passing it a dataframe and telling it which are the start times and end times

import staircase as sc

sf = sc.Stairs(df, start="first_appear", end="last_appear")

If you were using pandas.Timestamp you could plot the step function with sf.plot() but this functionality is not available with pandas.Timedelta.

Next, create your hourly sample times:

sample_times = pd.timedelta_range("10:00:00", "15:00:00", freq=pd.Timedelta("1hr"))

Lastly, you query the value of the Stairs (step function) object by calling it like a method with the times you want to evaluate it at:

sf(sample_times, include_index=True)

This will give you a pandas.Series indexed by pandas.Timedeltas:

0 days 10:00:00    0
0 days 11:00:00    0
0 days 12:00:00    2
0 days 13:00:00    1
0 days 14:00:00    0
0 days 15:00:00    0
Freq: H, dtype: int64

note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.