My data looks like this
name first_appear last_appear
Bob 12:10:00 12:33:49
Jacob 12:31:00 13:29:12
How can I group it by time range and count like this
Interval Count
10:00 - 11:00 0
11:00 - 12:00 0
12:00 - 13:00 2
13:00 - 14:00 1
14:00 - 15:00 0
Explanation:
Bob appeared from 12:10
to 12:33
so add 1 to count 12:00 - 13:00
Jacob appeared from 12:31
to 13:29
so add 1 to count 12:00 - 13:00
and to count 13:00 - 14:00
I'm using Python 3.9, the input is a Pandas dataframe, but can be restructured if needed
first_appear
and last_appear
are datetime.time
objects
Thanks
CodePudding user response:
Ok so I may have a solution for you, it's a little brute force, since I don't know the constraints of your question nor do I have a minimally reproducible example, I used a nested for loop (I know), but it may be possible to use a pandas dataframe.apply method here. Anyways:
Your Frames:
import pandas as pd
from datetime import datetime, time
Data = pd.DataFrame(
{
"name":["Bob", "Jacob"],
"first_appear":[datetime.strptime("12:10:00", '%H:%M:%S').time(), datetime.strptime('12:31:00', '%H:%M:%S').time()],
"last_appear":[datetime.strptime('12:33:49', '%H:%M:%S').time(), datetime.strptime('13:29:12', '%H:%M:%S').time()]
}
)
#Assuming you want the names as the index, and not as a column
Data = Data.set_index("name", drop=True)
IntFrame = pd.DataFrame(
{"Interval":[time(i) for i in range(10, 16)],
"Count":0}
)
And the nested loop:
for index1, index2 in zip(IntFrame.index.values[:-1], IntFrame.index.values[1:]):
for i in Data.index.values:
First = Data.loc[i, "first_appear"]
Last = Data.loc[i, "last_appear"]
if (First >= IntFrame.loc[index1, "Interval"] and First < IntFrame.loc[index2, "Interval"]):
IntFrame.loc[index1, "Count"] = 1
if Last >= IntFrame.loc[index2, "Interval"]:
#gets the last position where 'Last' is greater than a time interval
Index3 = max(IntFrame.where(Last >= IntFrame["Interval"]).dropna().index.values)
IntFrame.loc[index2:Index3, "Count"] = 1
Which produces:
Interval Count
0 10:00:00 0
1 11:00:00 0
2 12:00:00 2
3 13:00:00 1
4 14:00:00 0
5 15:00:00 0
I urge you not to treat your intervals like a 2-tuple, since its implied on a time scale. It's bad form to do that when working with data frames.
CodePudding user response:
I'd recommend that whenever you work with pandas
that you use their date/time objects, i.e. Timestamp
and Timedelta
. Vectorised calculations will be much faster with these, as opposed to those from the datetime
module, and it opens the door for a whole lot of extra functionality.
If you are willing to work with Timedelta (or Timestamp would work too) then you can use this solution using staircase
, which is designed to be a part of the pandas ecosystem for handling step functions.
setup
import pandas as pd
from datetime import time
data in datetime.time
format:
df = pd.DataFrame({
"name":["Bob", "Jacob"],
"first_appear":[time.fromisoformat("12:10:00"), time.fromisoformat('12:31:00')],
"last_appear":[time.fromisoformat('12:33:49'), time.fromisoformat('13:29:12')],
})
Convert to pandas.Timedelta
df["first_appear"] = pd.to_timedelta(df["first_appear"].map(str))
df["last_appear"] = pd.to_timedelta(df["last_appear"].map(str))
solution
Your data is essentially a step function of the count value. When someone first appears the step function increases in value. When they last appear the step function decreases in value.
We'll round the "first_appear" times down to the start of the hour, so that the value of the step function at each hourly point will give us the total number of people who were active at any point in that hour:
df["first_appear"] = df["first_appear"].dt.floor("H")
We can then create a step function using the staircase.Stairs
object, passing it a dataframe and telling it which are the start times and end times
import staircase as sc
sf = sc.Stairs(df, start="first_appear", end="last_appear")
If you were using pandas.Timestamp
you could plot the step function with sf.plot()
but this functionality is not available with pandas.Timedelta
.
Next, create your hourly sample times:
sample_times = pd.timedelta_range("10:00:00", "15:00:00", freq=pd.Timedelta("1hr"))
Lastly, you query the value of the Stairs (step function) object by calling it like a method with the times you want to evaluate it at:
sf(sample_times, include_index=True)
This will give you a pandas.Series
indexed by pandas.Timedelta
s:
0 days 10:00:00 0
0 days 11:00:00 0
0 days 12:00:00 2
0 days 13:00:00 1
0 days 14:00:00 0
0 days 15:00:00 0
Freq: H, dtype: int64
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.