I have a dataframe that records the movement of people from one location to another. I would like to calculate the time spent by each person_id in each end_location_id when not moving. The calculation is based on the difference between the start_ts of the next new travel entry and the end_ts of the last travel entry, per location_id. If no new travel entry is found for a person, the calculation should be the difference between current_time and end_ts.
I would like to replace the code that calculates duration with code that uses pandas' groupby (or equivalent) functions. Would this be possible? My current code is messy.
import pandas as pd
current_time = '2022-05-05 17:00'
df = pd.DataFrame(
    {
        "person_id": ["A", "B", "A", "C", "A", "C"],
        "start_location_id": [1, 5, 2, 7, 3, 8],
        "end_location_id": [2, 6, 3, 8, 2, 9],
        "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
        "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
    }
)
for i in df.index:
    df.loc[i, "duration"] = (df.loc[(df["person_id"] == df.loc[i, "person_id"]) & (df["start_location_id"] == df.loc[i, "end_location_id"]) & (df.index > i), "start_ts"].min() - df.loc[i, "end_ts"]).total_seconds()/3600
df.loc[df["duration"].isna(), "duration"] = (pd.to_datetime(current_time) -
    df.loc[df["duration"].isna(), "end_ts"]).dt.total_seconds()/3600
My desired output would be:
print(df)
>>
   person_id  start_location_id  end_location_id             start_ts  \
0          A                  1                2  2022-05-05 00:00:00
1          B                  5                6  2022-05-05 00:00:00
2          A                  2                3  2022-05-05 05:00:00
3          C                  7                8  2022-05-05 00:00:00
4          A                  3                2  2022-05-05 13:00:00
5          C                  8                9  2022-05-05 11:00:00
                end_ts  duration
0  2022-05-05 02:00:00       3.0
1  2022-05-05 03:00:00      14.0
2  2022-05-05 10:00:00       3.0
3  2022-05-05 04:00:00       7.0
4  2022-05-05 16:00:00       1.0
5  2022-05-05 12:00:00       5.0
CodePudding user response:
You can shift start_ts per group (GroupBy.shift), using your current time as the fill value for each person's last entry:
current_time = pd.Timestamp('2022-05-05 17:00') # works also with simple string
df['duration'] = (df
    .groupby('person_id')['start_ts']
    .shift(-1, fill_value=current_time)
    .sub(df['end_ts'])
    .dt.total_seconds().div(3600)
)
output:
   person_id  start_location_id  end_location_id             start_ts  \
0          A                  1                2  2022-05-05 00:00:00
1          B                  5                6  2022-05-05 00:00:00
2          A                  2                3  2022-05-05 05:00:00
3          C                  7                8  2022-05-05 00:00:00
4          A                  3                2  2022-05-05 13:00:00
5          C                  8                9  2022-05-05 11:00:00
                end_ts  duration
0  2022-05-05 02:00:00       3.0
1  2022-05-05 03:00:00      14.0
2  2022-05-05 10:00:00       3.0
3  2022-05-05 04:00:00       7.0
4  2022-05-05 16:00:00       1.0
5  2022-05-05 12:00:00       5.0
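Note that the shift above relies on two things that happen to hold in the sample data: each person's rows are already in chronological order, and every next trip departs from the location where the previous one ended. If either might not hold in your real data, a slightly more defensive variant could look like the sketch below. The names ordered, next_start, next_origin and same_place are only illustrative, and treating a non-matching next trip as "no next trip" is my assumption, not something stated in the question (the original loop searches all later rows, not just the next one).

```python
import pandas as pd

current_time = pd.Timestamp("2022-05-05 17:00")

df = pd.DataFrame(
    {
        "person_id": ["A", "B", "A", "C", "A", "C"],
        "start_location_id": [1, 5, 2, 7, 3, 8],
        "end_location_id": [2, 6, 3, 8, 2, 9],
        "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                    "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
        "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                                  "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
    }
)

# Sort so that "next row within a person" really is the next trip in time.
ordered = df.sort_values(["person_id", "start_ts"])
grouped = ordered.groupby("person_id")

# Start time and origin of each person's next trip (missing for the last trip).
next_start = grouped["start_ts"].shift(-1)
next_origin = grouped["start_location_id"].shift(-1)

# Assumption: only count the next trip if it departs from where this one ended.
same_place = next_origin.eq(ordered["end_location_id"])
next_start = next_start.where(same_place)

# Fall back to current_time where there is no usable next trip,
# then convert the waiting time to hours. Assignment aligns on the index.
df["duration"] = (
    next_start.fillna(current_time)
    .sub(ordered["end_ts"])
    .dt.total_seconds()
    .div(3600)
)

print(df[["person_id", "end_location_id", "duration"]])
```

Because the result is assigned back by index label, the original row order of df is preserved even though the computation runs on the sorted copy.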