Use groupby instead of a for loop to determine time spent in last locations

I have a dataframe that records the movement of people from one location to another. I would like to calculate the time spent by each person_id in each end_location_id when not moving. The calculation is based on the difference between start_ts of the next new travel entry and the end_ts of the last travel entry, per location_id. If no new travel entry is found per person, then the calculation should be the difference between current_time and end_ts.

I would like to replace the code that calculates duration with code that uses pandas' groupby (or equivalent) functions. Would this be possible? My current code is messy.

import pandas as pd

current_time = '2022-05-05 17:00'

df = pd.DataFrame(
{
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
}
)

for i in df.index:
    df.loc[i, "duration"] = (df.loc[(df["person_id"] == df.loc[i, "person_id"]) & (df["start_location_id"] == df.loc[i, "end_location_id"]) & (df.index > i), "start_ts"].min() - df.loc[i, "end_ts"]).total_seconds()/3600

df.loc[df["duration"].isna(), "duration"] = (
    pd.to_datetime(current_time) - df.loc[df["duration"].isna(), "end_ts"]
).dt.total_seconds() / 3600

My desired output would be:

print(df)
>>
  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00   
1         B                  5                6 2022-05-05 00:00:00   
2         A                  2                3 2022-05-05 05:00:00   
3         C                  7                8 2022-05-05 00:00:00   
4         A                  3                2 2022-05-05 13:00:00   
5         C                  8                9 2022-05-05 11:00:00   

               end_ts  duration  
0 2022-05-05 02:00:00       3.0  
1 2022-05-05 03:00:00      14.0  
2 2022-05-05 10:00:00       3.0  
3 2022-05-05 04:00:00       7.0  
4 2022-05-05 16:00:00       1.0  
5 2022-05-05 12:00:00       5.0

CodePudding user response:

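You can shift start_ts per person (GroupBy.shift), so that each row sees the start of that person's next trip, and fill the missing value for each person's last trip with your current time. This assumes each person's rows are in chronological order and that every trip starts where the previous one ended, as in your sample data: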

current_time = pd.Timestamp('2022-05-05 17:00') # works also with simple string

df['duration'] = (df
                  .groupby('person_id')['start_ts']
                  .shift(-1, fill_value=current_time)
                  .sub(df['end_ts'])
                  .dt.total_seconds().div(3600)
                 )

output:

  person_id  start_location_id  end_location_id            start_ts  \
0         A                  1                2 2022-05-05 00:00:00   
1         B                  5                6 2022-05-05 00:00:00   
2         A                  2                3 2022-05-05 05:00:00   
3         C                  7                8 2022-05-05 00:00:00   
4         A                  3                2 2022-05-05 13:00:00   
5         C                  8                9 2022-05-05 11:00:00   

               end_ts  duration  
0 2022-05-05 02:00:00       3.0  
1 2022-05-05 03:00:00      14.0  
2 2022-05-05 10:00:00       3.0  
3 2022-05-05 04:00:00       7.0  
4 2022-05-05 16:00:00       1.0  
5 2022-05-05 12:00:00       5.0
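
If the rows are not guaranteed to be in chronological order, a variant along the following lines might be safer. This is only a sketch under assumptions: the names ordered, grouped, next_start, next_loc and continuous are introduced here for illustration, and the continuity check assumes that a person's next trip should always start at the previous end_location_id, which holds for the sample data but is not stated as a general rule in the question.

```python
import pandas as pd

current_time = pd.Timestamp("2022-05-05 17:00")

df = pd.DataFrame(
    {
        "person_id": ["A", "B", "A", "C", "A", "C"],
        "start_location_id": [1, 5, 2, 7, 3, 8],
        "end_location_id": [2, 6, 3, 8, 2, 9],
        "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
        "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
    }
)

# Sort so that each person's trips are in chronological order. Results are
# assigned back by index, so the original row order of df is preserved.
ordered = df.sort_values(["person_id", "start_ts"])
grouped = ordered.groupby("person_id")

# Start time and start location of each person's next trip
# (missing for the last trip of each person).
next_start = grouped["start_ts"].shift(-1)
next_loc = grouped["start_location_id"].shift(-1)

# Assumed sanity check: the next trip starts where the current one ended.
continuous = next_loc.isna() | (next_loc == ordered["end_location_id"])
assert continuous.all(), "a trip does not start where the previous one ended"

# Hours between the end of a trip and the start of the next one,
# falling back to current_time for each person's last trip.
df["duration"] = (
    next_start.fillna(current_time)
    .sub(ordered["end_ts"])
    .dt.total_seconds()
    .div(3600)
)
```

On the sample data this gives the same duration column as the loop in the question. If the continuity check fails on real data, the plain per-person shift no longer matches the location-based lookup in the original loop, and the two approaches would have to be reconciled first.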