Create a dataframe containing time spent in location based on a dataframe with trip information in P

Time:05-19

I have a dataframe df that records the movement of people from one location to another. I would like to create a new dataframe df2 that records the time when a location was entered and when the same location was left, based on the trip information in df.

import pandas as pd
import numpy as np

current_time = '2022-05-05 17:00'

df = pd.DataFrame({"person_id": ["A", "B", "A", "C", "A", "C"],
                   "start_location_id": [1, 5, 2, 7, 3, 8],
                   "end_location_id": [2, 6, 3, 8, 2, 9],
                   "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
                   "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
})

df would look like this:

print(df.to_string())
>>
  person_id  start_location_id  end_location_id            start_ts              end_ts
0         A                  1                2 2022-05-05 00:00:00 2022-05-05 02:00:00
1         B                  5                6 2022-05-05 00:00:00 2022-05-05 03:00:00
2         A                  2                3 2022-05-05 05:00:00 2022-05-05 10:00:00
3         C                  7                8 2022-05-05 00:00:00 2022-05-05 04:00:00
4         A                  3                2 2022-05-05 13:00:00 2022-05-05 16:00:00
5         C                  8                9 2022-05-05 11:00:00 2022-05-05 12:00:00

I then create df2 using the following nested for loops. I would like to replace the code that sets up df2 with code that uses pandas' groupby (or equivalent) functions. My current attempt is as follows:

df2 = pd.DataFrame(columns = ["person_id", "location_id", "enter_timestamp", "exit_timestamp"])

for i in np.unique(df["person_id"]):
    df3 = df.loc[df["person_id"] == i].reset_index(drop = True)
    for j in df3.index:
        try:
            df2 = df2.append({'person_id': i,
                              'location_id' : df3.loc[j, "end_location_id"],
                              'enter_timestamp' : df3.loc[j, "end_ts"],
                              'exit_timestamp' : df3.loc[j + 1, "start_ts"]}, ignore_index=True)
        except KeyError:  # last trip of this person: there is no next row, so no exit time
            df2 = df2.append({'person_id': i,
                          'location_id' : df3.loc[j, "end_location_id"],
                          'enter_timestamp' : df3.loc[j, "end_ts"]}, ignore_index=True)
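
Note for readers on pandas 2.0 or later: DataFrame.append was removed in that release, so the loop above no longer runs there as written. A minimal sketch of the same row-by-row logic, collecting dictionaries in a list and building df2 once at the end, might look like this (the names rows, trips and has_next are illustrative, not part of the original code):

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                              "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
})

rows = []  # one dict per stay; turned into a DataFrame once at the end
for person, trips in df.groupby("person_id"):
    trips = trips.reset_index(drop=True)
    for j in trips.index:
        has_next = j + 1 < len(trips)  # False on the person's last recorded trip
        rows.append({
            "person_id": person,
            "location_id": trips.loc[j, "end_location_id"],
            "enter_timestamp": trips.loc[j, "end_ts"],
            # The stay ends when the next trip starts; unknown (NaT) after the last trip.
            "exit_timestamp": trips.loc[j + 1, "start_ts"] if has_next else pd.NaT,
        })

df2 = pd.DataFrame(rows, columns=["person_id", "location_id", "enter_timestamp", "exit_timestamp"])
print(df2.to_string())
```

Like the original loop, this assumes each person's trips already appear in chronological order in df.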

My desired output would be:

print(df2.to_string())
>>
  person_id location_id     enter_timestamp      exit_timestamp
0         A           2 2022-05-05 02:00:00 2022-05-05 05:00:00
1         A           3 2022-05-05 10:00:00 2022-05-05 13:00:00
2         A           2 2022-05-05 16:00:00                 NaT
3         B           6 2022-05-05 03:00:00                 NaT
4         C           8 2022-05-05 04:00:00 2022-05-05 11:00:00
5         C           9 2022-05-05 12:00:00                 NaT

How can I achieve this? Thanks

Answer:

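First sort the rows by person_id with DataFrame.sort_values, then create the exit_timestamp column with DataFrameGroupBy.shift: shifting start_ts by -1 within each person gives the start time of that person's next trip, which is when the current location was left (NaT if there is no next trip). Finally, rename the columns and select the expected ones with the list cols. This assumes each person's trips are already in chronological order in df: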

cols = ["person_id", "location_id", "enter_timestamp", "exit_timestamp"]
df = df.sort_values('person_id', ignore_index=True)
df['exit_timestamp'] = df.groupby('person_id')['start_ts'].shift(-1)
df = df.rename(columns={'end_location_id':'location_id', 'end_ts':'enter_timestamp'})[cols]
print(df)
  person_id  location_id     enter_timestamp      exit_timestamp
0         A            2 2022-05-05 02:00:00 2022-05-05 05:00:00
1         A            3 2022-05-05 10:00:00 2022-05-05 13:00:00
2         A            2 2022-05-05 16:00:00                 NaT
3         B            6 2022-05-05 03:00:00                 NaT
4         C            8 2022-05-05 04:00:00 2022-05-05 11:00:00
5         C            9 2022-05-05 12:00:00                 NaT
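
If the trips are not guaranteed to be in chronological order, one option is to sort by both person_id and start_ts before shifting. The following is only a sketch of that variant, not a definitive implementation: the function name trips_to_stays is hypothetical, and it returns a new dataframe instead of overwriting df.

```python
import pandas as pd

def trips_to_stays(trips: pd.DataFrame) -> pd.DataFrame:
    """Turn one-row-per-trip data into one-row-per-stay data (hypothetical helper)."""
    # Sort chronologically within each person so that shift(-1) really
    # picks up the start of that person's next trip.
    ordered = trips.sort_values(["person_id", "start_ts"], ignore_index=True)
    return pd.DataFrame({
        "person_id": ordered["person_id"],
        "location_id": ordered["end_location_id"],
        "enter_timestamp": ordered["end_ts"],
        # A stay ends when the same person's next trip starts; NaT if there is none.
        "exit_timestamp": ordered.groupby("person_id")["start_ts"].shift(-1),
    })

df = pd.DataFrame({
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                              "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
})

df2 = trips_to_stays(df)
print(df2.to_string())
```

On the sample data this should give the same six rows as the desired output in the question, while leaving the original df and its column names untouched.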