I have a dataframe df that records the movement of people from one location to another. I would like to create a new dataframe df2 that records the time when a location was entered and when the same location was left, based on the trip information in df.
import pandas as pd
import numpy as np
current_time = '2022-05-05 17:00'
df = pd.DataFrame({"person_id": ["A", "B", "A", "C", "A", "C"],
                   "start_location_id": [1, 5, 2, 7, 3, 8],
                   "end_location_id": [2, 6, 3, 8, 2, 9],
                   "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
                   "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
                   })
df would look like this:
print(df.to_string())
>>
person_id start_location_id end_location_id start_ts end_ts
0 A 1 2 2022-05-05 00:00:00 2022-05-05 02:00:00
1 B 5 6 2022-05-05 00:00:00 2022-05-05 03:00:00
2 A 2 3 2022-05-05 05:00:00 2022-05-05 10:00:00
3 C 7 8 2022-05-05 00:00:00 2022-05-05 04:00:00
4 A 3 2 2022-05-05 13:00:00 2022-05-05 16:00:00
5 C 8 9 2022-05-05 11:00:00 2022-05-05 12:00:00
I then create df2 using the following nested for loops. I would like to replace the code that sets up df2 with code that uses pandas' groupby (or equivalent) functions. My current attempt is as follows:
# Note: DataFrame.append is deprecated since pandas 1.4 and removed in 2.0.
df2 = pd.DataFrame(columns=["person_id", "location_id", "enter_timestamp", "exit_timestamp"])
for i in np.unique(df["person_id"]):
    # All trips of one person, re-indexed 0..n-1 in their original order
    df3 = df.loc[df["person_id"] == i].reset_index(drop=True)
    for j in df3.index:
        try:
            # The location is left when this person's next trip starts
            df2 = df2.append({'person_id': i,
                              'location_id': df3.loc[j, "end_location_id"],
                              'enter_timestamp': df3.loc[j, "end_ts"],
                              'exit_timestamp': df3.loc[j + 1, "start_ts"]}, ignore_index=True)
        except KeyError:
            # Last trip of this person: there is no next trip, so no exit time
            df2 = df2.append({'person_id': i,
                              'location_id': df3.loc[j, "end_location_id"],
                              'enter_timestamp': df3.loc[j, "end_ts"]}, ignore_index=True)
My desired output would be:
print(df2.to_string())
>>
person_id location_id enter_timestamp exit_timestamp
0 A 2 2022-05-05 02:00:00 2022-05-05 05:00:00
1 A 3 2022-05-05 10:00:00 2022-05-05 13:00:00
2 A 2 2022-05-05 16:00:00 NaT
3 B 6 2022-05-05 03:00:00 NaT
4 C 8 2022-05-05 04:00:00 2022-05-05 11:00:00
5 C 9 2022-05-05 12:00:00 NaT
How can I achieve this? Thanks
CodePudding user response:
First sort the rows by the person_id column with DataFrame.sort_values, then create the exit_timestamp column with DataFrameGroupBy.shift, which takes the start_ts of each person's next trip. Finally, rename the columns and select the expected ones using the list cols:
cols = ["person_id", "location_id", "enter_timestamp", "exit_timestamp"]
df = df.sort_values('person_id', ignore_index=True)
df['exit_timestamp'] = df.groupby('person_id')['start_ts'].shift(-1)
df = df.rename(columns={'end_location_id':'location_id', 'end_ts':'enter_timestamp'})[cols]
print(df)
person_id location_id enter_timestamp exit_timestamp
0 A 2 2022-05-05 02:00:00 2022-05-05 05:00:00
1 A 3 2022-05-05 10:00:00 2022-05-05 13:00:00
2 A 2 2022-05-05 16:00:00 NaT
3 B 6 2022-05-05 03:00:00 NaT
4 C 8 2022-05-05 04:00:00 2022-05-05 11:00:00
5 C 9 2022-05-05 12:00:00 NaT
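The same idea can be sketched as a small function that leaves df untouched and returns the result as df2, as the question asks. Two things here are assumptions rather than part of the answer above: the helper name build_stays is hypothetical, and the sort also uses start_ts as a second key so that each person's trips are in time order before shift(-1) is applied, even if the input rows are not already chronological.

```python
import pandas as pd


def build_stays(trips: pd.DataFrame) -> pd.DataFrame:
    """Return one row per stay: where a trip ended, when the person
    arrived there, and when their next trip started (NaT if none).

    Hypothetical helper, not part of pandas.
    """
    # Assumption: order by person and trip start so that shift(-1)
    # always looks at the chronologically next trip of the same person.
    ordered = trips.sort_values(["person_id", "start_ts"], ignore_index=True)
    return pd.DataFrame({
        "person_id": ordered["person_id"],
        "location_id": ordered["end_location_id"],
        "enter_timestamp": ordered["end_ts"],
        # Start of the next trip within the same person group = exit time
        "exit_timestamp": ordered.groupby("person_id")["start_ts"].shift(-1),
    })


# Sample data from the question
df = pd.DataFrame({
    "person_id": ["A", "B", "A", "C", "A", "C"],
    "start_location_id": [1, 5, 2, 7, 3, 8],
    "end_location_id": [2, 6, 3, 8, 2, 9],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00",
                                "2022-05-05 05:00", "2022-05-05 00:00",
                                "2022-05-05 13:00", "2022-05-05 11:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00",
                              "2022-05-05 10:00", "2022-05-05 04:00",
                              "2022-05-05 16:00", "2022-05-05 12:00"]),
})

df2 = build_stays(df)
print(df2.to_string())
```

On the sample data this should give the same six rows as the desired output in the question. Sorting on both keys is a defensive choice: sorting on person_id alone relies on the rows of each person already being in time order.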