I've read through a lot of answers on how to aggregate rows in a pandas dataframe, but I've had a hard time figuring out how to apply them to my case. I have a dataframe containing trip data for vehicles. Each vehicle can make several trips within a given day. Here's an example:
vehicleID | start pos time | end pos time | duration (seconds) | meters travelled |
---|---|---|---|---|
XXXXX | 2021-10-26 06:01:12+00:00 | 2021-10-26 06:25:06+00:00 | 1434 | 2000 |
XXXXX | 2021-10-19 13:49:09+00:00 | 2021-10-19 13:59:29+00:00 | 620 | 5000 |
XXXXX | 2021-10-19 13:20:36+00:00 | 2021-10-19 13:26:40+00:00 | 364 | 70000 |
YYYYY | 2022-09-10 15:14:07+00:00 | 2022-09-10 15:29:39+00:00 | 932 | 8000 |
YYYYY | 2022-08-28 15:16:35+00:00 | 2022-08-28 15:28:43+00:00 | 728 | 90000 |
It often happens that the start time of a trip, on a given day, is only a few minutes after the end time of the previous trip, which means that these can be chained into a single trip.
I would like to aggregate the rows so that if a trip's start pos time overlaps with the previous trip's end pos time, or the gap between the two is less than 30 minutes, they become a single row, summing the duration in seconds and the meters travelled, per vehicleID. The new df should also contain the trips that didn't require any aggregation (edited for clarity). This is the output I'm trying to get:
vehicleID | start pos time | end pos time | duration (seconds) | meters travelled |
---|---|---|---|---|
XXXXX | 2021-10-26 06:01:12+00:00 | 2021-10-26 06:25:06+00:00 | 1434 | 2000 |
XXXXX | 2021-10-19 13:20:36+00:00 | 2021-10-19 13:59:29+00:00 | 984 | 75000 |
YYYYY | 2022-09-10 15:14:07+00:00 | 2022-09-10 15:29:39+00:00 | 932 | 8000 |
YYYYY | 2022-08-28 15:16:35+00:00 | 2022-08-28 15:28:43+00:00 | 728 | 90000 |
I feel like a groupby and an agg would be involved, but I have no clue how to go about this. Any help would be appreciated! Thanks!
CodePudding user response:
I believe I have found a solution.
Setup
import pandas as pd
from datetime import timedelta

data = {'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
                      4: 'YYYYY'},
        'start pos time': {0: '2021-10-26 06:01:12+00:00',
                           1: '2021-10-19 13:49:09+00:00',
                           2: '2021-10-19 13:20:36+00:00',
                           3: '2022-09-10 15:14:07+00:00',
                           4: '2022-08-28 15:16:35+00:00'},
        'end pos time': {0: '2021-10-26 06:25:06+00:00',
                         1: '2021-10-19 13:59:29+00:00',
                         2: '2021-10-19 13:26:40+00:00',
                         3: '2022-09-10 15:29:39+00:00',
                         4: '2022-08-28 15:28:43+00:00'},
        'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
        'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
        }

df = pd.DataFrame(data)
Assumptions:
- All groups (unique values) in col `vehicleID` are in consecutive order.
- Per group in col `vehicleID`, the associated timestamps in col `start pos time` are sorted in descending order.
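If the data does not already satisfy these assumptions, sorting first should establish them. Below is a minimal sketch, assuming the column names from the question; the frame `trips` is only an illustrative, shuffled subset of the question's rows.

```python
import pandas as pd

# Illustrative sample in shuffled order (a subset of the question's rows)
trips = pd.DataFrame({
    'vehicleID': ['YYYYY', 'XXXXX', 'YYYYY', 'XXXXX'],
    'start pos time': pd.to_datetime([
        '2022-08-28 15:16:35+00:00',
        '2021-10-19 13:20:36+00:00',
        '2022-09-10 15:14:07+00:00',
        '2021-10-26 06:01:12+00:00',
    ]),
})

# Put each vehicle's rows together and order its trips newest first
trips = (trips
         .sort_values(['vehicleID', 'start pos time'], ascending=[True, False])
         .reset_index(drop=True))
print(trips)
```

Resetting the index is optional; it only keeps the row labels in line with the new order.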
Problem
Within each group inside col `vehicleID`, if a `start pos time` is earlier than the `end pos time` of the previous trip (i.e. the one in the next row), or less than 30 minutes later, then these rows should become a single row, with `min` for `start pos time`, `max` for `end pos time`, and `sum` for both duration and meters travelled.
Solution
# if still needed, change date time strings into timestamps
df[['start pos time','end pos time']] = df[['start pos time','end pos time']].\
    apply(pd.to_datetime)

# check (end time + timedelta 29m 59s) < (start time shifted)
# i.e. the gap to the later trip (previous row) is 30 minutes or more
cond1 = (df.loc[:,'end pos time'] + timedelta(minutes=29, seconds=59))\
    .lt(df.loc[:,'start pos time'].shift(1))

# check `vehicleID` != its own shift (this means a new group is starting)
# i.e. a new group should always get `True`
cond2 = (df.loc[:,'vehicleID'] != df.loc[:,'vehicleID'].shift(1))

# cumsum result of OR check conds
cond = (cond1 | cond2).cumsum()

# apply groupby on ['vehicleID' & cond] and aggregate appropriate functions
# (adding vehicleID is now unnecessary, but this keeps the col in the data)
res = df.groupby(['vehicleID', cond], as_index=False).agg(
    {'start pos time':'min',
     'end pos time':'max',
     'duration (seconds)':'sum',
     'meters travelled':'sum'}
    )

print(res)
  vehicleID            start pos time              end pos time  \
0     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00
1     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00
2     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00
3     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00

   duration (seconds)  meters travelled
0                1434              2000
1                 984             75000
2                 932              8000
3                 728             90000
I've run a check: the solution should also work if more than two consecutive trips each stay within the defined range of the one before.
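One way to reproduce such a check is sketched below. It assumes the same column names and the same newest-first ordering; the vehicle `ZZZZZ` and its three trips are hypothetical values made up for illustration, and `pd.Timedelta` is used in place of `datetime.timedelta` only to keep the snippet self-contained.

```python
import pandas as pd

# Hypothetical vehicle with three back-to-back trips (newest first);
# each trip starts less than 30 minutes after the earlier one ended
chain = pd.DataFrame({
    'vehicleID': ['ZZZZZ'] * 3,
    'start pos time': pd.to_datetime(['2021-10-19 14:10:00+00:00',
                                      '2021-10-19 13:49:09+00:00',
                                      '2021-10-19 13:20:36+00:00']),
    'end pos time': pd.to_datetime(['2021-10-19 14:20:00+00:00',
                                    '2021-10-19 13:59:29+00:00',
                                    '2021-10-19 13:26:40+00:00']),
    'duration (seconds)': [600, 620, 364],
    'meters travelled': [1000, 5000, 70000],
})

# A new group starts when the gap to the later trip is 30 minutes or more ...
gap_too_long = (chain['end pos time'] + pd.Timedelta(minutes=29, seconds=59)).lt(
    chain['start pos time'].shift(1))
# ... or when the vehicle changes
new_vehicle = chain['vehicleID'] != chain['vehicleID'].shift(1)
group_id = (gap_too_long | new_vehicle).cumsum()

merged = chain.groupby(['vehicleID', group_id], as_index=False).agg(
    {'start pos time': 'min',
     'end pos time': 'max',
     'duration (seconds)': 'sum',
     'meters travelled': 'sum'})
print(merged)
```

With these made-up values the three rows should collapse into a single trip running from 13:20:36 to 14:20:00, with 1584 seconds and 76000 meters.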
CodePudding user response:
There is probably a more efficient way to code this up, but something like this should work (the new_df has what you're looking for):
NOTE: code below assumes start and end times are in datetime format
import pandas as pd

df = pd.DataFrame({'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
                                 4: 'YYYYY'},
                   'start pos time': {0: '2021-10-26 06:01:12+00:00',
                                      1: '2021-10-19 13:49:09+00:00',
                                      2: '2021-10-19 13:20:36+00:00',
                                      3: '2022-09-10 15:14:07+00:00',
                                      4: '2022-08-28 15:16:35+00:00'},
                   'end pos time': {0: '2021-10-26 06:25:06+00:00',
                                    1: '2021-10-19 13:59:29+00:00',
                                    2: '2021-10-19 13:26:40+00:00',
                                    3: '2022-09-10 15:29:39+00:00',
                                    4: '2022-08-28 15:28:43+00:00'},
                   'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
                   'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
                   })

# convert the time columns to datetimes if they are still strings
df[['start pos time', 'end pos time']] = df[['start pos time', 'end pos time']].apply(pd.to_datetime)

# sort dataframe by ID and then start time of trip
df = df.sort_values(by=['vehicleID', 'start pos time'])

# create a new column with the end time of the previous ride
df['prev end'] = df['end pos time'].shift(1)

# difference, in seconds, between the start time of the current trip and the end time of the prior one
df['diff'] = (df['start pos time'] - df['prev end']).dt.total_seconds()

# a trip is merged into the prior one where vehicle IDs are the same and the
# difference is less than or equal to 30 minutes
merge = (df['vehicleID'] == df['vehicleID'].shift(1)) & (df['diff'] <= 30*60)

# blank the start time of merged trips and forward-fill it from the prior trip,
# so that chains of more than two trips also end up with one shared start time
df['start pos time'] = df['start pos time'].mask(merge).ffill()

# create a new dataframe, grouped by vehicle ID and trip start time, using the maximum end time for each group
new_df = df.groupby(['vehicleID', 'start pos time'], as_index=False).agg({'end pos time': 'max',
                                                                         'duration (seconds)': 'sum',
                                                                         'meters travelled': 'sum'})
CodePudding user response:
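# assumes the columns were renamed with underscores, the time columns are
# datetimes, and each vehicle's trips are ordered newest first
def func(d):
    # a trip is merged into the earlier one (next row) if it starts
    # less than 30 minutes after that trip ended
    mask = d.start_pos_time.sub(d.end_pos_time.shift(-1)).lt('30m')
    d.loc[mask, 'start_pos_time'] = d.start_pos_time.shift(-1)
    d = d.groupby('start_pos_time', as_index=False).agg({'end_pos_time': 'max', 'meters_travelled': 'sum'})
    return d

df = df.groupby('vehicleID').apply(func).reset_index('vehicleID').reset_index(drop=True)
# duration is recomputed from the merged start and end times
df['duration_(seconds)'] = (df.end_pos_time - df.start_pos_time).dt.total_seconds()
print(df)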
Output:
  vehicleID            start_pos_time              end_pos_time  meters_travelled  duration_(seconds)
0     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00             75000              2333.0
1     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00              2000              1434.0
2     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00             90000               728.0
3     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00              8000               932.0
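Note that the merged XXXXX trip shows 2333 seconds here, whereas the expected output in the question shows 984. The difference appears to come from how the duration is defined: this answer takes end minus start, which includes the gap between the two trips, while the question sums the original durations. A small sketch of both calculations, using only values from the question:

```python
import pandas as pd

# The two XXXXX trips that get merged (values from the question)
start = pd.Timestamp('2021-10-19 13:20:36+00:00')  # start of the earlier trip
end = pd.Timestamp('2021-10-19 13:59:29+00:00')    # end of the later trip
durations = [364, 620]                             # seconds per original trip

elapsed = (end - start).total_seconds()  # wall-clock time, gap included
driven = sum(durations)                  # time actually spent on trips

print(elapsed, driven)
```

If the summed value is what is needed, adding the duration column to the `agg` call with `'sum'` instead of recomputing it should give 984.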