Aggregating Pandas DataFrame rows based on multiple criteria (userID, end date = start date, etc.)


I've read through a lot of answers on how to aggregate rows in a pandas dataframe, but I've had a hard time figuring out how to apply them to my case. I have a dataframe containing trip data for vehicles, where each vehicle can make several trips within a given day. Here's an example:

vehicleID  start pos time             end pos time               duration (seconds)  meters travelled
XXXXX      2021-10-26 06:01:12+00:00  2021-10-26 06:25:06+00:00  1434                2000
XXXXX      2021-10-19 13:49:09+00:00  2021-10-19 13:59:29+00:00  620                 5000
XXXXX      2021-10-19 13:20:36+00:00  2021-10-19 13:26:40+00:00  364                 70000
YYYYY      2022-09-10 15:14:07+00:00  2022-09-10 15:29:39+00:00  932                 8000
YYYYY      2022-08-28 15:16:35+00:00  2022-08-28 15:28:43+00:00  728                 90000

It often happens that the start time of a trip, on a given day, is only a few minutes after the end time of the previous trip, which means that these can be chained into a single trip.

I would like to aggregate the rows so that if a trip's start pos time overlaps with the previous trip's end pos time, or there is a gap of less than 30 minutes between the two, they become a single row, summing the duration in seconds and the meters travelled, per vehicleID. The new df should also contain those trips that didn't require any aggregation (edited for clarity). So this is the output I'm trying to get:

vehicleID  start pos time             end pos time               duration (seconds)  meters travelled
XXXXX      2021-10-26 06:01:12+00:00  2021-10-26 06:25:06+00:00  1434                2000
XXXXX      2021-10-19 13:20:36+00:00  2021-10-19 13:59:29+00:00  984                 75000
YYYYY      2022-09-10 15:14:07+00:00  2022-09-10 15:29:39+00:00  932                 8000
YYYYY      2022-08-28 15:16:35+00:00  2022-08-28 15:28:43+00:00  728                 90000

I feel like a groupby and an agg would be involved, but I have no clue how to go about this. Any help would be appreciated! Thanks!

CodePudding user response:

I believe I have found a solution.

Setup

import pandas as pd
from datetime import timedelta

data = {'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY', 
                      4: 'YYYYY'}, 
        'start pos time': {0: '2021-10-26 06:01:12+00:00',
                           1: '2021-10-19 13:49:09+00:00',
                           2: '2021-10-19 13:20:36+00:00',
                           3: '2022-09-10 15:14:07+00:00',
                           4: '2022-08-28 15:16:35+00:00'},
        'end pos time': {0: '2021-10-26 06:25:06+00:00',
                         1: '2021-10-19 13:59:29+00:00',
                         2: '2021-10-19 13:26:40+00:00',
                         3: '2022-09-10 15:29:39+00:00',
                         4: '2022-08-28 15:28:43+00:00'},
        'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728}, 
        'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
        }

df = pd.DataFrame(data)

Assumptions:

  • All rows with the same value in col vehicleID are consecutive (each group forms a contiguous block).
  • Within each vehicleID group, the timestamps in col start pos time are sorted in descending order.

Problem

Within each group in col vehicleID, if a trip's start pos time is earlier than the end pos time of the chronologically previous trip (which sits in the next row, given the descending sort), or less than 30 minutes later, then those rows should become a single row, with min for start pos time, max for end pos time, and sum for both duration (seconds) and meters travelled.


Solution

# if still needed, change date time strings into timestamps
df[['start pos time','end pos time']] = df[['start pos time','end pos time']].\
    apply(lambda x: pd.to_datetime(x, infer_datetime_format=True))

# check (end time + timedelta 29m 59s) < (start time of the row above)
cond1 = (df.loc[:,'end pos time'] + timedelta(minutes=29, seconds=59))\
    .lt(df.loc[:,'start pos time'].shift(1))

# check `vehicleID` != its own shift (this means a new group is starting)
# i.e. the first row of a new vehicle should always get `True`
cond2 = (df.loc[:,'vehicleID'] != df.loc[:,'vehicleID'].shift(1))

# cumulative sum of the OR of both conditions yields the group labels
cond = (cond1 | cond2).cumsum()

# apply groupby on ['vehicleID' & cond] and aggregate appropriate functions
# (adding vehicleID is now unnecessary, but this keeps the col in the data)
res = df.groupby(['vehicleID', cond], as_index=False).agg(
    {'start pos time':'min',
     'end pos time':'max',
     'duration (seconds)':'sum',
     'meters travelled':'sum'}
    )

print(res)

  vehicleID            start pos time              end pos time  \
0     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00
1     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00
2     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00
3     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00

   duration (seconds)  meters travelled
0                1434              2000
1                 984             75000
2                 932              8000
3                 728             90000

I've run a check: the solution also works if more than two consecutive trips all fall within the defined range.
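
For example, here's a minimal, self-contained sketch (made-up data for a hypothetical vehicle ZZZZZ) with three back-to-back trips that each start within 30 minutes of the previous trip's end; the same cond1/cond2/cumsum logic collapses them into one row:

import pandas as pd
from datetime import timedelta

# made-up data: three trips for one vehicle, sorted descending by start time,
# each starting less than 30 minutes after the previous trip ended
chk = pd.DataFrame({
    'vehicleID': ['ZZZZZ', 'ZZZZZ', 'ZZZZZ'],
    'start pos time': pd.to_datetime(['2021-10-19 15:00:00+00:00',
                                      '2021-10-19 14:00:00+00:00',
                                      '2021-10-19 13:00:00+00:00']),
    'end pos time': pd.to_datetime(['2021-10-19 15:30:00+00:00',
                                    '2021-10-19 14:45:00+00:00',
                                    '2021-10-19 13:50:00+00:00']),
    'duration (seconds)': [1800, 2700, 3000],
    'meters travelled': [3000, 4000, 5000],
})

cond1 = (chk['end pos time'] + timedelta(minutes=29, seconds=59)).lt(chk['start pos time'].shift(1))
cond2 = chk['vehicleID'] != chk['vehicleID'].shift(1)

print(chk.groupby(['vehicleID', (cond1 | cond2).cumsum()], as_index=False).agg(
    {'start pos time': 'min', 'end pos time': 'max',
     'duration (seconds)': 'sum', 'meters travelled': 'sum'}))
# expected: a single row spanning 13:00:00 to 15:30:00 with duration 7500
# seconds and 12000 meters travelled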

CodePudding user response:

There is probably a more efficient way to code this up, but something like this should work (the new_df has what you're looking for):

NOTE: the code below needs the start and end times in datetime format, so they are converted right after building the dataframe


df = pd.DataFrame({'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
                      4: 'YYYYY'},
        'start pos time': {0: '2021-10-26 06:01:12+00:00',
                           1: '2021-10-19 13:49:09+00:00',
                           2: '2021-10-19 13:20:36+00:00',
                           3: '2022-09-10 15:14:07+00:00',
                           4: '2022-08-28 15:16:35+00:00'},
        'end pos time': {0: '2021-10-26 06:25:06+00:00',
                         1: '2021-10-19 13:59:29+00:00',
                         2: '2021-10-19 13:26:40+00:00',
                         3: '2022-09-10 15:29:39+00:00',
                         4: '2022-08-28 15:28:43+00:00'},
        'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
        'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
        })

# convert the timestamp strings to datetimes (see the note above)
df[['start pos time', 'end pos time']] = df[['start pos time', 'end pos time']].apply(pd.to_datetime)

# sort dataframe by ID and then start time of trip
df = df.sort_values(by=['vehicleID', 'start pos time'])

# create a new column with the end time of the previous ride
df.loc[:, 'prev end'] = df['end pos time'].shift(1)

# create a new column with the difference between the start time of the current trip and the end time of the prior one
df.loc[:, 'diff'] = df.loc[:, 'start pos time'] - df.loc[:, 'prev end']


# helper function to convert difference between datetime objects to seconds
def get_total_seconds(datetime_delta):
    return datetime_delta.total_seconds()


# convert difference column to seconds
df.loc[:, 'diff'] = df['diff'].apply(get_total_seconds)

# where vehicle IDs are the same and the difference between the start time of the current trip and end time of the
# prior trip is less than or equal to 30 minutes, change the start time of the current trip to the start time of the 
# prior one
df.loc[((df['vehicleID'] == df['vehicleID'].shift(1)) & (df['diff'] <= 30*60)), 'start pos time'] = df['start pos time'].shift(1)

# create a new dataframe, grouped by vehicle ID and trip start time, using the maximum end time for each group
new_df = df.groupby(['vehicleID', 'start pos time'], as_index=False).agg({'end pos time':'max',
                                                                          'duration (seconds)':'sum',
                                                                          'meters travelled':'sum'})
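
With the timestamps converted to datetimes as above, new_df should match the desired output from the question; as a quick sanity check on the sample data:

# the two 2021-10-19 trips for XXXXX should come out as a single row
# with duration 984 seconds and 75000 meters travelled
print(new_df)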

CodePudding user response:

# assumes the columns use underscores (start_pos_time, end_pos_time, meters_travelled),
# already hold datetimes, and rows are sorted by start time descending per vehicle
def func(d):
    # flag trips that start less than 30 minutes after the end of the earlier trip (next row)
    mask = d.start_pos_time.sub(d.end_pos_time.shift(-1)).lt('30m')
    # give flagged trips the earlier trip's start time so they share a group key
    d.loc[mask, 'start_pos_time'] = d.start_pos_time.shift(-1)
    # merge rows that now share a start time
    d = d.groupby('start_pos_time', as_index=False).agg({'end_pos_time': 'max', 'meters_travelled': 'sum'})
    return d

# apply per vehicle, then turn vehicleID back into a regular column
df = df.groupby('vehicleID').apply(func).reset_index('vehicleID').reset_index(drop=True)

# recompute the duration from the merged timestamps (this includes any gap between chained trips)
df['duration_(seconds)'] = (df.end_pos_time - df.start_pos_time).dt.total_seconds()
print(df)

Output:

  vehicleID            start_pos_time              end_pos_time  meters_travelled  duration_(seconds)
0     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00             75000              2333.0
1     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00              2000              1434.0
2     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00             90000               728.0
3     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00              8000               932.0