I have a dataframe containing users' trips and mode of travel used to complish the trips. In the original trips file, some trips are actually a single trip but split into 2 trips due to user changing travel mode or a stop for some minutes. I want to make analysis number of trips covered for certain period, but clearly wouldn't like counting a single trip twice, as such I needed a way to identify such trips and only count them once.
For example, consider this example dataframe (extract from original file):
df = pd.DataFrame({
'user': [62,62,62,62,62,62,62,62,62],
'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2008-06-21 04:23:39',
'2008-06-21 04:47:53','2008-07-13 05:45:27','2008-07-13 06:47:57',
'2008-07-14 09:08:06','2008-07-14 13:12:46','2008-07-14 13:24:23'],
'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11','2008-06-21 04:35:15',
'2008-06-21 05:43:20','2008-07-13 06:02:54','2008-07-13 07:20:19',
'2008-07-14 09:17:15','2008-07-14 13:20:10','2008-07-14 13:30:43'],
'mode': ['bus','walk','bus','bus','bus','bus','taxi','bus','bus']
})
So that:
df
user start end mode
0 62 2008-06-20 04:21:40 2008-06-20 05:33:46 bus
1 62 2008-06-20 05:40:31 2008-06-20 05:53:11 walk
2 62 2008-06-21 04:23:39 2008-06-21 04:35:15 bus
3 62 2008-06-21 04:47:53 2008-06-21 05:43:20 bus
4 62 2008-07-13 05:45:27 2008-07-13 06:02:54 bus
5 62 2008-07-13 06:47:57 2008-07-13 07:20:19 bus
6 62 2008-07-14 09:08:06 2008-07-14 09:17:15 taxi
7 62 2008-07-14 13:12:46 2008-07-14 13:20:10 bus
8 62 2008-07-14 13:24:23 2008-07-14 13:30:43 bus
We can see from this particular user's trips that:
- the first trip ended at
05.33
and the second began at05.40
. This is a single trip only that the user stayed for about 7-minutes and then changed travel modebus -> walk
. - the third trip ended at
04.35
and the fouth began at04.47
. We consider this a single trip since the time in-between is less than 20-minutes. Perhaps the user experienced data logging issued or stayed for about 12-minutes then continued. - the fifth trip ended at
06:02
and the sixth began at06:47
. We consider these as 2-seperate trips since the time interval is beyond 20-minutes. - the 8th trip ended at
13.20
and the 9th began at13.24
. This is a single trip as well.
In the given example, this user covers 6-distinct trips
, which could otherwise be wrongly counted as 9-trips
. I have over 50 distinct users covering several trips, so this could lead to serious error in the trip analysis.
We consider 2 consecutive user's trips as single trip, if the time interval is less than 20 minutes.
I am not sure how to do such analyses of distinct trips using pandas.
CodePudding user response:
You can shift the end per group and calculate the difference to the next start. Then use it to compute groups:
# ensure datetime
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)
# sort by user/start
df2 = df.sort_values(by=['user', 'start', 'end'])
# if end is within 20 min of next start, then keep in same group
group = df2['start'].sub(df2.groupby('user')['end'].shift()).gt('20 min').cumsum()
df['group'] = group
output:
user start end mode group
0 62 2008-06-20 04:21:40 2008-06-20 05:33:46 bus 0
1 62 2008-06-20 05:40:31 2008-06-20 05:53:11 walk 0
2 62 2008-06-21 04:23:39 2008-06-21 04:35:15 bus 1
3 62 2008-06-21 04:47:53 2008-06-21 05:43:20 bus 1
4 62 2008-07-13 05:45:27 2008-07-13 06:02:54 bus 2
5 62 2008-07-13 06:47:57 2008-07-13 07:20:19 bus 3
6 62 2008-07-14 09:08:06 2008-07-14 09:17:15 taxi 4
7 62 2008-07-14 13:12:46 2008-07-14 13:20:10 bus 5
8 62 2008-07-14 13:24:23 2008-07-14 13:30:43 bus 5
Aggregated data:
(df
.groupby(group)
.agg({'user': 'first', 'start': 'first', 'end': 'max',
'mode': lambda x: ' '.join(set(x))})
)
output:
user start end mode
0 62 2008-06-20 04:21:40 2008-06-20 05:53:11 bus walk
1 62 2008-06-21 04:23:39 2008-06-21 05:43:20 bus
2 62 2008-07-13 05:45:27 2008-07-13 06:02:54 bus
3 62 2008-07-13 06:47:57 2008-07-13 07:20:19 bus
4 62 2008-07-14 09:08:06 2008-07-14 09:17:15 taxi
5 62 2008-07-14 13:12:46 2008-07-14 13:30:43 bus