Home > Mobile >  Merging segments from the same trips into a single trip for analysis
Merging segments from the same trips into a single trip for analysis

Time:06-14

In the MWE below, I show my attempt to line-plot trips (from my df aggregated per month):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

d = {'user': [62, 62, 154, 154, 154, 73, 73, 73, 73],
     'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2007-05-01 10:10:00',
         '2007-05-01 10:36:00','2007-05-01 15:30:00','2008-05-19 16:25:54',
          '2008-05-21 02:21:37','2008-05-22 01:30:09','2008-05-29 01:55:59'],
    'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11', '2007-05-01 10:36:00',
         '2007-05-01 11:00:00','2007-05-01 15:55:00','2008-05-19 16:29:22',
         '2008-05-21 02:25:04','2008-05-22 01:33:51','2008-05-29 01:59:25'],
 'mode': ['bus','walk','bus','walk','taxi','walk','walk','walk','walk'],}

df = pd.DataFrame(d)
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

df['Date'] = df['start'].dt.strftime('%Y-%m')

df.sort_values(['Date' ], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# aggregate monthly trips
df = df.groupby(['Date']).size().reset_index().rename(
    columns={0:'Trips'})
df
    Date   Trips
0   2007-05 3
1   2008-05 4
2   2008-06 2

# the I can plot trips like so:
sns.lineplot(data=df, x = df['Date'], y='Trips')

I realised in my df, some trips contains jump (maybe due to data log), so they should be merged into single trip before aggregation. In the given df example above (before grouping).

df
   user        start                 end       mode     Date
0   154 2007-05-01 10:10:00 2007-05-01 10:36:00 bus     2007-05
1   154 2007-05-01 10:36:00 2007-05-01 11:00:00 walk    2007-05
2   154 2007-05-01 15:30:00 2007-05-01 15:55:00 taxi    2007-05
3   73  2008-05-19 16:25:54 2008-05-19 16:29:22 walk    2008-05
4   73  2008-05-21 02:21:37 2008-05-21 02:25:04 walk    2008-05
5   73  2008-05-22 01:30:09 2008-05-22 01:33:51 walk    2008-05
6   73  2008-05-29 01:55:59 2008-05-29 01:59:25 walk    2008-05
7   62  2008-06-20 04:21:40 2008-06-20 05:33:46 bus     2008-06
8   62  2008-06-20 05:40:31 2008-06-20 05:53:11 walk    2008-06

User 154 does undertake 2-trips, not 3. First started at 10:10:00 on bus then changes travel mode to at 10:36:00 to walk. Also user 62 does only a trip, started at 04:21:40 - 05:33:46 on a bus, then stopped, waited for about 6-minutes then continued to at about 05:40:31.

Question

I want to merge such trips into a single trip so that if the time interval between the end of a user's trip to beginning of the next is less than 30 minutes, that should be considered a single trip. Then I can plot,

In the above MWE, the aggregated trips are "over-estimated", the expected final df should be:

    Date     Trips
0   2007-05     2
1   2008-05     4
2   2008-06     1

CodePudding user response:

You can use a custom function:

offset = pd.DateOffset(minutes=30)
merge_trip = lambda x: x['start'].ge(x['end'].shift()   offset).cumsum().add(1)

df['Trips'] = df.groupby('user').apply(merge_trip).droplevel('user')

out = df.groupby('Date', as_index=False)['Trips'].max()

Output:

>>> out
      Date  Trips
0  2007-05      2
1  2008-05      4
2  2008-06      1

>>> df
   user               start                 end  mode     Date  Trips
0   154 2007-05-01 10:10:00 2007-05-01 10:36:00   bus  2007-05      1
1   154 2007-05-01 10:36:00 2007-05-01 11:00:00  walk  2007-05      1
2   154 2007-05-01 15:30:00 2007-05-01 15:55:00  taxi  2007-05      2
3    73 2008-05-19 16:25:54 2008-05-19 16:29:22  walk  2008-05      1
4    73 2008-05-21 02:21:37 2008-05-21 02:25:04  walk  2008-05      2
5    73 2008-05-22 01:30:09 2008-05-22 01:33:51  walk  2008-05      3
6    73 2008-05-29 01:55:59 2008-05-29 01:59:25  walk  2008-05      4
7    62 2008-06-20 04:21:40 2008-06-20 05:33:46   bus  2008-06      1
8    62 2008-06-20 05:40:31 2008-06-20 05:53:11  walk  2008-06      1

CodePudding user response:

As an alternative you can use piso, specifically the bridge method which is used to fill gaps in an interval index which are below a threshold

Define a function which takes a dataframe, creates an interval index, then applies piso.bridge and returns a pandas.Series of the start times.

import pandas as pd
import piso

def bridge(df):
    ii = pd.IntervalIndex.from_arrays(df["start"], df["end"])
    return piso.bridge(ii, pd.Timedelta("30m")).left.to_series()

Then group by user and apply this function, convert to monthly periods, and count values

df.groupby("user").apply(bridge).dt.to_period("M").value_counts()

This gives a Series, indexed by the month periods

2008-05    4
2007-05    2
2008-06    1
Freq: M, dtype: int64
  • Related