In the MWE below, I show my attempt to line-plot trips (from my df
aggregated per month):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
d = {'user': [62, 62, 154, 154, 154, 73, 73, 73, 73],
'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2007-05-01 10:10:00',
'2007-05-01 10:36:00','2007-05-01 15:30:00','2008-05-19 16:25:54',
'2008-05-21 02:21:37','2008-05-22 01:30:09','2008-05-29 01:55:59'],
'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11', '2007-05-01 10:36:00',
'2007-05-01 11:00:00','2007-05-01 15:55:00','2008-05-19 16:29:22',
'2008-05-21 02:25:04','2008-05-22 01:33:51','2008-05-29 01:59:25'],
'mode': ['bus','walk','bus','walk','taxi','walk','walk','walk','walk'],}
df = pd.DataFrame(d)
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)
df['Date'] = df['start'].dt.strftime('%Y-%m')
df.sort_values(['Date' ], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# aggregate monthly trips
df = df.groupby(['Date']).size().reset_index().rename(
columns={0:'Trips'})
df
Date Trips
0 2007-05 3
1 2008-05 4
2 2008-06 2
# the I can plot trips like so:
sns.lineplot(data=df, x = df['Date'], y='Trips')
I realised in my df
, some trips contains jump (maybe due to data log), so they should be merged into single trip before aggregation. In the given df
example above (before grouping).
df
user start end mode Date
0 154 2007-05-01 10:10:00 2007-05-01 10:36:00 bus 2007-05
1 154 2007-05-01 10:36:00 2007-05-01 11:00:00 walk 2007-05
2 154 2007-05-01 15:30:00 2007-05-01 15:55:00 taxi 2007-05
3 73 2008-05-19 16:25:54 2008-05-19 16:29:22 walk 2008-05
4 73 2008-05-21 02:21:37 2008-05-21 02:25:04 walk 2008-05
5 73 2008-05-22 01:30:09 2008-05-22 01:33:51 walk 2008-05
6 73 2008-05-29 01:55:59 2008-05-29 01:59:25 walk 2008-05
7 62 2008-06-20 04:21:40 2008-06-20 05:33:46 bus 2008-06
8 62 2008-06-20 05:40:31 2008-06-20 05:53:11 walk 2008-06
User 154
does undertake 2-trips, not 3. First started at 10:10:00
on bus then changes travel mode to at 10:36:00
to walk
. Also user 62
does only a trip, started at 04:21:40 - 05:33:46
on a bus, then stopped, waited for about 6-minutes then continued to at about 05:40:31
.
Question
I want to merge such trips into a single trip so that if the time interval between the end of a user's trip to beginning of the next is less than 30 minutes, that should be considered a single trip. Then I can plot,
In the above MWE, the aggregated trips are "over-estimated", the expected final df
should be:
Date Trips
0 2007-05 2
1 2008-05 4
2 2008-06 1
CodePudding user response:
You can use a custom function:
offset = pd.DateOffset(minutes=30)
merge_trip = lambda x: x['start'].ge(x['end'].shift() offset).cumsum().add(1)
df['Trips'] = df.groupby('user').apply(merge_trip).droplevel('user')
out = df.groupby('Date', as_index=False)['Trips'].max()
Output:
>>> out
Date Trips
0 2007-05 2
1 2008-05 4
2 2008-06 1
>>> df
user start end mode Date Trips
0 154 2007-05-01 10:10:00 2007-05-01 10:36:00 bus 2007-05 1
1 154 2007-05-01 10:36:00 2007-05-01 11:00:00 walk 2007-05 1
2 154 2007-05-01 15:30:00 2007-05-01 15:55:00 taxi 2007-05 2
3 73 2008-05-19 16:25:54 2008-05-19 16:29:22 walk 2008-05 1
4 73 2008-05-21 02:21:37 2008-05-21 02:25:04 walk 2008-05 2
5 73 2008-05-22 01:30:09 2008-05-22 01:33:51 walk 2008-05 3
6 73 2008-05-29 01:55:59 2008-05-29 01:59:25 walk 2008-05 4
7 62 2008-06-20 04:21:40 2008-06-20 05:33:46 bus 2008-06 1
8 62 2008-06-20 05:40:31 2008-06-20 05:53:11 walk 2008-06 1
CodePudding user response:
As an alternative you can use piso
, specifically the bridge
method which is used to fill gaps in an interval index which are below a threshold
Define a function which takes a dataframe, creates an interval index, then applies piso.bridge
and returns a pandas.Series
of the start times.
import pandas as pd
import piso
def bridge(df):
ii = pd.IntervalIndex.from_arrays(df["start"], df["end"])
return piso.bridge(ii, pd.Timedelta("30m")).left.to_series()
Then group by user and apply this function, convert to monthly periods, and count values
df.groupby("user").apply(bridge).dt.to_period("M").value_counts()
This gives a Series, indexed by the month periods
2008-05 4
2007-05 2
2008-06 1
Freq: M, dtype: int64