The Geolife dataset contains GPS trajectories of users, logged as they move. Thanks to Sina Dabiri for providing a repository of the preprocessed data. I work with his preprocessed data and have created a dataframe of GPS logs for the 69 users available.
Below is a very small extract of the data for 3 users to illustrate my question.
import pandas as pd
data = {'user': [10,10,10,10,10,10,10,10,21,21,21,54,54,54,54,54,54,54,54,54],
'lat': [39.921683,39.921583,39.92156,39.13622,39.136233,39.136241,39.136246,39.136251,42.171678,42.172055,
42.172243,39.16008333,39.15823333,39.1569,39.156,39.15403333,39.15346667,39.15273333,39.14811667,39.14753333],
'lon': [116.472342,116.472315,116.47229,117.218033,117.218046,117.218066,117.218166,117.218186,123.676778,123.677365,
123.677657,117.1994167,117.2002333,117.2007667,117.2012167,117.202,117.20225,117.20255,117.2043167,117.2045833],
'date': ['2009-03-21 13:30:35','2009-03-21 13:33:38','2009-03-21 13:34:40','2009-03-21 15:30:12','2009-03-21 15:32:35',
'2009-03-21 15:38:36','2009-03-21 15:44:42','2009-03-21 15:48:43','2007-04-30 16:00:20', '2007-04-30 16:05:22',
'2007-04-30 16:08:23','2007-04-30 11:47:38','2007-04-30 11:48:07','2007-04-30 11:48:27','2007-04-30 12:04:39',
'2007-04-30 12:04:07','2007-04-30 12:04:32','2007-04-30 12:19:41','2007-04-30 12:20:08','2007-04-30 12:20:21']
}
And the dataframe:
df = pd.DataFrame(data)
df
user lat lon date
0 10 39.921683 116.472342 2009-03-21 13:30:35
1 10 39.921583 116.472315 2009-03-21 13:33:38
2 10 39.921560 116.472290 2009-03-21 13:34:40
3 10 39.136220 117.218033 2009-03-21 15:30:12
4 10 39.136233 117.218046 2009-03-21 15:32:35
5 10 39.136241 117.218066 2009-03-21 15:38:36
6 10 39.136246 117.218166 2009-03-21 15:44:42
7 10 39.136251 117.218186 2009-03-21 15:48:43
8 21 42.171678 123.676778 2007-04-30 16:00:20
9 21 42.172055 123.677365 2007-04-30 16:05:22
10 21 42.172243 123.677657 2007-04-30 16:08:23
11 54 39.160083 117.199417 2007-04-30 11:47:38
12 54 39.158233 117.200233 2007-04-30 11:48:07
13 54 39.156900 117.200767 2007-04-30 11:48:27
14 54 39.156000 117.201217 2007-04-30 12:04:39
15 54 39.154033 117.202000 2007-04-30 12:04:07
16 54 39.153467 117.202250 2007-04-30 12:04:32
17 54 39.152733 117.202550 2007-04-30 12:19:41
18 54 39.148117 117.204317 2007-04-30 12:20:08
19 54 39.147533 117.204583 2007-04-30 12:20:21
My Question:
I want to calculate how long users travelled in a particular period.
For example:
- Total time users travelled in March 2009: Only user 10 travelled in this month, on 2009-03-21 from 13:30:35. After 13:34:40 there is a long jump to 15:30:12. Since this gap is more than 30 minutes, we consider it another trip. So user 10 has 2 trips recorded that month: the first for about 5 minutes, the second for about 19 minutes. The total trip duration for this month is therefore 5 + 19 = 24 minutes.
- In April 2007, users 21 and 54 recorded trips on the same day. User 21 started at 16:00:20 and travelled for about 8 minutes. User 54 started at 11:47:38, and after about 1 minute we see a jump to 12:04:39. That gap is less than 30 minutes, so we consider it a single trip, which lasted about 33 minutes. The users' trip time in that month is therefore 8 + 33 = 41 minutes.
- Sometimes I would also want to determine trip time over a range of months, say from February 2008 to March 2009.
How do I perform this sort of analysis?
Any pointers, using the small dataset provided above, would be appreciated.
CodePudding user response:
This code isn't the most efficient, but you can test whether it does what you need:
df['date'] = pd.to_datetime(df['date'])
# seconds between consecutive rows within each (user, calendar month) group;
# note that dt.month alone does not distinguish between years
duration = (df.groupby(['user', df['date'].dt.month]).
            apply(lambda x: (x['date']-x['date'].shift()).dt.seconds).
            rename('duration').
            to_frame())
res = (duration.mask(duration>1800,0).  # gaps over 1800 s (30 min) separate trips, so they count as 0
       groupby(level=[0,1]).
       sum().
       truediv(60).  # converting seconds to minutes
       rename_axis(index={'date':'month'}))
print(res)
'''
            duration
user month
10   3         22.60
21   4          8.05
54   4         33.25
'''
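A variation on the same idea, offered only as a sketch: it sorts each user's rows by time first, measures gaps with total_seconds() (so gaps longer than a day are not wrapped), groups by year-month rather than month number, and accepts an optional date range for cases like February 2008 to March 2009. The function name trip_minutes and its parameters start, end and gap are made up for this illustration and do not come from the question or from pandas. Because rows 14 to 16 of the sample are out of chronological order, sorting gives user 54 a slightly different total than the 33.25 shown above.

```python
import pandas as pd

def trip_minutes(df, start=None, end=None, gap="30min"):
    """Travel minutes per user and year-month (hypothetical helper).

    Gaps longer than `gap` are treated as breaks between trips and add no time.
    `start` is inclusive and `end` is exclusive; both are optional.
    """
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    if start is not None:
        out = out[out["date"] >= pd.Timestamp(start)]
    if end is not None:
        out = out[out["date"] < pd.Timestamp(end)]
    out = out.sort_values(["user", "date"])

    # Time since the same user's previous row (NaT for each user's first row)
    delta = out.groupby("user")["date"].diff()
    # Keep gaps up to the threshold; NaT and longer gaps become zero
    travel = delta.where(delta <= pd.Timedelta(gap), pd.Timedelta(0))

    # Each gap is attributed to the month of the later row
    month = out["date"].dt.to_period("M").rename("month")
    minutes = travel.dt.total_seconds().div(60)
    return minutes.groupby([out["user"], month]).sum().rename("minutes")

# First five rows of user 10 from the question
sample = pd.DataFrame({
    "user": [10, 10, 10, 10, 10],
    "date": ["2009-03-21 13:30:35", "2009-03-21 13:33:38", "2009-03-21 13:34:40",
             "2009-03-21 15:30:12", "2009-03-21 15:32:35"],
})
result = trip_minutes(sample)
print(result)
```

For a range such as February 2008 to March 2009, a call like trip_minutes(df, start="2008-02-01", end="2009-04-01") should restrict the rows before the gaps are computed; summing the result over the user level would then give one figure per month.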