distance for each trip-CodePudding

I want to calculate the distance between two gps coordinates (the first and the last one for each tripId), to get the distance for each trip my dataframe looks like that

   tripId   latitude    longitude   timestamp
0   1817603 53.155273   8.207176    2021-05-24 00:29:22
1   1817603 53.155271   8.206898    2021-05-24 00:29:38
2   1817603 53.155213   8.206314    2021-05-24 00:29:44
3   1817603 53.155135   8.206429    2021-05-24 00:29:50
4   1817603 53.154950   8.206565    2021-05-24 00:29:56
... ... ... ... ...
195 1817888 53.092805   8.212095    2021-05-24 08:27:54
196 1817888 53.093024   8.211756    2021-05-24 08:27:59
197 1817888 53.093305   8.211383    2021-05-24 08:28:05
198 1817888 53.093594   8.211026    2021-05-24 08:28:10
199 1817888 53.093853   8.210708    2021-05-24 08:28:15

i did that for each step using s = pd.Series(haversine_vector(df, df.shift(),Unit.KILOMETERS), index=df.index, name='distance_K') but i need to know the distance for the whole trip for each id I have used this as a test and it works, but i need to know the exact duration for each trip (the final duration)

for i in range(1,df.shape[0]-1):
    if df['tripId'][i]==df['tripId'][i 1]:
        df['distance'][i]=df['distance'][i-1] df['distance_K'][i]
    else:
        df['distance'][i]=df['distance_K'][i]

CodePudding user response：

Use groupby_apply to compute haversine distance for each trip:

# Inspired by https://stackoverflow.com/a/4913653/15239951
def haversine_series(sr):
    lon1 = sr['longitude']
    lat1 = sr['latitude']
    lon2 = sr['longitude'].shift(fill_value=sr['longitude'].iloc[0])
    lat2 = sr['latitude'].shift(fill_value=sr['latitude'].iloc[0])
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2   np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c
    return km

df['distance_K'] = df.groupby('tripId').apply(haversine_series).droplevel(0)

Note: I suppose your dataframe is already sorted by timestamp column.

At this point, your dataframe looks like:

>>> df
      tripId   latitude  longitude            timestamp  distance_K
0    1817603  53.155273   8.207176  2021-05-24 00:29:22    0.000000
1    1817603  53.155271   8.206898  2021-05-24 00:29:38    0.018538
2    1817603  53.155213   8.206314  2021-05-24 00:29:44    0.039470
3    1817603  53.155135   8.206429  2021-05-24 00:29:50    0.011577
4    1817603  53.154950   8.206565  2021-05-24 00:29:56    0.022481
195  1817888  53.092805   8.212095  2021-05-24 08:27:54    0.000000
196  1817888  53.093024   8.211756  2021-05-24 08:27:59    0.033248
197  1817888  53.093305   8.211383  2021-05-24 08:28:05    0.039958
198  1817888  53.093594   8.211026  2021-05-24 08:28:10    0.040012
199  1817888  53.093853   8.210708  2021-05-24 08:28:15    0.035781

Now the total distance and time for each trip is easy to get with groupby_agg:

>>> df.groupby('tripId') \
      .agg(total_distance=('distance_K', 'sum'), 
           total_time=('timestamp', lambda x: x.max()-x.min())) \
      .reset_index()

    tripId  total_distance      total_time
0  1817603        0.092066 0 days 00:00:34
1  1817888        0.148999 0 days 00:00:21

CodePudding user response：

You can use

from haversine import haversine_vector

df = df.groupby('tripId').apply(
    lambda g: g.assign(distance=lambda g: [0, *haversine_vector(
        g.iloc[:-1][['latitude', 'longitude']].values,
        g.iloc[1:][['latitude', 'longitude']].values,
    )])
).droplevel(0)
df
#     tripId   latitude  longitude                timestamp  distance
# 0  1817603  53.155273   8.207176      2021-05-24 00:29:22  0.000000
# 1  1817603  53.155271   8.206898      2021-05-24 00:29:38  0.018538
# 2  1817603  53.155213   8.206314      2021-05-24 00:29:44  0.039470
# 3  1817603  53.155135   8.206429      2021-05-24 00:29:50  0.011577
# 4  1817603  53.154950   8.206565      2021-05-24 00:29:56  0.022481
# 5  1817888  53.092805   8.212095      2021-05-24 08:27:54  0.000000
# 6  1817888  53.093024   8.211756      2021-05-24 08:27:59  0.033248
# 7  1817888  53.093305   8.211383      2021-05-24 08:28:05  0.039958
# 8  1817888  53.093594   8.211026      2021-05-24 08:28:10  0.040012
# 9  1817888  53.093853   8.210708      2021-05-24 08:28:15  0.035781

and get the total time and distance

df.groupby('tripId').agg(
    {
        'timestamp': lambda g: g.iloc[-1] - g.iloc[0],
        'distance':'sum'
    }
)
#               timestamp  distance
# tripId                           
# 1817603 0 days 00:00:34  0.092066
# 1817888 0 days 00:00:21  0.148999