My dataset df looks like this:
main_id time day lat long
1 2019-05-31 1 53.5501667 9.9716466
1 2019-05-31 1 53.6101545 9.9568781
1 2019-05-30 1 53.5501309 9.9716300
1 2019-05-30 2 53.5501309 9.9716300
1 2019-05-30 2 53.4561309 9.1246300
2 2019-06-31 4 53.5501667 9.9716466
2 2019-06-31 4 53.6101545 9.9568781
I want to find the total distance covered by each main_id for each day. To calculate the distance between two sets of coordinates, I can use this function:
import geopy.distance

def find_kms(coords_1, coords_2):
    return geopy.distance.geodesic(coords_1, coords_2).km
but I am not sure how to sum it when grouping by main_id and day. The end result could be a new df like this:
main_id day total_dist time
1 1 ... 2019-05-31
1 2 .... 2019-05-31
2 4 .... 2019-05-31
where the derived time is any value (or the first value) from the time column of the respective main_id and day group.
total_dist calculation:
For example, for the first row, with main_id == 1 and day == 1, the total_dist would be calculated like this:
find_kms((53.5501667, 9.9716466), (53.6101545, 9.9568781)) + find_kms((53.6101545, 9.9568781), (53.5501309, 9.9716300))
CodePudding user response:
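Note that your function is not vectorized, which makes this harder than it needs to be. One way is to pair each row with the previous row of its group and apply find_kms row by row: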
(df.assign(dist=df.join(df.groupby(['main_id', 'day'])[['lat', 'long']]
                          .shift(), rsuffix='1')
                  .bfill()
                  .reset_index()
                  .groupby('index')
                  .apply(lambda x: find_kms(x[['lat', 'long']].values,
                                            x[['lat1', 'long1']].values)))
   .groupby(['main_id', 'day'])['dist'].sum().reset_index())
main_id day dist
0 1 1 13.499279
1 1 2 57.167034
2 2 4 6.747748
Another option is to use reduce:
from functools import reduce
import pandas as pd

def total_dist(x):
    coords = x[['lat', 'long']].values
    # accumulator is (running total, previous point)
    lam = lambda x, y: (find_kms(x[1], y) + x[0], y)
    dist = reduce(lam, coords, (0, coords[0]))[0]
    return pd.Series({'dist': dist})

df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
main_id day dist
0 1 1 13.499351
1 1 2 57.167033
2 2 4 6.747775
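To make the reduce step easier to follow outside pandas, here is a minimal sketch of the same accumulator idea on a plain list of points. The names haversine_km and path_length_km are hypothetical, and the haversine great-circle formula is only a standard-library stand-in for find_kms, so its results differ slightly from geopy's geodesic distance.

```python
from functools import reduce
from math import asin, cos, radians, sin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees.

    Assumption: a spherical Earth with mean radius 6371.0088 km.
    """
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * asin(sqrt(a))

def path_length_km(coords, dist=haversine_km):
    """Sum the distances between consecutive points using reduce."""
    if not coords:
        return 0.0
    # The accumulator is (running_total, previous_point). Seeding it with the
    # first point makes the first step add a zero-length leg.
    step = lambda acc, point: (acc[0] + dist(acc[1], point), point)
    return reduce(step, coords, (0.0, coords[0]))[0]

# The three points of main_id == 1, day == 1 from the question.
coords = [(53.5501667, 9.9716466),
          (53.6101545, 9.9568781),
          (53.5501309, 9.9716300)]
total = path_length_km(coords)
```

Passing dist=find_kms instead of the default should give the geodesic total, since the accumulator does not depend on which distance function is used.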
EDIT:
If count is needed:
(pd.DataFrame({'count': df.groupby(['main_id', 'day']).main_id.count()})
   .join(df.groupby(['main_id', 'day']).apply(total_dist)))
Out[157]:
count dist
main_id day
1 1 3 13.499351
2 2 57.167033
2 4 2 6.747775
CodePudding user response:
Another approach is to convert the lat/lon values to UTM coordinates and use groupby(), without the find_kms function:
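import pandas as pd
import numpy as np
import utm

# from_latlon returns (easting, northing, zone_number, zone_letter), in metres
u = utm.from_latlon(df.lat.values, df.long.values)
dx = pd.Series(u[0], index=df.index).diff().fillna(0)  # u[0] is easting (x)
dy = pd.Series(u[1], index=df.index).diff().fillna(0)  # u[1] is northing (y)
# diff() runs over the whole frame here, not per (main_id, day) group
df['dist'] = np.sqrt(dx * dx + dy * dy) / 1000  # metres to km
df1 = df.groupby(['main_id', 'day'])[['dist', 'day']].sum()  # sums 'day', it does not count rows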
Out:
dist count
main_id day
1 1 13.494554 3
2 57.145276 4
2 4 63.892451 8
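The total for main_id 2 above (63.89) is far from the roughly 6.75 in the other answers, which is consistent with the differences being taken across the whole frame rather than within each group. The following is a rough sketch of a per-group planar calculation using only the standard library. It swaps UTM for an equirectangular approximation, which is an assumption made here so the example needs no extra packages, and the names planar_km and total_per_group are hypothetical.

```python
from collections import defaultdict
from math import cos, hypot, radians

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def planar_km(p, q):
    """Equirectangular approximation between two (lat, lon) pairs in degrees.

    Small lat/lon differences are treated as flat x/y offsets, which is
    reasonable only for points that are close together.
    """
    mean_lat = radians((p[0] + q[0]) / 2)
    dx = radians(q[1] - p[1]) * cos(mean_lat) * EARTH_RADIUS_KM
    dy = radians(q[0] - p[0]) * EARTH_RADIUS_KM
    return hypot(dx, dy)

def total_per_group(rows):
    """rows: iterable of (main_id, day, lat, lon) tuples in travel order."""
    totals = defaultdict(float)
    previous = {}  # last point seen for each (main_id, day)
    for main_id, day, lat, lon in rows:
        key = (main_id, day)
        if key in previous:
            # Only add a leg when the previous point is in the same group.
            totals[key] += planar_km(previous[key], (lat, lon))
        else:
            totals[key] += 0.0  # register the group with a zero total
        previous[key] = (lat, lon)
    return dict(totals)

# The rows from the question, without the time column.
rows = [
    (1, 1, 53.5501667, 9.9716466),
    (1, 1, 53.6101545, 9.9568781),
    (1, 1, 53.5501309, 9.9716300),
    (1, 2, 53.5501309, 9.9716300),
    (1, 2, 53.4561309, 9.1246300),
    (2, 4, 53.5501667, 9.9716466),
    (2, 4, 53.6101545, 9.9568781),
]
totals = total_per_group(rows)
```

Because the first point of each group has no predecessor in that group, the jump from the previous group's last row is never added to the total.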