My dataset df looks like this:

main_id    time               day      lat           long
1          2019-05-31         1        53.5501667    9.9716466  
1          2019-05-31         1        53.6101545    9.9568781
1          2019-05-30         1        53.5501309    9.9716300
1          2019-05-30         2        53.5501309    9.9716300
1          2019-05-30         2        53.4561309    9.1246300
2          2019-06-31         4        53.5501667    9.9716466
2          2019-06-31         4        53.6101545    9.9568781

I want to find the total distance covered by each main_id for each day. To calculate the distance between two sets of coordinates, I can use this function:

import geopy.distance

def find_kms(coords_1, coords_2):
    # Geodesic distance in kilometres between two (lat, long) pairs
    return geopy.distance.geodesic(coords_1, coords_2).km

However, I am not sure how to sum these distances when grouping by main_id and day. The end result could be a new df like this:

main_id      day      total_dist     time
1            1        ...            2019-05-31 
1            2        ....           2019-05-31 
2            4        ....           2019-05-31

Here, time can be any value (for example, the first one) from the time column of the respective main_id and day group.

total_dist calculation:

For example, for the first row, for main_id == 1 and day 1, the total_dist would be calculated like this:

find_kms((53.5501667, 9.9716466), (53.6101545, 9.9568781)) + find_kms((53.6101545, 9.9568781), (53.5501309, 9.9716300))
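
The calculation described above (sum the distance of each consecutive pair of points within a main_id/day group) can be sketched without pandas. This is only an illustration: haversine_km is a hypothetical stand-in for find_kms that assumes a spherical Earth of radius 6371 km, so its results differ slightly from geopy's geodesic distance, and total_per_group is a name made up for this sketch.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, long) points in degrees.

    Stand-in for find_kms; assumes a spherical Earth (radius 6371 km).
    """
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# (main_id, day, lat, long) rows taken from the sample data
rows = [
    (1, 1, 53.5501667, 9.9716466),
    (1, 1, 53.6101545, 9.9568781),
    (1, 1, 53.5501309, 9.9716300),
    (1, 2, 53.5501309, 9.9716300),
    (1, 2, 53.4561309, 9.1246300),
    (2, 4, 53.5501667, 9.9716466),
    (2, 4, 53.6101545, 9.9568781),
]

def total_per_group(rows, dist=haversine_km):
    # Collect the points of each (main_id, day) group in their original order
    groups = {}
    for main_id, day, lat, lon in rows:
        groups.setdefault((main_id, day), []).append((lat, lon))
    # Sum the distance over consecutive pairs: (p0, p1), (p1, p2), ...
    return {key: sum(dist(a, b) for a, b in zip(pts, pts[1:]))
            for key, pts in groups.items()}

totals = total_per_group(rows)
for key, km in totals.items():
    print(key, round(km, 3))
```

A group with a single point gets a total of 0, because there are no consecutive pairs to sum.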

CodePudding user response：

Note that your function is not vectorized, which makes this harder than it would otherwise be.

(df.assign(dist = df.join(df.groupby(['main_id', 'day'])[['lat', 'long']].
   shift(), rsuffix='1').bfill().
   reset_index().groupby('index').
   apply(lambda x: find_kms(x[['lat','long']].values, x[['lat1','long1']].values))).
   groupby(['main_id', 'day'])['dist'].sum().reset_index())

  main_id  day       dist
0        1    1  13.499279
1        1    2  57.167034
2        2    4   6.747748

Another option is to use reduce:

from functools import reduce

def total_dist(x):
    coords = x[['lat', 'long']].values
    lam = lambda x, y: (find_kms(x[1], y) + x[0], y)
    dist = reduce(lam, coords, (0,coords[0]))[0]
    return pd.Series({'dist':dist})

df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
 
   main_id  day       dist
0        1    1  13.499351
1        1    2  57.167033
2        2    4   6.747775
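
To make the reduce version easier to follow: the accumulator is a (running_total, previous_point) pair, and each step adds the distance from the previous point to the current one. Below is a minimal sketch of that pattern. The names path_length and abs_diff are hypothetical, and a toy 1-D distance is used instead of find_kms so that the arithmetic can be checked by hand.

```python
from functools import reduce

def path_length(points, dist):
    """Sum dist over consecutive points using a (total, previous) accumulator."""
    if len(points) == 0:
        return 0.0

    def step(acc, current):
        total, previous = acc
        # Add the leg from the previous point to the current one
        return (total + dist(previous, current), current)

    # Start with total 0 and "previous" set to the first point,
    # so the first step adds dist(p0, p0) == 0
    total, _ = reduce(step, points, (0.0, points[0]))
    return total

def abs_diff(a, b):
    # Toy 1-D "distance" used only for this illustration
    return abs(a - b)

points = [0, 3, 1, 6]
length = path_length(points, abs_diff)
print(length)  # 3 + 2 + 5 = 10.0

# Same result written as a sum over consecutive pairs
pairwise = sum(abs_diff(a, b) for a, b in zip(points, points[1:]))
```

The pairwise form is usually easier to read. The reduce form has the advantage of walking the sequence only once without building the shifted copy.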

EDIT:

If count is needed:

(pd.DataFrame({'count':df.groupby(['main_id', 'day']).main_id.count()}).
   join(df.groupby(['main_id', 'day']).apply(total_dist)))
Out[157]: 
             count       dist
main_id day                  
1       1        3  13.499351
        2        2  57.167033
2       4        2   6.747775

CodePudding user response：

Another approach is to convert the lat/lon values to UTM coordinates and use groupby(), without the find_kms function:

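import numpy as np
import pandas as pd
import utm

# utm.from_latlon returns (easting, northing, zone_number, zone_letter)
u = utm.from_latlon(df.lat.values, df.long.values)
# Note: diff() is not grouped here, so the first row of each group
# includes the jump from the last row of the previous group
dx = pd.Series(u[0], index=df.index).diff().fillna(0)  # easting (x) differences, in metres
dy = pd.Series(u[1], index=df.index).diff().fillna(0)  # northing (y) differences, in metres
df['dist'] = np.sqrt(dx * dx + dy * dy) / 1000  # kilometres

df1 = df.groupby(['main_id', 'day'])[['dist', 'day']].sum()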

Out: 
                  dist  count
main_id day                  
1       1    13.494554      3
        2    57.145276      4
2       4    63.892451      8
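
If the differences should not carry over from one group to the next, one possible variation is to take diff() per group. The sketch below assumes planar coordinates in metres are already available (for example UTM easting and northing). The x and y values are made-up illustration data, not the coordinates from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical planar coordinates in metres, chosen so the sums are easy to check
df = pd.DataFrame({
    "main_id": [1, 1, 1, 2, 2],
    "day":     [1, 1, 1, 4, 4],
    "x":       [0.0, 3000.0, 3000.0, 50000.0, 50000.0],
    "y":       [0.0, 4000.0, 10000.0, 0.0, 1000.0],
})

grouped = df.groupby(["main_id", "day"])
# diff() within each group: the first row of every group becomes NaN, then 0
dx = grouped["x"].diff().fillna(0)
dy = grouped["y"].diff().fillna(0)
df["dist"] = np.sqrt(dx ** 2 + dy ** 2) / 1000  # kilometres

result = (
    df.groupby(["main_id", "day"])
      .agg(total_dist=("dist", "sum"), count=("dist", "size"))
      .reset_index()
)
print(result)
```

Here the first group covers 5 km + 6 km and the second 1 km. With an ungrouped diff(), the second group would also include the large jump from the last point of the first group.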