Home > database >  Adding new value through groupby operation using for cycle
Adding new value through groupby operation using for cycle

Time:03-19

I need to add a column with changes ow worker coordinates through different stages. We have a DataFrame:

import pandas as pd
from geopy.distance  import geodesic as GD

d = {'user_id': [26, 26, 26, 26, 26, 26, 9, 9, 9, 9],
            'worker_latitude': [55.114410, 55.114459, 55.114379, 
55.114462, 55.114372, 55.114389, 65.774064, 65.731034, 65.731034, 65.774057], 
            'worker_longitude': [38.927155, 38.927114, 38.927101, 38.927156,
 38.927258, 38.927120, 37.532380, 37.611746, 37.611746, 37.532346],
    'change':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

df = pd.DataFrame(data=d)

which looks like:

   user_id  worker_latitude  worker_longitude  change
0       26        55.114410         38.927155       0
1       26        55.114459         38.927114       0
2       26        55.114379         38.927101       0
3       26        55.114462         38.927156       0
4       26        55.114372         38.927258       0
5       26        55.114389         38.927120       0
6        9        65.774064         37.532380       0
7        9        65.731034         37.611746       0
8        9        65.731034         37.611746       0
9        9        65.774057         37.532346       0

Then I need to count difference between person previous and current stage. So I use a function:

for group in df.groupby(by='user_id'):
    group[1].reset_index(inplace=True,drop=True)
    for i in range(1,len(group[1])):
        first_xy=(group[1]['worker_latitude'][i-1],group[1]['worker_longitude'][i-1])
        second_xy=(group[1]['worker_latitude'][i],group[1]['worker_longitude'][i])
        print((round((GD(first_xy, second_xy).km),6)))
        group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)

And then I get:

6.021576
0.0
6.021896
0.00605
0.008945
0.009884
0.011948
0.009007
display(df)
   user_id  worker_latitude  worker_longitude  change
0       26        55.114410         38.927155       0
1       26        55.114459         38.927114       0
2       26        55.114379         38.927101       0
3       26        55.114462         38.927156       0
4       26        55.114372         38.927258       0
5       26        55.114389         38.927120       0
6        9        65.774064         37.532380       0
7        9        65.731034         37.611746       0
8        9        65.731034         37.611746       0
9        9        65.774057         37.532346       0

Which means that values are counted correctly, but for some reason they don't fit into 'change' column. What can be done?

CodePudding user response:

It doesn't works because you're accessing a copy of your DataFrame and trying to assign value to it.

However, it seems instead of iterating over the DataFrame inside groupby, it seems more intuitive to use groupby shift to get the first_xys first; then apply a custom function that applies GD between first_xy and second_xy to each row:

def func(x):
    if x.notna().all():
        first_xy = (x['prev_lat'], x['prev_long'])
        second_xy = (x['worker_latitude'], x['worker_longitude'])
        return round((GD(first_xy, second_xy).km), 6)
    else:
        return float('nan')

g = df.groupby('user_id')
df['prev_lat'] = g['worker_latitude'].shift()
df['prev_long'] = g['worker_longitude'].shift()
df['change'] = df.apply(func, axis=1)
df = df.drop(columns=['prev_lat','prev_long'])

Output:

   user_id  worker_latitude  worker_longitude    change
0       26        55.114410         38.927155       NaN
1       26        55.114459         38.927114  0.006050
2       26        55.114379         38.927101  0.008945
3       26        55.114462         38.927156  0.009884
4       26        55.114372         38.927258  0.011948
5       26        55.114389         38.927120  0.009007
6        9        65.774064         37.532380       NaN
7        9        65.731034         37.611746  6.021576
8        9        65.731034         37.611746  0.000000
9        9        65.774057         37.532346  6.021896

CodePudding user response:

I think that the problem could be the following line:

        group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)

You are updating the group variable, and you should update df variable instead. My suggestion for you fixing this property is:

        df.loc[i, "change"] = round((GD(first_xy, second_xy).km),6)

Considering that i is the row number that you want update, and "change" is the column name.

  • Related