I need to add a column with changes ow worker coordinates through different stages. We have a DataFrame:
import pandas as pd
from geopy.distance import geodesic as GD
d = {'user_id': [26, 26, 26, 26, 26, 26, 9, 9, 9, 9],
'worker_latitude': [55.114410, 55.114459, 55.114379,
55.114462, 55.114372, 55.114389, 65.774064, 65.731034, 65.731034, 65.774057],
'worker_longitude': [38.927155, 38.927114, 38.927101, 38.927156,
38.927258, 38.927120, 37.532380, 37.611746, 37.611746, 37.532346],
'change':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
which looks like:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Then I need to count difference between person previous and current stage. So I use a function:
for group in df.groupby(by='user_id'):
group[1].reset_index(inplace=True,drop=True)
for i in range(1,len(group[1])):
first_xy=(group[1]['worker_latitude'][i-1],group[1]['worker_longitude'][i-1])
second_xy=(group[1]['worker_latitude'][i],group[1]['worker_longitude'][i])
print((round((GD(first_xy, second_xy).km),6)))
group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)
And then I get:
6.021576
0.0
6.021896
0.00605
0.008945
0.009884
0.011948
0.009007
display(df)
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Which means that values are counted correctly, but for some reason they don't fit into 'change' column. What can be done?
CodePudding user response:
It doesn't works because you're accessing a copy of your DataFrame and trying to assign value to it.
However, it seems instead of iterating over the DataFrame inside groupby
, it seems more intuitive to use groupby
shift
to get the first_xy
s first; then apply
a custom function that applies GD between first_xy
and second_xy
to each row:
def func(x):
if x.notna().all():
first_xy = (x['prev_lat'], x['prev_long'])
second_xy = (x['worker_latitude'], x['worker_longitude'])
return round((GD(first_xy, second_xy).km), 6)
else:
return float('nan')
g = df.groupby('user_id')
df['prev_lat'] = g['worker_latitude'].shift()
df['prev_long'] = g['worker_longitude'].shift()
df['change'] = df.apply(func, axis=1)
df = df.drop(columns=['prev_lat','prev_long'])
Output:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 NaN
1 26 55.114459 38.927114 0.006050
2 26 55.114379 38.927101 0.008945
3 26 55.114462 38.927156 0.009884
4 26 55.114372 38.927258 0.011948
5 26 55.114389 38.927120 0.009007
6 9 65.774064 37.532380 NaN
7 9 65.731034 37.611746 6.021576
8 9 65.731034 37.611746 0.000000
9 9 65.774057 37.532346 6.021896
CodePudding user response:
I think that the problem could be the following line:
group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)
You are updating the group
variable, and you should update df
variable instead. My suggestion for you fixing this property is:
df.loc[i, "change"] = round((GD(first_xy, second_xy).km),6)
Considering that i
is the row number that you want update, and "change"
is the column name.