I need to find the 'user_id' of users standing close to each other. We have the following data:
import pandas as pd
d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621,
                         55.114462, 55.6622, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                          56.6622, 39.018]}
df = pd.DataFrame(data=d)
df
   user_id  worker_latitude  worker_longitude
0       11       -34.620900        -58.374200
1       24        -2.757200         52.387900
2      101        55.662100         56.662100
3      214        55.114462         38.927156
4      302        55.662200         56.662200
5      335       -34.620900         39.018000
So, in this dataset it would be the users with ids 101 and 302. But our dataset has millions of rows. Are there any built-in functions in pandas or Python to solve this?
CodePudding user response:
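Assuming the workers need to share exactly the same location to be considered close to each other, a groupby on location can match workers efficiently. Note that in the sample data below, user 302's coordinates are changed to 55.6621, 56.6621 so that they match user 101's exactly: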
from itertools import combinations
import pandas as pd
d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621,
                         55.114462, 55.6621, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                          56.6621, 39.018]}
df = pd.DataFrame(data=d)
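# Group users by exact coordinates and list every pair of user_ids in each group.
matched_workers = df.groupby(['worker_latitude', 'worker_longitude']).apply(
    lambda rows: list(combinations(rows['user_id'], r=2)))
# Keep only locations shared by at least two users (non-empty pair lists).
matched_workers = matched_workers.loc[matched_workers.apply(bool)]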
Which outputs:
worker_latitude  worker_longitude
55.6621          56.6621             [(101, 302)]
dtype: object
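
The groupby above only matches identical coordinates. In the question's original data, users 101 and 302 differ by 0.0001 degrees in both latitude and longitude, so an exact match would miss them. If "close" is meant as "within some distance", one possible approach is to bucket points into a grid and only compare points in neighbouring cells. The following is a minimal sketch using only the standard library, not a built-in pandas feature. The function names (`haversine_km`, `find_close_pairs`), the 50 m threshold and the spherical Earth radius are all assumptions made for illustration:

```python
from collections import defaultdict
from itertools import product
from math import asin, cos, floor, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treating the Earth as a sphere is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlam = radians(lon2 - lon1)
    a = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


def find_close_pairs(users, max_km):
    """Return sorted (id_a, id_b) pairs, id_a < id_b, at most max_km apart.

    users: iterable of (user_id, latitude, longitude) tuples, in degrees.
    """
    # Project each point onto the unit sphere and bucket it into a 3D grid.
    # The straight-line (chord) distance never exceeds the great-circle
    # distance, so two points within max_km always land in the same cell
    # or in directly adjacent cells.
    cell = max_km / EARTH_RADIUS_KM
    grid = defaultdict(list)
    for uid, lat, lon in users:
        phi, lam = radians(lat), radians(lon)
        xyz = (cos(phi) * cos(lam), cos(phi) * sin(lam), sin(phi))
        grid[tuple(floor(c / cell) for c in xyz)].append((uid, lat, lon))

    pairs = []
    for (kx, ky, kz), members in grid.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = grid.get((kx + dx, ky + dy, kz + dz), ())
            for uid_a, lat_a, lon_a in members:
                for uid_b, lat_b, lon_b in neighbours:
                    # uid_a < uid_b reports each pair once and skips self-matches.
                    if uid_a < uid_b and haversine_km(lat_a, lon_a, lat_b, lon_b) <= max_km:
                        pairs.append((uid_a, uid_b))
    return sorted(pairs)


# The question's original data (user 302 is NOT at exactly the same spot as 101).
users = [(11, -34.6209, -58.3742), (24, -2.7572, 52.3879),
         (101, 55.6621, 56.6621), (214, 55.114462, 38.927156),
         (302, 55.6622, 56.6622), (335, -34.6209, 39.018)]

print(find_close_pairs(users, max_km=0.05))  # [(101, 302)]
```

With a DataFrame, the tuples could be produced with something like `df.itertuples(index=False, name=None)`, provided the columns are in the order user_id, latitude, longitude. The grid keeps the number of distance checks roughly proportional to the number of rows as long as users are not densely packed into a few cells. For millions of rows a spatial index from a dedicated library may be faster, but that is outside what this sketch covers.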