Here's my code:
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree, DistanceMetric
# df1
N = 2000
df1 = pd.DataFrame({'name': 'name' + pd.RangeIndex(1, N + 1).astype(str),
'lat': np.random.uniform(30, 65, N),
'lon': np.random.uniform(-150, -70, N)})
# df2
N = 25000
df2 = pd.DataFrame({'sitename': 'site' + pd.RangeIndex(1, N + 1).astype(str),
'lat': np.random.uniform(30, 65, N),
'lon': np.random.uniform(-150, -70, N)})
# bts
coords = np.radians(df1[['lat', 'lon']])
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(coords, metric=dist)
# airport_hostpital_1
coords = np.radians(df2[['lat', 'lon']])
distances, indices = tree.query(coords, k=1)
df1['sitename'] = df2.iloc[indices.ravel()]['sitename'].values
Here's my output:
ValueError Traceback (most recent call last)
<ipython-input-4-7d20f0c5d9d7> in <module>
24 distances, indices = tree.query(coords, k=1)
25
---> 26 df1['sitename'] = df2.iloc[indices.ravel()]['sitename'].values
ValueError: Length of values (25000) does not match length of index (2000)
My expected output:
sitename lat lon name
0 site1 46.079246 -105.782183 name1209
1 site2 49.243516 -95.104086 name1091
2 site3 63.956400 -89.549558 name91
CodePudding user response:
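As in my answer to your previous question, the problem is that you have swapped the dataframes. Build the tree on df2 (the 25000 sites to search) and query it with the coordinates of df1 (the 2000 names), so that tree.query returns one index per row of df1. In your version the query returns 25000 indices, which cannot be assigned to a 2000-row dataframe. To fix your code: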
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree, DistanceMetric
# df1
N = 2000
df1 = pd.DataFrame({'name': 'name' + pd.RangeIndex(1, N + 1).astype(str),
'lat': np.random.uniform(30, 65, N),
'lon': np.random.uniform(-150, -70, N)})
# df2
N = 25000
df2 = pd.DataFrame({'sitename': 'site' + pd.RangeIndex(1, N + 1).astype(str),
'lat': np.random.uniform(30, 65, N),
'lon': np.random.uniform(-150, -70, N)})
# bts
coords = np.radians(df2[['lat', 'lon']]) # HERE df1 -> df2
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(coords, metric=dist)
# airport_hostpital_1
coords = np.radians(df1[['lat', 'lon']]) # HERE df2 -> df1
distances, indices = tree.query(coords, k=1)
df1['sitename'] = df2.iloc[indices.ravel()]['sitename'].values
Output:
>>> df1
name lat lon sitename
0 name1 42.263207 -118.243787 site16231
1 name2 33.034391 -134.604954 site11275
2 name3 30.370661 -90.828936 site12107
3 name4 57.250977 -102.941565 site12079
4 name5 45.296180 -80.000868 site17749
... ... ... ... ...
1995 name1996 35.359411 -87.820709 site5675
1996 name1997 57.476931 -79.979884 site6402
1997 name1998 46.141786 -119.306523 site554
1998 name1999 49.551388 -86.893896 site8452
1999 name2000 55.836713 -76.379846 site5976
[2000 rows x 4 columns]
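If you want to sanity-check the result, here is a minimal brute-force sketch in plain Python (no BallTree) that does the same kind of lookup on a tiny, made-up dataset. The helper names (haversine_rad, nearest_sites), the toy coordinates and the Earth radius constant are assumptions for illustration, not part of the code above. It shows the two points that matter: you get exactly one match per query row, and a haversine distance computed on radians is an angle on the unit sphere, so it has to be multiplied by the Earth's radius to get kilometres.

```python
import math

# Assumed constant: mean Earth radius, used to turn unit-sphere distances into km.
EARTH_RADIUS_KM = 6371.0


def haversine_rad(lat1, lon1, lat2, lon2):
    """Great-circle distance on the unit sphere; all inputs in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def nearest_sites(queries, sites):
    """For each (lat, lon) query in degrees, return (nearest site index, distance in km)."""
    sites_rad = [(math.radians(lat), math.radians(lon)) for lat, lon in sites]
    result = []
    for lat, lon in queries:
        qlat, qlon = math.radians(lat), math.radians(lon)
        dists = [haversine_rad(qlat, qlon, slat, slon) for slat, slon in sites_rad]
        best = min(range(len(dists)), key=dists.__getitem__)
        result.append((best, dists[best] * EARTH_RADIUS_KM))
    return result


# Hypothetical toy data: `names` plays the role of df1 (the queries),
# `sites` plays the role of df2 (the points the tree would be built on).
names = [(40.0, -100.0), (60.0, -80.0)]
sites = [(59.5, -80.5), (41.0, -101.0), (35.0, -120.0)]

matches = nearest_sites(names, sites)
print(len(matches) == len(names))   # one match per query row
print([idx for idx, _ in matches])  # index into `sites` for each query
```

This is O(len(names) * len(sites)), so it is only meant for checking a handful of rows. For the full dataframes the BallTree version is the one to use, and under the same assumption distances * 6371 would give the distance to the matched site in kilometres.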