I have been following this tutorial on how to find nearest neighbors of a point with scikit.
However, when it comes to displaying the data, the tutorial merely mentions that "the indices can be mapped to useful values and the two arrays merged with the rest of the data"
But there's no actual explanation on how to do this. I'm not very well-versed in Pandas and I don't know how to perform this merge, so I just end up with 2 multidimensional arrays and I don't know how to map them to the original data to study the example and experiment with it.
This is the code
import numpy as np
from sklearn.neighbors import BallTree, KDTree
import pandas as pd
# Column names for the example DataFrame.
column_names = ["STATION NAME", "LAT", "LON"]
# A list of locations that will be used to construct the binary
# tree.
locations_a = [['BEAUFORT', 32.4, -80.633],
['CONWAY HORRY COUNTY AIRPORT', 33.828, -79.122],
['HUSTON/EXECUTIVE', 29.8, -95.9],
['ELIZABETHTON MUNI', 36.371, -82.173],
['JACK BARSTOW AIRPORT', 43.663, -84.261],
['MARLBORO CO JETPORT H E AVENT', 34.622, -79.734],
['SUMMERVILLE AIRPORT', 33.063, -80.279]]
# A list of locations that will be used to construct the queries.
# for neighbors.
locations_b = [['BOOMVANG HELIPORT / OIL PLATFORM', 27.35, -94.633],
['LEE COUNTY AIRPORT', 36.654, -83.218],
['ELLINGTON', 35.507, -86.804],
['LAWRENCEVILLE BRUNSWICK MUNI', 36.773, -77.794],
['PUTNAM CO', 39.63, -86.814]]
# Converting the lists to DataFrames. We will build the tree with
# the first and execute the query on the second.
locations_a = pd.DataFrame(locations_a, columns = column_names)
locations_b = pd.DataFrame(locations_b, columns = column_names)
# Creates new columns converting coordinate degrees to radians.
for column in locations_a[["LAT", "LON"]]:
rad = np.deg2rad(locations_a[column].values)
locations_a[f'{column}_rad'] = rad
for column in locations_b[["LAT", "LON"]]:
rad = np.deg2rad(locations_b[column].values)
locations_b[f'{column}_rad'] = rad
# Takes the first group's latitude and longitude values to construct
# the ball tree.
ball = BallTree(locations_a[["LAT_rad", "LON_rad"]].values, metric='haversine')
# The amount of neighbors to return.
k = 1
# Executes a query with the second group. This will also return two
# arrays.
distances, indices = ball.query(locations_b[["LAT_rad", "LON_rad"]].values, k = k)
#converting to kilometers
distances = distances * 6.371
So how do I take distances
and indices
and map them to my dataframe to visually see the nearest neighbor of each point?
CodePudding user response:
Each integer index in indices
refers to an index value (row number) of locations_a
. You can use locations_a.loc[]
to convert these indices to their corresponding station names as a numpy array:
nearest_station_names = locations_a.loc[indices.flatten()]['STATION NAME'].to_numpy()
(Why indices.flatten()
instead of just indices
? ball.query
returns distances
and indices
as two-dimensional numpy arrays, where the second dimension (the number of columns) is 1. For indices
to work in df.loc[]
, you need to "flatten" it into a one-dimensional array whose only dimension is the number of rows.)
Next, insert the names as a new column into locations_b
:
locations_b['nearest_stn'] = nearest_station_names
Then insert distances
as another new column (no need to .flatten
in this case):
locations_b['nearest_stn_dist'] = distances
# Print without radian columns for brevity
print(locations_b.drop(columns=['LAT_rad', 'LON_rad']))
STATION NAME LAT LON nearest_stn nearest_stn_km
0 BOOMVANG HELIPORT / OIL PLATFORM 27.350 -94.633 HUSTON/EXECUTIVE 299.198339
1 LEE COUNTY AIRPORT 36.654 -83.218 ELIZABETHTON MUNI 98.550423
2 ELLINGTON 35.507 -86.804 ELIZABETHTON MUNI 427.798176
3 LAWRENCEVILLE BRUNSWICK MUNI 36.773 -77.794 MARLBORO CO JETPORT H E AVENT 296.458070
4 PUTNAM CO 39.630 -86.814 JACK BARSTOW AIRPORT 496.025005
CodePudding user response:
As per the documentation for BallTree()
, indices
is a 2d array of shape (len(X), k), where X
is the array supplied to the tree in the query (here it is locations_b
).
So, when you query with locations_b
and k=1
, you receive back a 2d array of shape (5, 1) that represents the nearest neighbor of each station of locations_b
as an index of the original X
used to fit the BallTree
, in this case locations_a
.
This means you now have the indices of locations_a
that represent the nearest neighbor(s) for each station in locations_b
.
In the case of k=1
, you can get information about the nearest station by indexing locations_a
with the indices resulting from your query like so:
neighbors = pd.DataFrame({'distance': distances.flatten(), 'neighbor_idx': indices.flatten()})
>>> neighbors
distance neighbor_idx
0 0.299198 2
1 0.098550 3
2 0.427798 3
3 0.296458 5
4 0.496025 4
In the below code, we join
locations_b
to neighbors
, which joins on index by default, then merge one column of locations_a
, using our neighbor_idx
field to match the index of locations_a
.
locations_b\
.join(neighbors)\
.merge(locations_a[['STATION NAME']], left_on='neighbor_idx', right_index=True, suffixes=("", "_neighbor"))\
.drop('neighbor_idx', axis=1)
STATION NAME LAT LON LAT_rad LON_rad distance STATION NAME_neighbor
0 BOOMVANG HELIPORT / OIL PLATFORM 27.350 -94.633 0.477348 -1.651657 0.299198 HUSTON/EXECUTIVE
1 LEE COUNTY AIRPORT 36.654 -83.218 0.639733 -1.452428 0.098550 ELIZABETHTON MUNI
2 ELLINGTON 35.507 -86.804 0.619714 -1.515016 0.427798 ELIZABETHTON MUNI
3 LAWRENCEVILLE BRUNSWICK MUNI 36.773 -77.794 0.641810 -1.357761 0.296458 MARLBORO CO JETPORT H E AVENT
4 PUTNAM CO 39.630 -86.814 0.691674 -1.515190 0.496025 JACK BARSTOW AIRPORT
You can of course choose to merge additional columns from locations_a
.
In the general case of more than one nearest neighbor, the above approach won't work quite right. Here's a way to tackle that:
k=3
distances, indices = ball.query(locations_b[["LAT_rad", "LON_rad"]].values, k = k)
#converting to kilometers
distances = distances * 6.371
dists = pd.DataFrame(distances).stack()
rel = pd.DataFrame(indices).stack()
neighbor_df = pd.merge(dists.rename('distance'), rel.rename('neighbor_idx'), right_index=True, left_index=True)
neighbor_df = neighbor_df.reset_index(level=1)
neighbor_df.columns = ['neighbor_number', 'distance', 'neighbor_idx']
>>> neighbor_df
neighbor_number distance neighbor_idx
0 0 0.299198 2
0 1 1.460424 0
0 2 1.516741 6
1 0 0.098550 3
1 1 0.387486 5
1 2 0.480926 6
2 0 0.427798 3
2 1 0.650799 5
2 2 0.658009 6
3 0 0.296458 5
3 1 0.348930 1
3 2 0.393563 3
4 0 0.496025 4
4 1 0.544545 3
4 2 0.838587 5
locations_b\
.join(neighbor_df)\
.merge(locations_a[['STATION NAME']], left_on='neighbor_idx', right_index=True, suffixes=("", "_neighbor"))\
.drop('neighbor_idx', axis=1)
Results:
STATION NAME LAT LON LAT_rad LON_rad neighbor_number distance STATION NAME_neighbor
0 BOOMVANG HELIPORT / OIL PLATFORM 27.350 -94.633 0.477348 -1.651657 0 0.299198 HUSTON/EXECUTIVE
0 BOOMVANG HELIPORT / OIL PLATFORM 27.350 -94.633 0.477348 -1.651657 1 1.460424 BEAUFORT
0 BOOMVANG HELIPORT / OIL PLATFORM 27.350 -94.633 0.477348 -1.651657 2 1.516741 SUMMERVILLE AIRPORT
1 LEE COUNTY AIRPORT 36.654 -83.218 0.639733 -1.452428 2 0.480926 SUMMERVILLE AIRPORT
2 ELLINGTON 35.507 -86.804 0.619714 -1.515016 2 0.658009 SUMMERVILLE AIRPORT
1 LEE COUNTY AIRPORT 36.654 -83.218 0.639733 -1.452428 0 0.098550 ELIZABETHTON MUNI
2 ELLINGTON 35.507 -86.804 0.619714 -1.515016 0 0.427798 ELIZABETHTON MUNI
3 LAWRENCEVILLE BRUNSWICK MUNI 36.773 -77.794 0.641810 -1.357761 2 0.393563 ELIZABETHTON MUNI
4 PUTNAM CO 39.630 -86.814 0.691674 -1.515190 1 0.544545 ELIZABETHTON MUNI
1 LEE COUNTY AIRPORT 36.654 -83.218 0.639733 -1.452428 1 0.387486 MARLBORO CO JETPORT H E AVENT
2 ELLINGTON 35.507 -86.804 0.619714 -1.515016 1 0.650799 MARLBORO CO JETPORT H E AVENT
3 LAWRENCEVILLE BRUNSWICK MUNI 36.773 -77.794 0.641810 -1.357761 0 0.296458 MARLBORO CO JETPORT H E AVENT
4 PUTNAM CO 39.630 -86.814 0.691674 -1.515190 2 0.838587 MARLBORO CO JETPORT H E AVENT
3 LAWRENCEVILLE BRUNSWICK MUNI 36.773 -77.794 0.641810 -1.357761 1 0.348930 CONWAY HORRY COUNTY AIRPORT
4 PUTNAM CO 39.630 -86.814 0.691674 -1.515190 0 0.496025 JACK BARSTOW AIRPORT
Cool project!