How do I map a numpy array and an indices array to a pandas dataframe?

I have been following this tutorial on how to find the nearest neighbors of a point with scikit-learn.

However, when it comes to displaying the data, the tutorial merely mentions that "the indices can be mapped to useful values and the two arrays merged with the rest of the data".

But there's no actual explanation of how to do this. I'm not very well-versed in pandas and I don't know how to perform this merge, so I just end up with two two-dimensional arrays that I can't map back to the original data to study the example and experiment with it.

This is the code:

import numpy as np
from sklearn.neighbors import BallTree, KDTree
import pandas as pd

# Column names for the example DataFrame.
column_names = ["STATION NAME", "LAT", "LON"]

# A list of locations that will be used to construct the binary
# tree.
locations_a = [['BEAUFORT', 32.4, -80.633],
       ['CONWAY HORRY COUNTY AIRPORT', 33.828, -79.122],
       ['HUSTON/EXECUTIVE', 29.8, -95.9],
       ['ELIZABETHTON MUNI', 36.371, -82.173],
       ['JACK BARSTOW AIRPORT', 43.663, -84.261],
       ['MARLBORO CO JETPORT H E AVENT', 34.622, -79.734],
       ['SUMMERVILLE AIRPORT', 33.063, -80.279]]

# A list of locations that will be used to construct the queries
# for neighbors.
locations_b = [['BOOMVANG HELIPORT / OIL PLATFORM', 27.35, -94.633],
       ['LEE COUNTY AIRPORT', 36.654, -83.218],
       ['ELLINGTON', 35.507, -86.804],
       ['LAWRENCEVILLE BRUNSWICK MUNI', 36.773, -77.794],
       ['PUTNAM CO', 39.63, -86.814]]

# Converting the lists to DataFrames. We will build the tree with
# the first and execute the query on the second.

locations_a = pd.DataFrame(locations_a, columns = column_names)
locations_b = pd.DataFrame(locations_b, columns = column_names)

# Creates new columns converting coordinate degrees to radians.
for column in locations_a[["LAT", "LON"]]:
    rad = np.deg2rad(locations_a[column].values)
    locations_a[f'{column}_rad'] = rad
for column in locations_b[["LAT", "LON"]]:
    rad = np.deg2rad(locations_b[column].values)
    locations_b[f'{column}_rad'] = rad

# Takes the first group's latitude and longitude values to construct
# the ball tree.
ball = BallTree(locations_a[["LAT_rad", "LON_rad"]].values, metric='haversine')

# The number of neighbors to return.
k = 1

# Executes a query with the second group. This will also return two
# arrays.
distances, indices = ball.query(locations_b[["LAT_rad", "LON_rad"]].values, k = k)
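# Scaling by Earth's radius. The mean radius is about 6371 km, so a
# factor of 6.371 gives thousands of kilometers (use 6371 for km).
distances = distances * 6.371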

So how do I take distances and indices and map them to my dataframe to visually see the nearest neighbor of each point?

CodePudding user response：

Each integer in indices is the row position of a station in locations_a. Because locations_a has the default integer index, these positions are also its index labels, so you can use locations_a.loc[] to convert them to their corresponding station names as a numpy array:

nearest_station_names = locations_a.loc[indices.flatten()]['STATION NAME'].to_numpy()

(Why indices.flatten() instead of just indices? ball.query returns distances and indices as two-dimensional numpy arrays, where the second dimension (the number of columns) is 1. For indices to work in df.loc[], you need to "flatten" it into a one-dimensional array whose only dimension is the number of rows.)
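
To see the shape difference concretely, here is a minimal sketch. As assumptions for illustration, it hard-codes the k=1 indices from this example instead of calling ball.query, and it uses a cut-down locations_a that holds only the station names. It also uses .iloc, which selects by row position; .loc gives the same result here only because locations_a has the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for locations_a: station names only, default index.
locations_a = pd.DataFrame({"STATION NAME": [
    "BEAUFORT", "CONWAY HORRY COUNTY AIRPORT", "HUSTON/EXECUTIVE",
    "ELIZABETHTON MUNI", "JACK BARSTOW AIRPORT",
    "MARLBORO CO JETPORT H E AVENT", "SUMMERVILLE AIRPORT"]})

# Hard-coded copy of the k=1 query result: one row per query point,
# one column per neighbor.
indices = np.array([[2], [3], [3], [5], [4]])
print(indices.shape)   # (5, 1)

flat = indices.flatten()
print(flat.shape)      # (5,)

# Positional lookup of the matching station names.
nearest_station_names = locations_a["STATION NAME"].iloc[flat].to_numpy()
print(nearest_station_names)
```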

Next, insert the names as a new column into locations_b:

locations_b['nearest_stn'] = nearest_station_names

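Then insert distances as another new column (no need to .flatten in this case). The question's factor of 6.371 yields thousands of kilometers, so multiply by 1000 to get the kilometers shown in the output below:

locations_b['nearest_stn_km'] = distances * 1000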

# Print without radian columns for brevity
print(locations_b.drop(columns=['LAT_rad', 'LON_rad']))

                       STATION NAME     LAT     LON                    nearest_stn  nearest_stn_km
0  BOOMVANG HELIPORT / OIL PLATFORM  27.350 -94.633               HUSTON/EXECUTIVE      299.198339
1                LEE COUNTY AIRPORT  36.654 -83.218              ELIZABETHTON MUNI       98.550423
2                         ELLINGTON  35.507 -86.804              ELIZABETHTON MUNI      427.798176
3      LAWRENCEVILLE BRUNSWICK MUNI  36.773 -77.794  MARLBORO CO JETPORT H E AVENT      296.458070
4                         PUTNAM CO  39.630 -86.814           JACK BARSTOW AIRPORT      496.025005
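
As a sanity check on the units in the table above, the first row can be recomputed by hand with the haversine formula. This is only a sketch: haversine_km is a hypothetical helper name, and 6371 km is the commonly used mean Earth radius, not a value taken from the tutorial.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = np.deg2rad([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * np.arcsin(np.sqrt(a)) * EARTH_RADIUS_KM

# BOOMVANG HELIPORT / OIL PLATFORM -> HUSTON/EXECUTIVE
km = haversine_km(27.35, -94.633, 29.8, -95.9)
print(round(km, 1))  # about 299.2, in line with the first row above
```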

CodePudding user response：

As per the documentation for BallTree(), indices is a 2d array of shape (len(X), k), where X is the array supplied to the tree in the query (here it is locations_b).

So, when you query with locations_b and k=1, you receive back a 2d array of shape (5, 1) that represents the nearest neighbor of each station of locations_b as an index of the original X used to fit the BallTree, in this case locations_a.

This means you now have the indices of locations_a that represent the nearest neighbor(s) for each station in locations_b.

In the case of k=1, you can start by collecting the flattened distances and indices from your query into a small DataFrame, one row per station in locations_b:

neighbors = pd.DataFrame({'distance': distances.flatten(), 'neighbor_idx': indices.flatten()})

>>> neighbors

   distance  neighbor_idx
0  0.299198             2
1  0.098550             3
2  0.427798             3
3  0.296458             5
4  0.496025             4

In the code below, we join locations_b to neighbors (join matches on the index by default), then merge in one column of locations_a, matching our neighbor_idx field against the index of locations_a.

locations_b\
    .join(neighbors)\
    .merge(locations_a[['STATION NAME']], left_on='neighbor_idx', right_index=True, suffixes=("", "_neighbor"))\
    .drop('neighbor_idx', axis=1)

                       STATION NAME     LAT     LON   LAT_rad   LON_rad  distance          STATION NAME_neighbor
0  BOOMVANG HELIPORT / OIL PLATFORM  27.350 -94.633  0.477348 -1.651657  0.299198               HUSTON/EXECUTIVE
1                LEE COUNTY AIRPORT  36.654 -83.218  0.639733 -1.452428  0.098550              ELIZABETHTON MUNI
2                         ELLINGTON  35.507 -86.804  0.619714 -1.515016  0.427798              ELIZABETHTON MUNI
3      LAWRENCEVILLE BRUNSWICK MUNI  36.773 -77.794  0.641810 -1.357761  0.296458  MARLBORO CO JETPORT H E AVENT
4                         PUTNAM CO  39.630 -86.814  0.691674 -1.515190  0.496025           JACK BARSTOW AIRPORT

You can of course choose to merge additional columns from locations_a.
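
For instance, a sketch of merging all of locations_a, so the neighbor's coordinates come along too. As assumptions for illustration, the radian columns are left out and the k=1 query results are hard-coded from the neighbors table above rather than recomputed. The suffixes argument is what keeps the overlapping LAT and LON columns apart.

```python
import pandas as pd

column_names = ["STATION NAME", "LAT", "LON"]
locations_a = pd.DataFrame(
    [["BEAUFORT", 32.4, -80.633],
     ["CONWAY HORRY COUNTY AIRPORT", 33.828, -79.122],
     ["HUSTON/EXECUTIVE", 29.8, -95.9],
     ["ELIZABETHTON MUNI", 36.371, -82.173],
     ["JACK BARSTOW AIRPORT", 43.663, -84.261],
     ["MARLBORO CO JETPORT H E AVENT", 34.622, -79.734],
     ["SUMMERVILLE AIRPORT", 33.063, -80.279]],
    columns=column_names)
locations_b = pd.DataFrame(
    [["BOOMVANG HELIPORT / OIL PLATFORM", 27.35, -94.633],
     ["LEE COUNTY AIRPORT", 36.654, -83.218],
     ["ELLINGTON", 35.507, -86.804],
     ["LAWRENCEVILLE BRUNSWICK MUNI", 36.773, -77.794],
     ["PUTNAM CO", 39.63, -86.814]],
    columns=column_names)

# Hard-coded k=1 query results, copied from the neighbors table.
neighbors = pd.DataFrame({
    "distance": [0.299198, 0.098550, 0.427798, 0.296458, 0.496025],
    "neighbor_idx": [2, 3, 3, 5, 4]})

result = (
    locations_b
    .join(neighbors)
    # Merging every column of locations_a; overlapping names from the
    # right-hand side get the "_neighbor" suffix.
    .merge(locations_a, left_on="neighbor_idx", right_index=True,
           suffixes=("", "_neighbor"))
    .drop(columns="neighbor_idx")
    .sort_index()
)
print(result)
```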

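In the general case of more than one nearest neighbor, the above approach won't work as is: distances and indices then have k columns, so flattening them gives k values per station in locations_b, and they no longer line up one-to-one with its rows. Here's a way to tackle that: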

k=3
distances, indices = ball.query(locations_b[["LAT_rad", "LON_rad"]].values, k = k)
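# Scaling by Earth's radius. The mean radius is about 6371 km, so a
# factor of 6.371 gives thousands of kilometers (use 6371 for km).
distances = distances * 6.371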

dists = pd.DataFrame(distances).stack()
rel = pd.DataFrame(indices).stack()
neighbor_df = pd.merge(dists.rename('distance'), rel.rename('neighbor_idx'), right_index=True, left_index=True)
neighbor_df = neighbor_df.reset_index(level=1)
neighbor_df.columns = ['neighbor_number', 'distance', 'neighbor_idx']

>>> neighbor_df

   neighbor_number  distance  neighbor_idx
0                0  0.299198             2
0                1  1.460424             0
0                2  1.516741             6
1                0  0.098550             3
1                1  0.387486             5
1                2  0.480926             6
2                0  0.427798             3
2                1  0.650799             5
2                2  0.658009             6
3                0  0.296458             5
3                1  0.348930             1
3                2  0.393563             3
4                0  0.496025             4
4                1  0.544545             3
4                2  0.838587             5

locations_b\
    .join(neighbor_df)\
    .merge(locations_a[['STATION NAME']], left_on='neighbor_idx', right_index=True, suffixes=("", "_neighbor"))\
    .drop('neighbor_idx', axis=1)

Results:


                       STATION NAME     LAT     LON   LAT_rad   LON_rad  neighbor_number  distance          STATION NAME_neighbor
0  BOOMVANG HELIPORT / OIL PLATFORM  27.350 -94.633  0.477348 -1.651657                0  0.299198               HUSTON/EXECUTIVE
0  BOOMVANG HELIPORT / OIL PLATFORM  27.350 -94.633  0.477348 -1.651657                1  1.460424                       BEAUFORT
0  BOOMVANG HELIPORT / OIL PLATFORM  27.350 -94.633  0.477348 -1.651657                2  1.516741            SUMMERVILLE AIRPORT
1                LEE COUNTY AIRPORT  36.654 -83.218  0.639733 -1.452428                2  0.480926            SUMMERVILLE AIRPORT
2                         ELLINGTON  35.507 -86.804  0.619714 -1.515016                2  0.658009            SUMMERVILLE AIRPORT
1                LEE COUNTY AIRPORT  36.654 -83.218  0.639733 -1.452428                0  0.098550              ELIZABETHTON MUNI
2                         ELLINGTON  35.507 -86.804  0.619714 -1.515016                0  0.427798              ELIZABETHTON MUNI
3      LAWRENCEVILLE BRUNSWICK MUNI  36.773 -77.794  0.641810 -1.357761                2  0.393563              ELIZABETHTON MUNI
4                         PUTNAM CO  39.630 -86.814  0.691674 -1.515190                1  0.544545              ELIZABETHTON MUNI
1                LEE COUNTY AIRPORT  36.654 -83.218  0.639733 -1.452428                1  0.387486  MARLBORO CO JETPORT H E AVENT
2                         ELLINGTON  35.507 -86.804  0.619714 -1.515016                1  0.650799  MARLBORO CO JETPORT H E AVENT
3      LAWRENCEVILLE BRUNSWICK MUNI  36.773 -77.794  0.641810 -1.357761                0  0.296458  MARLBORO CO JETPORT H E AVENT
4                         PUTNAM CO  39.630 -86.814  0.691674 -1.515190                2  0.838587  MARLBORO CO JETPORT H E AVENT
3      LAWRENCEVILLE BRUNSWICK MUNI  36.773 -77.794  0.641810 -1.357761                1  0.348930    CONWAY HORRY COUNTY AIRPORT
4                         PUTNAM CO  39.630 -86.814  0.691674 -1.515190                0  0.496025           JACK BARSTOW AIRPORT
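
An alternative sketch for the k > 1 case builds the long table directly with numpy instead of stack and merge. It is not the method used above, just another way to get one row per (station, neighbor) pair. As assumptions for illustration, the k=3 query results are hard-coded from the neighbor_df table, and the station names are held in plain arrays (station_names_a and station_names_b are made-up names).

```python
import numpy as np
import pandas as pd

station_names_a = np.array([
    "BEAUFORT", "CONWAY HORRY COUNTY AIRPORT", "HUSTON/EXECUTIVE",
    "ELIZABETHTON MUNI", "JACK BARSTOW AIRPORT",
    "MARLBORO CO JETPORT H E AVENT", "SUMMERVILLE AIRPORT"])
station_names_b = np.array([
    "BOOMVANG HELIPORT / OIL PLATFORM", "LEE COUNTY AIRPORT", "ELLINGTON",
    "LAWRENCEVILLE BRUNSWICK MUNI", "PUTNAM CO"])

# Hard-coded k=3 query results, copied from the neighbor_df table.
indices = np.array([[2, 0, 6], [3, 5, 6], [3, 5, 6], [5, 1, 3], [4, 3, 5]])
distances = np.array([[0.299198, 1.460424, 1.516741],
                      [0.098550, 0.387486, 0.480926],
                      [0.427798, 0.650799, 0.658009],
                      [0.296458, 0.348930, 0.393563],
                      [0.496025, 0.544545, 0.838587]])

n_queries, k = indices.shape
long_df = pd.DataFrame({
    # Each query station repeated k times, once per neighbor.
    "STATION NAME": np.repeat(station_names_b, k),
    # 0..k-1 for every query station.
    "neighbor_number": np.tile(np.arange(k), n_queries),
    # ravel() walks the arrays row by row, matching the repeat above.
    "distance": distances.ravel(),
    "STATION NAME_neighbor": station_names_a[indices.ravel()],
})
print(long_df)
```

Unlike the merged output above, the rows here stay grouped by query station, in neighbor order.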

Cool project!