Getting closest observations using Euclidean distance

I have a data frame that looks like this:

ID PC1   PC2
12 0.355 0.362
24 0.577 0.425
15 0.257 0.486
06 0.585 0.254
34 0.367 0.533

I want to use Euclidean distance on PC1 and PC2 to get the closest data points.

My desired output is that when I enter ID = 15 in Python, I get a list of the 3 closest data points. It would look like this:

ID PC1   PC2   Distance
34 0.367 0.533 ###
06 0.585 0.254 ###
12 0.355 0.462 ###

The numbers above are made up, but that’s what I want my output to look like. ‘Distance’ is the distance from the input data point (ID=15) to each respective data point in the output.

I know this is fairly simple, but I am a Python beginner and a little unsure of how to approach this. Can someone help me out? Thanks.

CodePudding user response：

Try:

ID = '15'  # use ID = 15 instead if the ID column holds integers

# Extract PC1, PC2 from current point (ID='15')
point = df.loc[df['ID'] == ID, ['PC1', 'PC2']].squeeze()

# Keep all other points
others = df.loc[df['ID'] != ID, ['PC1', 'PC2']]

# Compute the distance between the point and the other points
# and keep the 3 points with the smallest distance
dist = (others - point).pow(2).sum(axis=1).pow(0.5).nsmallest(3)
out = df.loc[dist.index].assign(Distance=dist)

Output:

>>> out
   ID    PC1    PC2  Distance
4  34  0.367  0.533  0.119620
0  12  0.355  0.362  0.158051
1  24  0.577  0.425  0.325762
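
If you want to run this end to end, here is a minimal self-contained sketch of the same approach. The helper name closest_points and its parameters are my own (not part of pandas), and I am assuming the IDs are stored as strings so that "06" keeps its leading zero:

```python
import pandas as pd

# Sample data from the question; IDs are strings (an assumption) so "06" keeps its zero.
df = pd.DataFrame({
    "ID": ["12", "24", "15", "06", "34"],
    "PC1": [0.355, 0.577, 0.257, 0.585, 0.367],
    "PC2": [0.362, 0.425, 0.486, 0.254, 0.533],
})

def closest_points(df, target_id, k=3, cols=("PC1", "PC2")):
    """Hypothetical helper: return the k rows closest to the row with ID == target_id."""
    cols = list(cols)
    is_target = df["ID"] == target_id
    if is_target.sum() != 1:
        raise ValueError(f"expected exactly one row with ID {target_id!r}")
    point = df.loc[is_target, cols].iloc[0]
    others = df.loc[~is_target, cols]
    # Euclidean distance: square root of the summed squared differences,
    # then keep the k smallest (sorted ascending)
    dist = (others - point).pow(2).sum(axis=1).pow(0.5).nsmallest(k)
    return df.loc[dist.index].assign(Distance=dist)

out = closest_points(df, "15")
print(out)
```

Raising an error when the ID is missing or duplicated is a design choice on my part; without that check a missing ID would silently produce an empty or misleading result.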

CodePudding user response：

If you need to query many points, you may want to construct a KDTree first. Here is an example using scipy:

from scipy import spatial

# WARNING: This assumes that all points in the DataFrame are distinct.

# Assumes "ID" is the index of df (e.g. df = df.set_index("ID"))
# construct a KDTree given a set of points
tree = spatial.cKDTree(df[["PC1", "PC2"]])
# get four = three + one points closest to a given point (ID 15 in this case)
# the query point itself will be part of the output
distances, indices = tree.query(df.loc[15, ["PC1", "PC2"]].values, k=3 + 1)
# get the df of near points
df_near = df.iloc[indices[1:]].assign(Distance=distances[1:])

print(df_near)
#       PC1    PC2  Distance
# ID                        
# 34  0.367  0.533  0.119620
# 12  0.355  0.362  0.158051
# 24  0.577  0.425  0.325762
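
For completeness, here is a self-contained sketch of the tree-based lookup that does not rely on the query point coming back first. It drops the query row by its position instead, which should also be safer when there are duplicate coordinates. The helper name nearest is hypothetical, and I am assuming integer IDs that are unique and set as the index:

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Sample data from the question, with ID as the index (assumed to be unique integers).
df = pd.DataFrame(
    {"PC1": [0.355, 0.577, 0.257, 0.585, 0.367],
     "PC2": [0.362, 0.425, 0.486, 0.254, 0.533]},
    index=pd.Index([12, 24, 15, 6, 34], name="ID"),
)

# Build the tree once; it can then be reused for many queries.
tree = KDTree(df[["PC1", "PC2"]].to_numpy())

def nearest(df, tree, target_id, k=3):
    """Hypothetical helper: the k nearest rows to target_id, excluding the row itself."""
    pos = df.index.get_loc(target_id)  # integer position of the query row
    query = df[["PC1", "PC2"]].iloc[pos].to_numpy()
    # Ask for k + 1 neighbours because the query row is in the tree too.
    distances, indices = tree.query(query, k=k + 1)
    # Drop the query row by position; also drop the padding (infinite distance)
    # that query() returns when the tree holds fewer than k + 1 points.
    keep = (indices != pos) & np.isfinite(distances)
    return df.iloc[indices[keep][:k]].assign(Distance=distances[keep][:k])

print(nearest(df, tree, 15))
```

Building the tree has an up-front cost, so this is mainly worth it when the same set of points is queried repeatedly.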