I have a data frame that looks like this:
ID PC1 PC2
12 0.355 0.362
24 0.577 0.425
15 0.257 0.486
06 0.585 0.254
34 0.367 0.533
I want to use Euclidean distance on PC1 and PC2 to find the closest data points.
When I put in ID = 15 in Python, I want it to give me a list of the 3 data points that are closest. The desired output would look like this:
ID PC1 PC2 Distance
34 0.367 0.533 ###
06 0.585 0.254 ###
12 0.355 0.362 ###
The numbers above are made up, but that’s what I want my output to look like. ‘Distance’ is the distance from the input data point (ID=15) to each respective data point in the output.
I know this is fairly simple, but I am a beginner in Python and a little unsure of how to approach this. Can someone help me out? Thanks.
CodePudding user response:
Try:
# Assumes the ID column holds strings; use ID = 15 if it holds integers
ID = '15'
# Extract PC1, PC2 from current point (ID='15')
point = df.loc[df['ID'] == ID, ['PC1', 'PC2']].squeeze()
# Keep all other points
others = df.loc[df['ID'] != ID, ['PC1', 'PC2']]
# Compute the distance between the point and the other points
# and keep the 3 points with the smallest distance
dist = (others - point).pow(2).sum(1).pow(0.5).nsmallest(3)
out = df.loc[dist.index].assign(Distance=dist)
Output:
>>> out
ID PC1 PC2 Distance
4 34 0.367 0.533 0.119620
0 12 0.355 0.362 0.158051
1 24 0.577 0.425 0.325762
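If it helps, here is the same approach as a self-contained sketch you can paste and run. The `nearest_points` helper and its `k` and `cols` parameters are names made up for illustration, and the IDs are assumed to be strings (so that `06` keeps its leading zero).

```python
import pandas as pd

# Sample data from the question. Assumption: IDs are stored as strings.
df = pd.DataFrame({
    "ID": ["12", "24", "15", "06", "34"],
    "PC1": [0.355, 0.577, 0.257, 0.585, 0.367],
    "PC2": [0.362, 0.425, 0.486, 0.254, 0.533],
})


def nearest_points(df, point_id, k=3, cols=("PC1", "PC2")):
    """Return the k rows closest (Euclidean) to the row whose ID is point_id.

    Hypothetical helper, not part of pandas.
    """
    cols = list(cols)
    mask = df["ID"] == point_id
    if mask.sum() != 1:
        raise ValueError(f"expected exactly one row with ID={point_id!r}")

    point = df.loc[mask, cols].iloc[0]   # coordinates of the query point
    others = df.loc[~mask, cols]         # every other row

    # Row-wise Euclidean distance, then keep the k smallest (sorted ascending).
    dist = (others - point).pow(2).sum(axis=1).pow(0.5).nsmallest(k)
    return df.loc[dist.index].assign(Distance=dist)


out = nearest_points(df, "15")
print(out)
```

With the sample data this should give the same three rows as above (IDs 34, 12 and 24). An unknown ID raises a `ValueError` instead of silently returning an empty result.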
CodePudding user response:
If you need to query many points, you may want to construct a KDTree
first. Here is an example using scipy:
from scipy import spatial
# WARNING: This assumes that all points in the DataFrame are distinct,
# and that df is indexed by ID, e.g. df = df.set_index("ID").
# construct a KDTree given a set of points
tree = spatial.cKDTree(df[["PC1", "PC2"]])
# get the four (= three + one) points closest to a given point (ID 15 in this case);
# the query point itself will be part of the output
distances, indices = tree.query(df.loc[15].values, k=3 + 1)
# get the df of near points
df_near = df.iloc[indices[1:]].assign(Distance=distances[1:])
print(df_near)
# PC1 PC2 Distance
# ID
# 34 0.367 0.533 0.119620
# 12 0.355 0.362 0.158051
# 24 0.577 0.425 0.325762
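The benefit of the tree shows up when you reuse it for several queries. Below is a rough, runnable sketch of that idea. The `query_neighbors` helper is a made-up name, the IDs are assumed to be integers used as the DataFrame index, and `k` is assumed to be smaller than the number of points. It removes the query point by its position rather than by always dropping the first result, which is one way to relax the "all points are distinct" assumption.

```python
import pandas as pd
from scipy import spatial

# Sample data from the question. Assumption: integer IDs used as the index.
df = pd.DataFrame(
    {
        "PC1": [0.355, 0.577, 0.257, 0.585, 0.367],
        "PC2": [0.362, 0.425, 0.486, 0.254, 0.533],
    },
    index=pd.Index([12, 24, 15, 6, 34], name="ID"),
)

# Build the tree once; every later query reuses it.
tree = spatial.cKDTree(df[["PC1", "PC2"]].to_numpy())


def query_neighbors(point_id, k=3):
    """Return the k nearest rows to the row labelled point_id (hypothetical helper)."""
    # Ask for k + 1 results because the query point is itself in the tree.
    distances, indices = tree.query(df.loc[point_id].to_numpy(), k=k + 1)

    # Drop the query point by its row position, not by assuming it comes first.
    pos = df.index.get_loc(point_id)
    keep = indices != pos
    return df.iloc[indices[keep]][:k].assign(Distance=distances[keep][:k])


near = query_neighbors(15)
print(near)

# The same tree answers further queries without being rebuilt.
print(query_neighbors(34))
```

For ID 15 this should return the same rows as the output above (34, 12, 24).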