As seen in the picture I have an outlier and I would like to remove it(not the red one but the one above it in green, which is not aligned with other points) and hence I am trying to find the min distance and then try to eliminate it. But given the huge dataset it takes an eternity to execute. This is my code below. Appreciate any solution that helps, thanks! enter image description here
import math
#list of 11600 points
dataset = [[2478, 3534], [4217, 953],......,11600 points]
copy_dataset = dataset
Indices =[]
Min_Dists =[]
Distance = []
Copy_Dist=[]
for p1 in range(len(dataset)):
p1_x= dataset[p1][0]
p1_y= dataset[p1][1]
for p2 in range(len(copy_dataset)):
p2_x= copy_dataset[p2][0]
p2_y= copy_dataset[p2][1]
dist = math.sqrt((p1_x - p2_x) ** 2 (p1_y - p2_y) ** 2)
Distance.append(dist)
Copy_Dist.append(dist)
min_dist_1= min(Distance)
Distance.remove(min_dist_1)
if(min_dist_1 !=0):
Min_Dists.append(min_dist_1)
ind_1 = Copy_Dist.index(min_dist_1)
Indices.append(ind_1)
min_dist_2=min(Distance)
Distance.remove(min_dist_2)
if(min_dist_2 !=0):
Min_Dists.append(min_dist_2)
ind_2 = Copy_Dist.index(min_dist_2)
Indices.append(ind_2)
To_Remove = copy_dataset.index([p1_x, p1_y])
copy_dataset.remove(copy_dataset[To_Remove])
CodePudding user response:
Not sure how to solve this problem in general, but it's probably a lot faster to compute the distances in a vectorized fashion.
dataset_copy = dataset.copy()
dataset_copy = dataset_copy[:, np.newaxis]
distance = np.sqrt(np.sum(np.square(dataset - dataset_copy), axis=~0))
CodePudding user response:
If you need all distances at once:
distances = scipy.spatial.distance_matrix(dataset, dataset)
If you need distances of one point to all others:
for pt in dataset:
distances = scipy.spatial.distance_matrix([pt], dataset)[0]
# distances.min() will be 0 because the point has 0 distance to itself
# the nearest neighbor will be the second element in sorted order
indices = np.argpartition(distances, 1) # or use argsort for a complete sort
nearest_neighbor = indices[1]
Documentation: distance_matrix
, argpartition