Remove rows from DataFrame when X, Y coordinates are within a threshold distance of another row

I'm trying to remove rows from a DataFrame that lie within a Euclidean distance threshold of other points in the same DataFrame. For example, in the small DataFrame below, two rows would be removed with a threshold of 0.001 (1 mm: thresh = 0.001), where X and Y are spatial coordinates:

import pandas as pd
data = {'X': [0.075, 0.0791667,0.0749543,0.0791184,0.075,0.0833333, 0.0749543],
        'Y': [1e-15, 0,-0.00261746,-0.00276288, -1e-15,0,-0.00261756],
        'T': [12.57,12.302,12.56,12.292,12.57,12.052,12.56]}

df = pd.DataFrame(data)

df
#           X             Y       T
# 0  0.075000  1.000000e-15  12.570
# 1  0.079167  0.000000e+00  12.302
# 2  0.074954 -2.617460e-03  12.560
# 3  0.079118 -2.762880e-03  12.292
# 4  0.075000 -1.000000e-15  12.570
# 5  0.083333  0.000000e+00  12.052
# 6  0.074954 -2.617560e-03  12.560

The rows with indices 4 and 6 need to be removed because they are spatial duplicates of rows 0 and 2, respectively, since they are within the specified threshold distance of previously listed points. Also, I always want to remove the 2nd occurrence of a point that is within the threshold distance of a previous point. What's the best way to approach this?

CodePudding user response：

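Since the key word here is distance, we can use cdist from scipy to compute all pairwise distances, then drop every row that is within the threshold of an earlier row: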

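import numpy as np
from scipy.spatial.distance import cdist

thresh = 0.001
v = df[['X', 'Y']]
ary = cdist(v, v, metric='euclidean')  # full matrix of pairwise distances

# k=-1 keeps only entries strictly below the diagonal, i.e. distances to earlier rows
df[~np.tril(ary < thresh, k=-1).any(axis=1)]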
Out[100]: 
          X             Y       T
0  0.075000  1.000000e-15  12.570
1  0.079167  0.000000e+00  12.302
2  0.074954 -2.617460e-03  12.560
3  0.079118 -2.762880e-03  12.292
5  0.083333  0.000000e+00  12.052
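
One thing to keep in mind: cdist builds the full n x n distance matrix, so memory grows quadratically with the number of rows. For larger DataFrames, a KD-tree may be a better fit. Below is a minimal sketch, assuming scipy's cKDTree and its query_pairs method are acceptable here; the names tree, pairs and out_tree are just illustrative. Note that query_pairs treats a distance equal to the threshold as a match (<=), whereas the cdist version uses a strict <.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame({
    'X': [0.075, 0.0791667, 0.0749543, 0.0791184, 0.075, 0.0833333, 0.0749543],
    'Y': [1e-15, 0, -0.00261746, -0.00276288, -1e-15, 0, -0.00261756],
    'T': [12.57, 12.302, 12.56, 12.292, 12.57, 12.052, 12.56],
})
thresh = 0.001

tree = cKDTree(df[['X', 'Y']].to_numpy())

# query_pairs returns a set of (i, j) tuples with i < j
# whose Euclidean distance is at most thresh
pairs = tree.query_pairs(r=thresh)

# Drop the later member (j) of every close pair
drop_positions = sorted({j for _, j in pairs})
out_tree = df.drop(df.index[drop_positions])
```

On the sample data this should drop the same two rows (4 and 6) as the cdist version.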

CodePudding user response：

Here is a NumPy-only approach. Calculate the Euclidean distance for each pair of (X, Y) points, which gives a symmetric matrix. Then mask the diagonal and upper half, so that each row is compared only with the rows before it, and filter out the rows that have any remaining value less than thresh:

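import numpy as np

thresh = 0.001  # 1 mm, as in the question
# (n, 1) column minus (n,) row broadcasts to an (n, n) matrix of differences
m = np.tril(np.sqrt(np.power(df[['X']].to_numpy() - df['X'].to_numpy(), 2) +
                    np.power(df[['Y']].to_numpy() - df['Y'].to_numpy(), 2)))
# Blank out the diagonal and upper triangle so only earlier rows are compared
m[np.triu_indices(m.shape[0])] = np.nan
out = df[~np.any(m < thresh, axis=1)]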
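
If the broadcasting step above looks opaque, it may help to see it on a tiny case. This is only an illustrative sketch with made-up points (x, y and the threshold of 1.0 are not from the question): subtracting a 1-D array from its column-vector form yields every pairwise difference at once.

```python
import numpy as np

# Three made-up points: (0, 0), (3, 4) and (0, 0.5)
x = np.array([0.0, 3.0, 0.0])
y = np.array([0.0, 4.0, 0.5])

# A (3, 1) column minus a (3,) row broadcasts to a (3, 3) matrix of differences
dx = x[:, None] - x
dy = y[:, None] - y
dist = np.sqrt(dx**2 + dy**2)

# Keep only comparisons against earlier rows (strictly below the diagonal)
close_to_earlier = np.tril(dist < 1.0, k=-1)

# A row is a duplicate if it is close to at least one earlier row
is_duplicate = close_to_earlier.any(axis=1)
```

Here only the third point is flagged, because it is 0.5 away from the first one.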

We could also write it a bit more concisely and legibly (taking a leaf out of @BENY's elegant solution) by using the k parameter of numpy.tril to mask the diagonal and upper half of the symmetric matrix directly:

distances = np.sqrt(np.sum([(df[[c]].to_numpy() - df[c].to_numpy())**2 
                            for c in ('X','Y')], axis=0))
msk = np.tril(distances < thresh, k=-1).any(axis=1)
out = df[~msk]

Output:

          X             Y       T
0  0.075000  1.000000e-15  12.570
1  0.079167  0.000000e+00  12.302
2  0.074954 -2.617460e-03  12.560
3  0.079118 -2.762880e-03  12.292
5  0.083333  0.000000e+00  12.052
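
One caveat with the mask approach: a row is dropped if it is close to any earlier row, including an earlier row that is itself dropped. If you would rather compare each row only against rows that are actually kept, a greedy loop is one option. The sketch below is an assumption about what you might want, not part of the solutions above; drop_near_duplicates and the chain DataFrame are hypothetical names made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_near_duplicates(df, thresh, cols=('X', 'Y')):
    """Keep a row only if it is at least thresh away from every row kept so far."""
    pts = df[list(cols)].to_numpy()
    kept = []  # positional indices of the rows kept so far
    for i, p in enumerate(pts):
        if kept:
            d = np.sqrt(((pts[kept] - p) ** 2).sum(axis=1))
            if (d < thresh).any():
                continue  # too close to a kept row, skip it
        kept.append(i)
    return df.iloc[kept]

# Made-up chain of points: each is 0.0008 from its neighbour
chain = pd.DataFrame({'X': [0.0, 0.0008, 0.0016], 'Y': [0.0, 0.0, 0.0]})
thresh = 0.001

greedy = drop_near_duplicates(chain, thresh)

# The mask approach on the same data, for comparison
pts = chain[['X', 'Y']].to_numpy()
dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2))
masked = chain[~np.tril(dist < thresh, k=-1).any(axis=1)]
```

On this chain the greedy version keeps rows 0 and 2, while the mask version keeps only row 0, because row 2 is close to the (dropped) row 1. On the sample data from the question the two approaches should agree. The loop is slower than the vectorized mask, so it is probably only worth it if this distinction matters for your data.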