I'm trying to remove rows from a DataFrame that lie within a Euclidean distance threshold of other points listed in the same DataFrame. For example, in the small DataFrame below, two rows would be removed if the threshold were set to 0.001 (1 mm: thresh = 0.001), where X and Y are spatial coordinates:
import pandas as pd
data = {'X': [0.075, 0.0791667,0.0749543,0.0791184,0.075,0.0833333, 0.0749543],
'Y': [1e-15, 0,-0.00261746,-0.00276288, -1e-15,0,-0.00261756],
'T': [12.57,12.302,12.56,12.292,12.57,12.052,12.56]}
df = pd.DataFrame(data)
df
#           X             Y       T
# 0  0.075000  1.000000e-15  12.570
# 1  0.079167  0.000000e+00  12.302
# 2  0.074954 -2.617460e-03  12.560
# 3  0.079118 -2.762880e-03  12.292
# 4  0.075000 -1.000000e-15  12.570
# 5  0.083333  0.000000e+00  12.052
# 6  0.074954 -2.617560e-03  12.560
The rows with indices 4 and 6 need to be removed because they are spatial duplicates of rows 0 and 2, respectively, since they are within the specified threshold distance of previously listed points. Also, I always want to remove the 2nd occurrence of a point that is within the threshold distance of a previous point. What's the best way to approach this?
CodePudding user response:
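You mentioned the keyword "distance", so we can use cdist from scipy:
import numpy as np
from scipy.spatial.distance import cdist

v = df[['X', 'Y']]
# Pairwise Euclidean distances between all rows (n x n symmetric matrix)
ary = cdist(v, v, metric='euclidean')
# Strict lower triangle (k=-1): row i is compared only with earlier rows j < i
df[~np.tril(ary < 0.001, k=-1).any(axis=1)]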
Out[100]:
          X             Y       T
0  0.075000  1.000000e-15  12.570
1  0.079167  0.000000e+00  12.302
2  0.074954 -2.617460e-03  12.560
3  0.079118 -2.762880e-03  12.292
5  0.083333  0.000000e+00  12.052
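For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the function name drop_near_duplicates and its parameters cols and thresh are made up for illustration and are not part of pandas or SciPy. It builds the full n x n distance matrix, so memory use grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def drop_near_duplicates(frame, cols=("X", "Y"), thresh=0.001):
    """Hypothetical helper: drop rows closer than `thresh` to any earlier row."""
    pts = frame[list(cols)].to_numpy()
    dist = cdist(pts, pts, metric="euclidean")
    # Strict lower triangle: entry (i, j) survives only when j < i,
    # so each row is tested against the rows listed before it.
    close_to_earlier = np.tril(dist < thresh, k=-1).any(axis=1)
    return frame[~close_to_earlier]


# Sample data copied from the question
data = {'X': [0.075, 0.0791667, 0.0749543, 0.0791184, 0.075, 0.0833333, 0.0749543],
        'Y': [1e-15, 0, -0.00261746, -0.00276288, -1e-15, 0, -0.00261756],
        'T': [12.57, 12.302, 12.56, 12.292, 12.57, 12.052, 12.56]}
df = pd.DataFrame(data)

result = drop_near_duplicates(df, thresh=0.001)
print(result.index.tolist())  # [0, 1, 2, 3, 5]
```

Note that with this mask a row is dropped when it is close to any earlier row, including an earlier row that is itself dropped. For the sample data that distinction does not matter.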
CodePudding user response:
Here is another approach. Compute the Euclidean distance for every pair of (X, Y) points, which yields a symmetric matrix. Then mask the upper half (including the diagonal) and, in the lower half, drop every row that contains a value less than thresh:
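import numpy as np
thresh = 0.001  # threshold from the question (1 mm)
# An (n, 1) column minus an (n,) array broadcasts to an (n, n) matrix of differences
m = np.tril(np.sqrt(np.power(df[['X']].to_numpy() - df['X'].to_numpy(), 2)
                    + np.power(df[['Y']].to_numpy() - df['Y'].to_numpy(), 2)))
# NaN compares as False, so the diagonal and upper half never trigger a removal
m[np.triu_indices(m.shape[0])] = np.nan
out = df[~np.any(m < thresh, axis=1)]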
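The subtraction in this snippet relies on NumPy broadcasting: df[['X']].to_numpy() has shape (n, 1) while df['X'].to_numpy() has shape (n,), so their difference is an (n, n) matrix of pairwise differences. A minimal illustration with three made-up points (not taken from the question), where x[:, None] plays the role of the column-shaped array:

```python
import numpy as np

# Three toy points: (0, 0), (3, 4), (0, 1)
x = np.array([0.0, 3.0, 0.0])
y = np.array([0.0, 4.0, 1.0])

# x[:, None] has shape (3, 1); subtracting the (3,) array gives shape (3, 3),
# where entry (i, j) is x[i] - x[j]
dx = x[:, None] - x
dy = y[:, None] - y
dist = np.sqrt(dx**2 + dy**2)

print(dist.shape)  # (3, 3)
print(dist[1, 0])  # 5.0, the distance between (3, 4) and (0, 0)
```

The resulting matrix is symmetric with zeros on the diagonal, which is why only the part strictly below the diagonal is needed.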
We could also write it a bit more concisely and legibly (taking a leaf out of @BENY's elegant solution) by using the k parameter of numpy.tril to directly mask the upper half of the symmetric matrix:
distances = np.sqrt(np.sum([(df[[c]].to_numpy() - df[c].to_numpy())**2
                            for c in ('X', 'Y')], axis=0))
msk = np.tril(distances < thresh, k=-1).any(axis=1)
out = df[~msk]
Output:
          X             Y       T
0  0.075000  1.000000e-15  12.570
1  0.079167  0.000000e+00  12.302
2  0.074954 -2.617460e-03  12.560
3  0.079118 -2.762880e-03  12.292
5  0.083333  0.000000e+00  12.052
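Both answers build a full n x n matrix, which can become expensive for large DataFrames. As a possible alternative that neither answer uses, the sketch below swaps in SciPy's cKDTree and its query_pairs method, which returns only the index pairs (i, j) with i < j that lie within the radius. One assumption to be aware of: query_pairs treats the radius as "at most r", whereas the code above uses a strict "less than", so points exactly at the threshold would be handled differently.

```python
import pandas as pd
from scipy.spatial import cKDTree

# Sample data copied from the question
data = {'X': [0.075, 0.0791667, 0.0749543, 0.0791184, 0.075, 0.0833333, 0.0749543],
        'Y': [1e-15, 0, -0.00261746, -0.00276288, -1e-15, 0, -0.00261756],
        'T': [12.57, 12.302, 12.56, 12.292, 12.57, 12.052, 12.56]}
df = pd.DataFrame(data)
thresh = 0.001

tree = cKDTree(df[['X', 'Y']].to_numpy())
# Set of (i, j) positions with i < j whose distance is at most thresh
pairs = tree.query_pairs(r=thresh)
# The second element of each pair is the later occurrence, which is dropped
later = sorted({j for _, j in pairs})
out = df.drop(index=df.index[later])

print(out.index.tolist())  # [0, 1, 2, 3, 5]
```

For the sample data the pairs found are (0, 4) and (2, 6), so the result matches the output above.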