Find the closest index with "True" and calculating the distance (Pandas)

Time:05-09

I have a DataFrame like this:

idx Var1 Var2 Var3
0 True False False
1 False True False
2 True False True
3 False False False
4 True False True

I'd like to create three new columns holding, for each row, the distance (in rows) to the closest True in the same column. If the row itself has a True, the distance should be 0. So I would get this:

idx Var1 Var2 Var3 distV1 distV2 distV3
0 True False False 0 1 2
1 False True False 1 0 1
2 True False True 0 1 0
3 False False False 1 2 1
4 True False True 0 3 0

I have read all other discussions related to this topic but haven't been able to find an answer for something like this.

CodePudding user response:

Here is one approach with numpy ops:

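import numpy as np

for c in list(df.columns):  # snapshot the names, since the loop adds columns
    r = np.where(df[c])[0]  # row positions of the True values
    d = abs(df.index.values[:, None] - r)  # gap between every row and every True row
    df[f'{c}_dist'] = d.min(axis=1)  # smallest gap per row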

print(df)

    Var1   Var2   Var3  Var1_dist  Var2_dist  Var3_dist
0   True  False  False          0          1          2
1  False   True  False          1          0          1
2   True  False   True          0          1          0
3  False  False  False          1          2          1
4   True  False   True          0          3          0
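To make the broadcasting step easier to follow, here is a minimal self-contained sketch of the same idea on the question's data. The helper name nearest_true_distance is hypothetical, and the sketch assumes distances are counted in row positions rather than index labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Var1": [True, False, True, False, True],
    "Var2": [False, True, False, False, False],
    "Var3": [False, False, True, False, True],
})


def nearest_true_distance(col: pd.Series) -> np.ndarray:
    """Distance (in rows) from each position to the nearest True in col."""
    true_pos = np.flatnonzero(col.to_numpy())  # positions of the True values
    all_pos = np.arange(len(col))              # positional row numbers
    # (n, 1) minus (m,) broadcasts to an (n, m) matrix of pairwise row gaps
    gaps = np.abs(all_pos[:, None] - true_pos)
    return gaps.min(axis=1)                    # smallest gap in each row


dist = pd.DataFrame(
    {f"{c}_dist": nearest_true_distance(df[c]) for c in df.columns},
    index=df.index,
)
result = df.join(dist)
print(result)
```

The intermediate matrix has one row per DataFrame row and one column per True value, so memory grows with their product. A column with no True at all would raise an error here, because there is nothing to take the minimum over.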

CodePudding user response:

Code

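import numpy as np
from scipy.spatial import KDTree

array = df.to_numpy()
bmp = array.astype(np.uint8)
all_points = np.argwhere(bmp != 2)   # coordinates of every cell (no value equals 2)
true_points = np.argwhere(bmp == 1)  # coordinates of the True cells
tree = KDTree(true_points)           # index the True cells
distance = tree.query(all_points, k=1, p=1)[0]
distance = distance.reshape(array.shape)
df[[c + "_dist" for c in df.columns]] = distance.astype(int)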

Output

      Var1   Var2   Var3  Var1_dist  Var2_dist  Var3_dist
idx                                                      
0     True  False  False          0          1          2
1    False   True  False          1          0          1
2     True  False   True          0          1          0
3    False  False  False          1          2          1
4     True  False   True          0          1          0

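Explanation

  1. Casting the DataFrame's values to np.uint8 gives an array of 0s and 1s:
array([[1, 0, 0],
       [0, 1, 0],
       [1, 0, 1],
       [0, 0, 0],
       [1, 0, 1]], dtype=uint8)
  2. np.argwhere returns the (row, column) coordinates of every cell that satisfies the condition. bmp != 2 holds everywhere, so all_points contains every cell; bmp == 1 picks out the True cells.

  3. KDTree is a classic data structure for nearest-neighbour search. Two arguments of query matter here:

    1. k is the number of nearest neighbours to return, so k=1 returns only the closest one.

    2. p=1 selects the "Manhattan" distance. From the SciPy documentation:

    Which Minkowski p-norm to use.

    1 is the sum-of-absolute-values distance ("Manhattan" distance).

    2 is the usual Euclidean distance.

  4. Because the points are (row, column) coordinates, the nearest True may lie in a neighbouring column. That is why Var2_dist in row 4 is 1 here rather than the 3 requested in the question.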
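If the distance should be measured within each column only, as the question asks, one possible variant builds a separate one-dimensional tree per column. This is a sketch under that assumption, not the answer's original code; the names rows, dist and per_column are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

df = pd.DataFrame({
    "Var1": [True, False, True, False, True],
    "Var2": [False, True, False, False, False],
    "Var3": [False, False, True, False, True],
})

# Every row position as a one-dimensional query point, shape (n, 1)
rows = np.arange(len(df)).reshape(-1, 1)

dist = {}
for c in df.columns:
    # Row positions of this column's True values, shape (m, 1)
    true_rows = np.flatnonzero(df[c].to_numpy()).reshape(-1, 1)
    tree = KDTree(true_rows)           # tree over this column only
    d, _ = tree.query(rows, k=1, p=1)  # nearest neighbour, Manhattan metric
    dist[f"{c}_dist"] = d.astype(int)

per_column = df.join(pd.DataFrame(dist, index=df.index))
print(per_column)
```

With one-dimensional points every Minkowski norm reduces to the absolute difference, so p=1 and p=2 give the same result here; p only matters for the two-dimensional grid search.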

Reference

scipy.spatial.KDTree
