Suppose I have a list that stores many 2D points. In this list, some positions are stored the same points, consider the index of positions that stored the same point as an index pair. I want to find all the pairs in the list and return all 2 by 2 index pairs. It is possible that the list has some points repeated more than two times, but only the first match needs to be treated as a pair.
For example, in the below list, I have 9 points in total and there are 5 positions containing repeated points. The indices 0, 3, and 7 store the same point ([1, 1]
), and the indicies 1 and 6 store the same point ([2, 3]
).
[[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]]
So, for this list, I want to return the index pair as (index 0, index 3) and (index 1, index 6). The only solution I can come up with is doing this is through nested loops, which I code up as following
A = np.array([[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]], dtype=int)
# I don't want to modified the original list, looping through a index list insted.
Index = np.arange(0, A.shape[0], 1, dtype=int)
Pair = [] # for store the index pair
while Index.size != 0:
current_index = Index[0]
pi = A[current_index]
Index = np.delete(Index, 0, 0)
for j in range(Index.shape[0]):
pj = A[Index[j]]
distance = linalg.norm(pi - pj, ord=2, keepdims=True)
if distance == 0:
Pair.append([current_index, Index[j]])
Index = np.delete(Index, j, 0)
break
While this code works for me but the time complexity is O(n^2)
, where n == len(A)
, I'm wondering if is there any more efficient way to do this job with a lower time complexity. Thanks for any ideas and help.
CodePudding user response:
You can use a dictionary to keep track of the indices for each point.
Then, you can iterate over the items in the dictionary, printing out the indices corresponding to points that appear more than once. The runtime of this procedure is linear, rather than quadratic, in the number of points in A
:
points = {}
for index, point in enumerate(A):
point_tuple = tuple(point)
if point_tuple not in points:
points[point_tuple] = []
points[point_tuple].append(index)
for point, indices in points.items():
if len(indices) > 1:
print(indices)
This prints out:
[0, 3, 7]
[1, 6]
If you only want the first two indices where a point appears, you can use print(indices[:2])
rather than print(indices)
.
CodePudding user response:
This is similar to the other answer, but since you only want the first two in the event of multiple pairs you can do it in a single iteration. Add the indices under the appropriate key in a dict and yield the indices if (and only if) there are two points:
from collections import defaultdict
l = [[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]]
def get_pairs(l):
ind = defaultdict(list)
for i, pair in enumerate(l):
t = tuple(pair)
ind[t].append(i)
if len(ind[t]) == 2:
yield list(ind[t])
list(get_pairs(l))
# [[0, 3], [1, 6]]
CodePudding user response:
One pure-Numpy solution without loops (the only one so far) is to use np.unique
twice with a trick that consists in removing the first items found between the two searches. This solution assume a sentinel can be set (eg. -1, the minimum value of an integer, NaN) which is generally not a problem (you can use bigger types if needed).
A = np.array([[1, 1], [2, 3], [1, 4], [1, 1], [10, 3], [5, 2], [2, 3], [1, 1], [3, 4]], dtype=int)
# Copy the array not to mutate it
tmp = A.copy()
# Find the location of unique values
pair1, index1 = np.unique(tmp, return_index=True, axis=0)
# Discard the element found assuming -1 is never stored in A
INT_MIN = np.iinfo(A.dtype).min
tmp[index1] = INT_MIN
# Find the location of duplicated values
pair2, index2 = np.unique(tmp, return_index=True, axis=0)
# Extract the indices that share the same pair of values found
left = index1[np.isin(pair1, pair2).all(axis=1)]
right = index2[np.isin(pair2, pair1).all(axis=1)]
# Combine the each left index with each right index
result = np.hstack((left[:,None], right[:,None]))
# result = array([[0, 3],
# [1, 6]])
This solution should run in O(n log n)
time as np.unique
uses a basic sort internally (more specifically quick-sort).