I need to check if there are any identical rows in a matrix that contains integers values For example :
A = [[ 2 0 0 0 0 0 0]
[ 2 3 5 0 0 0 0]
[ 2 3 0 0 0 0 0]
[ 2 0 5 7 0 0 0]
[ 2 0 5 0 8 0 10]
[ 2 3 5 0 0 9 0]
[ 0 0 0 7 8 9 0]
[ 0 0 5 7 8 9 0]
[ 0 0 0 7 8 9 0]
[ 0 0 5 0 0 0 10]]
This must return : True
But in other example :
B = [[ 2 0 0 0 0 0 0]
[ 2 3 5 0 0 0 0]
[ 2 3 0 0 0 0 0]
[ 2 0 5 7 0 0 0]
[ 2 0 5 0 8 0 10]
[ 2 3 5 0 0 9 0]
[ 0 0 1 7 8 9 0]
[ 0 0 5 7 8 9 0]
[ 0 0 0 7 8 9 0]
[ 0 0 5 0 0 0 10]]
This must return : False
CodePudding user response:
Approach 1: Using np.unique
You can use
np.unique
over axis=0 to fetch unique rows.The
return_counts=True
will return the number of times each row repeats.Putting a condition
c>1
and checking if any of the rows matches that condition with.any()
will give you what you want.
def is_dup_simple(arr):
u, c = np.unique(arr, axis=0, return_counts=True)
return (c>1).any()
print(is_dup_simple(A))
print(is_dup_simple(B))
True
False
Approach 2: Broadcasting
Here is how you can do this with broadcasting operations. Its a slightly longer way, but let's you be very flexible with the approach (example, finding duplicates between different arrays)
def is_dup(arr):
mask = ~np.eye(arr.shape[0], dtype=bool)
out = ((arr[None,:,:] == arr[:,None,:]).all(-1)&mask).any()
return out
print(is_dup(A))
print(is_dup(B))
True
False
Step by step broadcasting table -
arr[None,:,:] -> 1 , 10, 7 (adding new first axis)
arr[:,None,:] -> 10, 1 , 7 (adding new second axis)
--------------------------
== -> 10, 10, 7 (compare elements row-wise)
all(-1) -> 10, 10 (compare rows x rows)
& mask -> 10, 10 (diagonals false)
any() -> 1 (reduce to single bool)
--------------------------
EXPLANATION
From the 10,7 array (10 rows, 7 columns), you want to match elements such that you end up with a (10,10) matrix with boolean values which indicate if all the elements in the 7 rows match ANY of the elements in any other rows.
The
mask = ~np.eye(arr.shape[0], dtype=bool)
is specifically a matrix of 10,10 shape with false values along the diagonal. The reason for this is, because you want to ignore comparing the row with itself. More about this later.Starting with the broadcasted boolean operation -
(arr[None,:,:] == arr[:,None,:])
. This results in a 10,10,7 boolean array which compares the elements of every row, with all the other rows (10 x 10 comparisons, 7 values matched).Now, with
.all(-1)
you reduce the last axis and get a 10,10 matrix which contains True, if all 7 elements match any other row, else false even if a single element is different.Next, as you realize, row 0 will always match row 0, so will row 1 match row 1. Therefore the diagonal will always be true in this matrix. For us to deduce if there are duplicate rows, we have to ignore the True values of the diagonal. This can be done by doing an
&
(and) operation between the mask (discussed above) and the 10,10 boolean array. The only change that happens because of this, is that diagonal elements become False instead of True.Finally, you can reduce the array to a single boolean by using
.any()
which will be True if even a single element in the new 10,10 matrix is True (which indicates that there is a row x that matches row y exactly AND row x is not the same as row y, thanks to the mask)