Element-wise comparison of numpy arrays (Python)


I have a question about the numpy array below.

I have a dataset with 50 rows and 15 columns, and I created a numpy array from it like this:

x=x.to_numpy()
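For a self-contained sketch, assuming x starts as a pandas DataFrame (the column names and values here are the 4-row sample from the table below, standing in for the real 50×15 dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real 50-row, 15-column dataset;
# these are the sample values from the table in the question.
x = pd.DataFrame({"a": [1, 2, 4, 7],
                  "b": [6, 6, 7, 9],
                  "c": [2, 8, 12, 13]})
x = x.to_numpy()  # now a plain (4, 3) numpy ndarray
```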

My aim is to compare each row with every other row (element-wise, excluding the row itself) and find whether there is any row whose values are all smaller than that row's.

Sample table:

a b c         
1 6 2
2 6 8
4 7 12
7 9 13

For example, for rows 1 and 2 there is no such row. But for rows 3 and 4 there is: all values of rows 1 and 2 are smaller than those of row 3, and likewise smaller than those of row 4. So the algorithm should return the count 2 (indicating rows 3 and 4).

What Python code would produce this result?

I have tried a number of approaches but could not reach a proper solution, so any ideas would be appreciated.

CodePudding user response:

A one-liner (with one loop; not ideal, but better than two loops):

sum((r>x).all(axis=1).any() for r in x)

r>x is an array of booleans comparing each element of row r to each element of x. For example, when r is row x[2], then r>x is

array([[ True,  True,  True],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False]])

So (r>x).all(axis=1) is a boolean array of shape (4,) telling whether all booleans in each row are True (.all(axis=1) reduces along columns only). In the previous example, that would be [True, True, False, False]. (x[1]>x).all(axis=1) would be [False, False, False, False] (the first row of x[1]>x contains two True values, but that is not enough for .all).

So (r>x).all(axis=1).any() tells you what you want to know: whether there is any row in which all columns are True, that is, whether there is any True in the previous array.
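The intermediate steps above can be checked directly on the sample data:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

r = x[2]                    # row [4, 7, 12]
mask = (r > x).all(axis=1)  # per-row "all columns True"
print(mask)                 # → [ True  True False False]
print(mask.any())           # → True: some row is smaller in every column

print((x[1] > x).all(axis=1).any())  # → False: no row is entirely below x[1]
```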

((r>x).all(axis=1).any() for r in x) is a generator that performs this computation for every row r of x. If you replaced the outer ( ) with [ ], you would get a list of booleans (False, False, True, True, to be accurate; as you already said, False for the first two rows, True for the other two). But there is no need to build a list here, since we just want to count: a generator produces results only as the caller requires them, and here the caller is sum.

sum((r>x).all(axis=1).any() for r in x) counts the number of times we get True in the previous computation.

(In this case, with only 4 elements in the list, using a generator rather than a list comprehension does not save much memory. But it is a good habit to favor generators when we don't actually need all the intermediate results in memory.)
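Putting it all together on the sample data, as a quick check of the one-liner:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# Count the rows that have at least one other row entirely below them.
count = sum((r > x).all(axis=1).any() for r in x)
print(count)  # → 2 (rows 3 and 4)
```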

Timings

For your example, the computation takes 48 μs (compared to 115 μs for the corrected version from di.bezrukov).

But the difference shows when the number of rows grows. For 10000×3 data, this computation takes 3.9 seconds, while di.bezrukov's method takes 353 seconds.

Still, both methods are O(n²), or rather O(n²×m) if we call m the number of columns.

We both have 3 nested loops. di.bezrukov's version has two explicit Python for loops and one implicit loop inside .all (still a for loop, even if it runs in numpy's internal code). Mine has one Python-level loop in the generator expression and two implicit loops, .all and .any.

So the time structure is the same; only numpy's loops are faster.
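As an aside, the remaining Python loop can also be pushed into numpy with broadcasting (my own variant, not from either answer). It is still O(n²×m) in time and additionally needs O(n²×m) memory for the boolean cube, so it only pays off for moderate n:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# (n, n, m) cube: cube[i, j, k] is True iff x[i, k] > x[j, k].
cube = x[:, None, :] > x[None, :, :]
# Rows i for which some row j is smaller in every column.
dominant = cube.all(axis=2).any(axis=1)
print(dominant.sum())  # → 2
```

Note that the i == j diagonal of the cube is all False (no value is strictly greater than itself), so self-comparison is excluded automatically.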

CodePudding user response:

Just use two loops and compare

import numpy as np

def f(x):
    count = 0

    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            if i == j:
                continue
            if np.all(x[i] > x[j]):
                count += 1
                break

    return count

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(f(x))