I have a question about a NumPy array.
I have a dataset with 50 rows and 15 columns,
and I created a NumPy array from it like this:
x = x.to_numpy()
My aim is to compare each row with every other row (elementwise, excluding itself) and find whether there is any row whose values are all smaller than that row's.
Sample table:
a b c
1 6 2
2 6 8
4 7 12
7 9 13
For example, for rows 1 and 2 there is no such row. But for rows 3 and 4 there is a row whose values are all smaller than theirs. So the algorithm should return the count 2 (indicating rows 3 and 4).
What Python code would produce this result?
I have tried a bunch of code but could not reach a proper solution, so if anyone has an idea I would appreciate it.
CodePudding user response:
A one-liner (with one loop; not ideal, but better than two loops):
sum((r>x).all(axis=1).any() for r in x)
r>x is an array of booleans comparing each element of row r to each element of x (r is broadcast against every row of x). So, for example, when r is row x[2], then r>x is
array([[ True,  True,  True],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False]])
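To see that broadcast in action, here is a minimal runnable sketch using the sample array from the question:

```python
import numpy as np

# Sample data from the question
x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
r = x[2]  # row [4, 7, 12]

# Broadcasting compares r elementwise against every row of x at once,
# producing a 4x3 boolean array
print(r > x)
```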
So (r>x).all(axis=1) is a shape (4,) array of booleans telling whether all the booleans in each row are True (.all with axis=1 reduces across the columns only). In the previous example, that would be [True, True, False, False]. (x[1]>x).all(axis=1) would be [False, False, False, False] (the first row of x[1]>x contains two Trues, but that is not enough for .all).
So (r>x).all(axis=1).any() tells you what you want to know: whether there is any row in which all columns are True, that is, whether there is any True in the previous array.
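The two reductions can be checked step by step; a small sketch with the same sample data:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# .all(axis=1): is every column True in each row of the comparison?
print((x[2] > x).all(axis=1))  # [ True  True False False]
print((x[1] > x).all(axis=1))  # [False False False False]

# .any(): is at least one of those rows entirely True?
print((x[2] > x).all(axis=1).any())  # True
print((x[1] > x).all(axis=1).any())  # False
```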
((r>x).all(axis=1).any() for r in x) is a generator expression that performs this computation for every row r of x. If you replaced the outer ( ) by [ ], you would get a list of True and False values (False, False, True, True, to be accurate, as you've already said: False for the first two rows, True for the other two). But there is no need to build a list here, since we just want to count. A generator produces results only as the caller requires them, and here the caller is sum.
sum((r>x).all(axis=1).any() for r in x)
counts the number of times that computation yields True.
(In this case, since there are only 4 elements, using a generator expression rather than a list comprehension does not save much memory. But it is a good habit to favor generator expressions when we don't really need to keep all the intermediate results in memory.)
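Replacing the generator with a list comprehension makes those intermediate results visible; a quick sketch:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# One boolean per row: does some other row sit entirely below it?
flags = [bool((r > x).all(axis=1).any()) for r in x]
print(flags)       # [False, False, True, True]
print(sum(flags))  # 2
```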
Timings
For your example, the computation takes 48 μs (compared to 115 μs for the corrected version from di.bezrukov).
But the difference shows when the number of rows grows. For 10000×3 data, this computation takes 3.9 seconds, while di.bezrukov's method takes 353 seconds.
Still, both methods are O(n²); O(n²×m), even, if we call m the number of columns.
We both have 3 nested loops. Di.bezrukov's has two explicit Python for loops and one implicit loop in the .all (still a for loop, even if it runs in NumPy's internal code).
I have 1 explicit Python for loop, and 2 implicit loops, .all and .any.
So the loop structure is the same; only NumPy's loops are faster.
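Not part of either answer, but for comparison: the remaining Python loop can itself be pushed into NumPy with a single broadcast comparison. This is only a sketch; it is still O(n²×m) work and materializes an n×n×m boolean array, so it only pays off when that fits comfortably in memory:

```python
import numpy as np

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])

# greater[i, j] is True when every value of row i exceeds row j.
# The diagonal is False automatically, since x[i] > x[i] is never all True.
greater = (x[:, None, :] > x[None, :, :]).all(axis=2)

# A row counts if it strictly dominates at least one other row.
print(int(greater.any(axis=1).sum()))  # 2
```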
CodePudding user response:
Just use two loops and compare:
import numpy as np

def f(x):
    count = 0
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            if i == j:
                continue
            if np.all(x[i] > x[j]):
                count += 1  # count row i once, then stop scanning
                break
    return count

x = np.array([[1, 6, 2], [2, 6, 8], [4, 7, 12], [7, 9, 13]])
print(f(x))  # 2