I have a pd.DataFrame containing rows of values:
import numpy as np
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6], "col2": [6, 5, 4, 3, 2, 1]})
I now want to find an efficient way to create an np.array matrix from the output of a function applied to every pair of rows, using both columns:
def my_function(x1, x2, y1, y2):
    return x1 > y1 and x2 < y2
The naive O(N²) way of solving this would be as follows:
matrix = []
for _, (x1, x2) in df.iterrows():
    row = []
    for _, (y1, y2) in df.iterrows():
        row.append(my_function(x1, x2, y1, y2))
    matrix.append(row)
Giving us:
>>> np.array(matrix)
array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True, False, False, False],
       [ True,  True,  True,  True, False, False],
       [ True,  True,  True,  True,  True, False]])
Is there a more efficient way that scales to more values?
CodePudding user response:
You can try np.vectorize:
def my_function(x, y):
    x1, x2 = x
    y1, y2 = y
    return x1 > y1 and x2 < y2
arr = df.to_records(index=False)
f_vfunc = np.vectorize(my_function)
r = f_vfunc(arr[:, None], arr)
print(r)
[[False False False False False False]
 [ True False False False False False]
 [ True  True False False False False]
 [ True  True  True False False False]
 [ True  True  True  True False False]
 [ True  True  True  True  True False]]
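If you would rather keep the four-argument my_function from the question, a minimal sketch along the same lines is to pass the columns as separate broadcastable arrays instead of records. The names a, b and pairwise are illustrative and not part of the answer above; the arrays simply repeat the values of col1 and col2.

```python
import numpy as np

# Same values as df["col1"] and df["col2"] in the question.
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([6, 5, 4, 3, 2, 1])

def my_function(x1, x2, y1, y2):
    return x1 > y1 and x2 < y2

# np.vectorize broadcasts its inputs: the (N, 1) columns pair with the
# (N,) rows, so my_function is called once per (i, j) pair of rows.
pairwise = np.vectorize(my_function)
r = pairwise(a[:, None], b[:, None], a, b)
print(r)
```

This avoids unpacking records inside the function, but it is still one Python-level call per pair of rows.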
CodePudding user response:
numpy.vectorize is not needed here; you can write vectorized code directly using broadcasting (and vectorize does not improve speed, since it essentially acts as a Python-level loop):
a = df['col1'].to_numpy()
b = df['col2'].to_numpy()
matrix = (a[:,None]>a)&(b[:,None]<b)
Output:
array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True, False, False, False],
       [ True,  True,  True,  True, False, False],
       [ True,  True,  True,  True,  True, False]])
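To see why this one-liner matches the pairwise function, here is a small sketch that checks it against an explicit double loop. The names col_a and expected are only for illustration, and the arrays repeat the values of col1 and col2 from the question. Note that & is used rather than and, because and does not operate elementwise on arrays.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])  # same values as df["col1"]
b = np.array([6, 5, 4, 3, 2, 1])  # same values as df["col2"]

# a[:, None] has shape (N, 1); comparing it with a (shape (N,)) broadcasts
# to (N, N), so entry [i, j] compares row i against row j.
col_a = a[:, None]
matrix = (col_a > a) & (b[:, None] < b)

# Reference result from the explicit double loop, for comparison only.
n = len(a)
expected = np.array(
    [[a[i] > a[j] and b[i] < b[j] for j in range(n)] for i in range(n)]
)

print(col_a.shape, matrix.shape)
print(np.array_equal(matrix, expected))
```

The last line should print True, confirming that both approaches build the same boolean matrix.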
Speed comparison:
%%timeit
f_vfunc(arr[:, None], arr)
37.2 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
(a[:,None]>a)&(b[:,None]<b)
2.44 µs ± 84.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
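Six rows is a very small input, so here is a rough sketch for repeating the comparison on a larger one outside IPython, using the standard timeit module. The size n = 300 and the random data are arbitrary assumptions, and the np.vectorize variant here takes four separate arguments rather than records, so the absolute numbers will not match the ones above and will vary by machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n = 300  # arbitrary size chosen for illustration
a = rng.integers(0, 1000, size=n)
b = rng.integers(0, 1000, size=n)

def my_function(x1, x2, y1, y2):
    return x1 > y1 and x2 < y2

f_vfunc = np.vectorize(my_function)

def with_vectorize():
    # One Python-level call of my_function per (i, j) pair.
    return f_vfunc(a[:, None], b[:, None], a, b)

def with_broadcasting():
    # Two compiled comparisons plus one elementwise AND.
    return (a[:, None] > a) & (b[:, None] < b)

# Both approaches should agree before their timings are compared.
same = np.array_equal(with_vectorize(), with_broadcasting())

t_vec = timeit.timeit(with_vectorize, number=5) / 5
t_bc = timeit.timeit(with_broadcasting, number=5) / 5
print(f"same result: {same}")
print(f"np.vectorize: {t_vec * 1e3:.2f} ms per call")
print(f"broadcasting: {t_bc * 1e3:.2f} ms per call")
```

Keep in mind that both versions still build an N×N boolean matrix, so memory use grows quadratically with the number of rows either way.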