Create numpy array from function applied to (multiple) pandas columns

Time:05-06

I have a pd.DataFrame containing rows of values:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6], "col2": [6, 5, 4, 3, 2, 1]})

I now want to find an efficient way to create a np.array matrix based on the output from a function applied to both columns:

def my_function(x1, x2, y1, y2):
    return x1 > y1 and x2 < y2

The naive O(N²) way of solving this would be as follows:

matrix = []
for _, (x1, x2) in df.iterrows():
    row = []
    for _, (y1, y2) in df.iterrows():
        row.append(my_function(x1, x2, y1, y2))
    matrix.append(row)

Giving us:

>>> np.array(matrix)

array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True, False, False, False],
       [ True,  True,  True,  True, False, False],
       [ True,  True,  True,  True,  True, False]])

Is there a more efficient way that scales to more values?

CodePudding user response:

You can try np.vectorize:

def my_function(x, y):
    x1, x2 = x
    y1, y2 = y
    return x1 > y1 and x2 < y2


arr = df.to_records(index=False)
f_vfunc = np.vectorize(my_function)
r = f_vfunc(arr[:, None], arr)
print(r)

[[False False False False False False]
 [ True False False False False False]
 [ True  True False False False False]
 [ True  True  True False False False]
 [ True  True  True  True False False]
 [ True  True  True  True  True False]]
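As a variant of the same idea, np.vectorize can also be applied to the original four-argument my_function, passing the two columns as plain arrays instead of a record array. This is only a minimal sketch; the names a and b are introduced here for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6], "col2": [6, 5, 4, 3, 2, 1]})

def my_function(x1, x2, y1, y2):
    return x1 > y1 and x2 < y2

# a and b are illustrative names for the two columns as 1-D arrays
a = df["col1"].to_numpy()
b = df["col2"].to_numpy()

# Column vectors of shape (N, 1) supply x1 and x2; 1-D arrays of shape (N,)
# supply y1 and y2. np.vectorize broadcasts them to (N, N) and calls
# my_function once per element, so it is still a Python-level loop.
f_vfunc = np.vectorize(my_function)
r = f_vfunc(a[:, None], b[:, None], a, b)
print(r.shape)  # (6, 6)
```

Entry r[i, j] is my_function applied to row i (as x) and row j (as y), which matches the layout of the nested loop in the question.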

CodePudding user response:

numpy.vectorize is not needed here; you can write vectorized code directly. (vectorize does not improve speed either: it acts as a loop.)

a = df['col1'].to_numpy()
b = df['col2'].to_numpy()

matrix = (a[:,None]>a)&(b[:,None]<b)

output:

array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True, False, False, False],
       [ True,  True,  True,  True, False, False],
       [ True,  True,  True,  True,  True, False]])
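To make the broadcasting step explicit, here is a self-contained sketch that rebuilds the matrix and checks it against an explicit double loop. The name expected is introduced here for illustration. Note that the element-wise & replaces the scalar and from my_function, since and does not work on whole arrays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6], "col2": [6, 5, 4, 3, 2, 1]})
a = df["col1"].to_numpy()
b = df["col2"].to_numpy()

# a[:, None] has shape (N, 1); comparing it with a of shape (N,)
# broadcasts to (N, N), where entry [i, j] is a[i] > a[j].
matrix = (a[:, None] > a) & (b[:, None] < b)

# Reference result from an explicit double loop over the same columns
expected = np.array([[x1 > y1 and x2 < y2 for y1, y2 in zip(a, b)]
                     for x1, x2 in zip(a, b)])

print(np.array_equal(matrix, expected))  # True
```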

speed comparison:

%%timeit
f_vfunc(arr[:, None], arr)
37.2 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
(a[:,None]>a)&(b[:,None]<b)
2.44 µs ± 84.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
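
The timings above are for six rows, where fixed call overhead dominates. A rough sketch for repeating the comparison on a larger input follows. The size n, the seed, and the random column contents are assumptions made here for illustration, and the np.vectorize call uses the four-argument form of my_function rather than the record array. No timings are quoted because they depend on the machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for repeatability
n = 300                             # hypothetical input size
a = rng.integers(0, 1000, size=n)   # stand-in for df["col1"]
b = rng.integers(0, 1000, size=n)   # stand-in for df["col2"]

def my_function(x1, x2, y1, y2):
    return x1 > y1 and x2 < y2

f_vfunc = np.vectorize(my_function)

def with_vectorize():
    return f_vfunc(a[:, None], b[:, None], a, b)

def with_broadcasting():
    return (a[:, None] > a) & (b[:, None] < b)

# Both approaches must agree before their timings are compared
assert np.array_equal(with_vectorize(), with_broadcasting())

repeats = 5
t_vec = timeit.timeit(with_vectorize, number=repeats)
t_bc = timeit.timeit(with_broadcasting, number=repeats)
print(f"np.vectorize: {t_vec / repeats:.6f} s per call")
print(f"broadcasting: {t_bc / repeats:.6f} s per call")
```

Either way the result is an N×N boolean array, so it takes about N² bytes of memory; for very large N that, rather than the comparison itself, may become the limit.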