np.corrcoef takes two arguments, and they must have the same dimensions. In my case datax is an n by n array and datay is an n by 1 array. I want to vectorize this operation so I don't have to use loops to get my results. I think np.vectorize is my answer, but nothing I have tried gives me a result. Here is my latest attempt:
def f(datax, datay):
    return np.corrcoef(datax, datay)

result = np.vectorize(f, dtype=np.ndarray)
CodePudding user response:
np.vectorize() is not really for performance; most NumPy operations are vectorized anyway.
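To see why (a toy example of my own): np.vectorize just broadcasts its inputs and then calls the Python function once per element, so it is essentially a for loop in disguise:

import numpy as np

# np.vectorize wraps a Python-level loop around a scalar function;
# broadcasting is handled for you, but every element still costs a Python call.
add = np.vectorize(lambda a, b: a + b)
print(add(np.arange(3), 10))  # [10 11 12]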
I assume you're trying to calculate correlations to y column-wise.
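For reproducibility, here is the kind of test data I have in mind; the exact shapes are my guess from the timings below (roughly 400 rows, 10 columns):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((400, 10))  # ~400 rows, 10 feature columns (assumed shapes)
y_train = rng.standard_normal(400)        # target vector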
Let's test it out (I used a small DataFrame of roughly 400 rows). A naive for loop is indeed relatively slow:
%%timeit
[np.corrcoef(X_train[:,i], y_train)[0,1] for i in range(10)]
459 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
A 'proper' vectorized version would look something like this:
def f(datax, datay):
    return np.corrcoef(datax, datay, rowvar=False)

result = np.vectorize(f, signature="(m,n),(m)->(k,k)")
%%timeit
result(X_train, y_train)[-1,0:X_train[0].size]
121 µs ± 84.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Much better! But alas, np.corrcoef() itself is already better vectorized:
%%timeit
np.corrcoef(X_train, y_train, rowvar=False)[-1,0:X_train[0].size]
64.7 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
That's basically twice as fast.
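As a sanity check (using the assumed X_train/y_train from above), the one-liner matches the per-column loop; with rowvar=False, y is stacked as the last variable, so the last row of the correlation matrix holds corr(y, column_j), and slicing drops the trailing corr(y, y) == 1 entry:

loop = np.array([np.corrcoef(X_train[:, i], y_train)[0, 1]
                 for i in range(X_train.shape[1])])
onecall = np.corrcoef(X_train, y_train, rowvar=False)[-1, :X_train.shape[1]]
assert np.allclose(loop, onecall)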
If you really want to speed it up, however, einsum comes to mind (adapted from this question):
def columnwisecorrcoef(O, P):
    # Correlation of each column of O with the vector P.
    n = np.double(P.size)
    DO = O - (np.einsum('ij->j', O) / n)  # de-mean each column of O
    PO = P - (np.einsum('i->', P) / n)    # de-mean P
    tmp = np.einsum('ij,ij->j', DO, DO)   # per-column sum of squares of O
    tmp *= np.einsum('i,i->', PO, PO)     # times the sum of squares of P
    return np.dot(PO, DO) / np.sqrt(tmp)  # covariances / product of norms
%%timeit
columnwisecorrcoef(X_train, y_train)
24.8 µs ± 45.1 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
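If you want to convince yourself the einsum version is correct (again with the assumed data from above), it should agree with np.corrcoef up to floating-point error:

ein = columnwisecorrcoef(X_train, y_train)
ref = np.corrcoef(X_train, y_train, rowvar=False)[-1, :X_train.shape[1]]
assert np.allclose(ein, ref)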