I'm trying to improve a simple algorithm for obtaining the Pearson correlation coefficient from two arrays, X(m, n) and Y(n), returning another array R of dimension (m).
In this case, I want to know how each row of X behaves with respect to the values of Y. A sample (working) code is presented below:
import numpy as np
from scipy.stats import pearsonr
np.random.seed(1)
m, n = 10, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)
r = np.empty(m)
for i in range(m):
    r[i] = pearsonr(x[i], y)[0]
For this particular case, I get: r = array([0.95272843, -0.69134753, 0.36419159, 0.27467137, 0.76887201, 0.08823868, -0.72608421, -0.01224453, 0.58375626, 0.87442889])
For small values of m (near 10k) this runs pretty fast, but I'm starting to work with m ~ 30k, and it is taking much longer than I expected. I'm aware I could use multiprocessing/multi-threading, but I believe there's a (better) Pythonic way of doing this.
I tried to use pearsonr(x, np.ones((m, n))*y), but it returns only (nan, nan).
CodePudding user response:
pearsonr only supports 1D arrays internally. Moreover, it computes the p-value, which is not used here, so it would be more efficient not to compute it if possible. Additionally, each call recomputes the mean and norm of the y vector, and the loop does not make efficient use of vectorized Numpy operations. This is why the computation is a bit slow. You can check this in the Scipy source code.
One way to compute this faster is to write your own custom implementation based on the one in Scipy:
def multi_pearsonr(x, y):
    # Center each row of x and the y vector
    xmean = x.mean(axis=1)
    ymean = y.mean()
    xm = x - xmean[:, None]
    ym = y - ymean
    # Normalize by the L2 norms and take the row-wise dot products with ym
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    # Clip to [-1, 1] to guard against tiny floating-point overshoots
    return np.clip(np.dot(xm/normxm[:, None], ym/normym), -1.0, 1.0)
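As a quick sanity check (not part of the original answer), you can compare the vectorized result against the per-row loop from the question, reusing the same x and y:
import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)

# Per-row reference using scipy, then the vectorized version
r_loop = np.array([pearsonr(x[i], y)[0] for i in range(m)])
r_vec = multi_pearsonr(x, y)
print(np.allclose(r_loop, r_vec))  # should print True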
It is 450 times faster on my machine for m = 10_000.
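The exact speedup depends on your machine, but you can reproduce the comparison with something like the following (an illustrative sketch, not a rigorous benchmark):
import timeit
import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10_000, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)

# One pass of the per-row loop versus the vectorized function
# (the vectorized call is timed 10 times and averaged because it is very fast)
t_loop = timeit.timeit(lambda: [pearsonr(x[i], y)[0] for i in range(m)], number=1)
t_vec = timeit.timeit(lambda: multi_pearsonr(x, y), number=10) / 10
print(t_loop / t_vec)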
Note that I did not keep the input checks from the Scipy code, but it may be a good idea to keep them if your input is not guaranteed to be statistically safe (i.e. well formed for the computation of the Pearson correlation).
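If you do want some safety checks, a minimal sketch (my own additions, not Scipy's exact logic) could reject constant or too-short inputs before they turn into nan through a division by zero:
import numpy as np

def multi_pearsonr_checked(x, y):
    # Hypothetical variant with basic validation added; not from the original answer
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or y.ndim != 1 or x.shape[1] != y.shape[0]:
        raise ValueError("x must have shape (m, n) and y shape (n,)")
    if y.shape[0] < 2:
        raise ValueError("at least 2 samples are needed to compute a correlation")
    xm = x - x.mean(axis=1)[:, None]
    ym = y - y.mean()
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    if normym == 0 or np.any(normxm == 0):
        raise ValueError("constant input: the Pearson correlation is undefined")
    return np.clip(np.dot(xm/normxm[:, None], ym/normym), -1.0, 1.0)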