I'm trying to improve a simple algorithm for obtaining the Pearson correlation coefficient from two arrays, X(m, n) and Y(n), returning another array R of dimension (m).
In this case, I want to know how each row of X behaves with respect to the values of Y. A sample (working) code is presented below:
import numpy as np
from scipy.stats import pearsonr
np.random.seed(1)
m, n = 10, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)
r = np.empty(m)
for i in range(m):
    r[i] = pearsonr(x[i], y)[0]
For this particular case, I get: r = array([0.95272843, -0.69134753, 0.36419159, 0.27467137, 0.76887201, 0.08823868, -0.72608421, -0.01224453, 0.58375626, 0.87442889])
For small values of m (near 10k) this runs pretty fast, but I'm starting to work with m ~ 30k, and it is taking much longer than I expected. I'm aware I could use multiprocessing/multi-threading, but I believe there's a (better) Pythonic way of doing this.
I tried to use pearsonr(x, np.ones((m, n))*y), but it returns only (nan, nan).
CodePudding user response:
pearsonr only supports 1D arrays internally. Moreover, it computes the p-value, which is not used here, so it would be more efficient not to compute it if possible. Additionally, each call recomputes the mean and norm of the y vector, and the loop does not make efficient use of vectorized Numpy operations. This is why the computation is a bit slow. You can check this in the Scipy source code.
One way to compute this faster is to write your own custom implementation based on the one in Scipy:
def multi_pearsonr(x, y):
    # Center each row of x and the y vector
    xmean = x.mean(axis=1)
    ymean = y.mean()
    xm = x - xmean[:, None]
    ym = y - ymean
    # Normalize by the L2 norms and take the row-wise dot products with ym
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    # Clip to [-1, 1] to guard against tiny floating-point overshoots
    return np.clip(np.dot(xm/normxm[:, None], ym/normym), -1.0, 1.0)
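As a quick sanity check (not part of the original answer), you can compare the vectorized result against the per-row loop from the question, reusing the same x and y:
import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)

# Per-row reference using scipy, then the vectorized version
r_loop = np.array([pearsonr(x[i], y)[0] for i in range(m)])
r_vec = multi_pearsonr(x, y)
print(np.allclose(r_loop, r_vec))  # should print True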
It is 450 times faster on my machine for m = 10_000.
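The exact speedup depends on your machine, but you can reproduce the comparison with something like the following (an illustrative sketch, not a rigorous benchmark):
import timeit
import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10_000, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)

# One pass of the per-row loop versus the vectorized function
# (the vectorized call is timed 10 times and averaged because it is very fast)
t_loop = timeit.timeit(lambda: [pearsonr(x[i], y)[0] for i in range(m)], number=1)
t_vec = timeit.timeit(lambda: multi_pearsonr(x, y), number=10) / 10
print(t_loop / t_vec)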
Note that I did not keep the input checks from the Scipy code, but it may be a good idea to keep them if your input is not guaranteed to be statistically safe (i.e. well formed for the computation of the Pearson correlation).
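If you do want some safety checks, a minimal sketch (my own additions, not Scipy's exact logic) could reject constant or too-short inputs before they turn into nan through a division by zero:
import numpy as np

def multi_pearsonr_checked(x, y):
    # Hypothetical variant with basic validation added; not from the original answer
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or y.ndim != 1 or x.shape[1] != y.shape[0]:
        raise ValueError("x must have shape (m, n) and y shape (n,)")
    if y.shape[0] < 2:
        raise ValueError("at least 2 samples are needed to compute a correlation")
    xm = x - x.mean(axis=1)[:, None]
    ym = y - y.mean()
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    if normym == 0 or np.any(normxm == 0):
        raise ValueError("constant input: the Pearson correlation is undefined")
    return np.clip(np.dot(xm/normxm[:, None], ym/normym), -1.0, 1.0)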