I have two large datasets A (100,000×10,000) and B (100,000×250), and I want to calculate the Pearson correlation between them. The scipy.spatial.distance.cdist function with metric='correlation' does exactly what I want:
corr = 1 - cdist(A.T,B.T,'correlation')
But it takes about 5 times as long as numpy.corrcoef, even though with corrcoef I have to compute (and then discard) all the correlations within each dataset:

corr = np.corrcoef(np.hstack((A, B)).T)[len(A.T):, :len(A.T)].T
Is there a better way to do this fast?
CodePudding user response:
You can try this implementation, I don't have enough memory to test with your input size.
It looks like cdist uses Python loops internally for this metric, which would explain the slowdown.
import numpy as np

def pairwise_correlation(A, B):
    # Center each column
    am = A - np.mean(A, axis=0, keepdims=True)
    bm = B - np.mean(B, axis=0, keepdims=True)
    # Cross-covariance divided by the product of the column norms
    return am.T @ bm / (np.sqrt(np.sum(am**2, axis=0, keepdims=True)).T
                        * np.sqrt(np.sum(bm**2, axis=0, keepdims=True)))
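As a quick sanity check (using small random matrices rather than the sizes in the question), the result matches np.corrcoef computed pair by pair:

```python
import numpy as np

def pairwise_correlation(A, B):
    # Center each column
    am = A - np.mean(A, axis=0, keepdims=True)
    bm = B - np.mean(B, axis=0, keepdims=True)
    # Cross-covariance divided by the product of the column norms
    return am.T @ bm / (np.sqrt(np.sum(am**2, axis=0, keepdims=True)).T
                        * np.sqrt(np.sum(bm**2, axis=0, keepdims=True)))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))   # 100 samples, 5 variables
B = rng.standard_normal((100, 3))   # 100 samples, 3 variables

corr = pairwise_correlation(A, B)   # shape (5, 3)

# Reference: correlation of each column pair via np.corrcoef
expected = np.array([[np.corrcoef(A[:, i], B[:, j])[0, 1]
                      for j in range(B.shape[1])]
                     for i in range(A.shape[1])])

print(np.allclose(corr, expected))  # True
```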