I have an array X of shape m x n, and for each of the m rows I want to compute its correlation with a vector y of length n.
In Matlab this would be possible with the corr function corr(X,y). In Python, however, this does not seem possible with the np.corrcoef function:
import numpy as np
X = np.random.random([1000, 10])
y = np.random.random(10)
np.corrcoef(X,y).shape
This results in shape (1001, 1001), because np.corrcoef computes the full correlation matrix of every row of X and y against each other. That fails when X has many rows. In my case, I get this error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 5.93 TiB for an array with shape (902630, 902630) and data type float64
This is because X.shape[0] is 902630.
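As a quick sanity check on that number, here is a small sketch that reproduces the size from the error message. It assumes 8 bytes per float64 element and 2**40 bytes per TiB; the variable names are only for illustration.

```python
# Size of a dense (902630, 902630) float64 matrix, as in the error message.
rows = 902630
bytes_per_float64 = 8  # assumption: standard 8-byte float64

total_bytes = rows * rows * bytes_per_float64
total_tib = total_bytes / 2**40  # 1 TiB = 2**40 bytes

print(f"{total_tib:.2f} TiB")
```

Only one column of that matrix (the correlations with y) is actually needed, which is why the full matrix is wasteful here.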
My question is: how can I get only the row-wise correlations with the vector, i.e. a result of shape (1000,) containing all correlations?
Of course this could be done via a list comprehension:
np.array([np.corrcoef(X[i, :], y)[0,1] for i in range(X.shape[0])])
Currently I am therefore using numba with a for loop over the >900000 rows. But I think there should be a much more efficient matrix operation for this problem.
EDIT: Pandas also provides a method for this problem, the corrwith function:
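import pandas as pd
X_df = pd.DataFrame(X)
y_s = pd.Series(y)
# axis=1 correlates each row of X_df with y_s (the default axis=0 works column by column)
X_df.corrwith(y_s, axis=1)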
The implementation supports different correlation types, but it does not seem to be implemented as a matrix operation and is therefore really slow. There is probably a more efficient implementation.
CodePudding user response:
This should compute the correlation coefficient of each row with a given y in a vectorized manner.
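import numpy as np
X = np.random.random([1000, 10])
y = np.random.random(10)
n = len(y)
num = n * np.sum(X * y[None, :], axis=-1) - np.sum(X, axis=-1) * np.sum(y)  # n*sum(xy) - sum(x)*sum(y)
den = np.sqrt((n * np.sum(X**2, axis=-1) - np.sum(X, axis=-1) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2))
r = num / den  # one Pearson coefficient per row, shape (1000,)
print(r[0], np.corrcoef(X[0], y)[0, 1])  # [0, 1] picks the single coefficient out of the 2x2 matrix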
0.4243951, 0.4243951
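A possible variant, sketched here as an alternative rather than as part of the answer above: subtract the means first and then use dot products. This is the same Pearson coefficient, but it avoids differences of large sums, which may be kinder to floating-point precision. The helper name rowwise_corr is hypothetical, and the sketch assumes no row of X (and not y) is constant, since a constant input gives a zero denominator and a nan result.

```python
import numpy as np

def rowwise_corr(X, y):
    """Pearson correlation of each row of X (m x n) with y (length n).

    Hypothetical helper for illustration; returns an array of shape (m,).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)  # center every row of X
    yc = y - y.mean()                       # center y
    num = Xc @ yc                           # row-wise covariance terms, shape (m,)
    # Row-wise sum of squares of Xc times the sum of squares of yc
    den = np.sqrt(np.einsum("ij,ij->i", Xc, Xc) * (yc @ yc))
    return num / den

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.random(10)
r = rowwise_corr(X, y)
print(r.shape)  # (1000,)
```

The only large temporary is Xc, which has the same shape as X, so memory grows with m * n rather than m * m. If even that copy is too much for >900000 rows, the same function could be applied to slices of X and the results concatenated.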