I have two NumPy arrays: x is a 2-D array with 536 rows and 9 feature columns, and y is a 1-D array with 536 elements, as shown below.
>>> x.shape
(536, 9)
>>> y.shape
(536,)
I am trying to find the correlation coefficients between x and y.
>>> np.corrcoef(x,y)
Here's the error I am seeing.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 5, in corrcoef
File "/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py", line 2683, in corrcoef
c = cov(x, y, rowvar, dtype=dtype)
File "<__array_function__ internals>", line 5, in cov
File "/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py", line 2477, in cov
X = np.concatenate((X, y), axis=0)
File "<__array_function__ internals>", line 5, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 9 and the array at index 1 has size 536
I can't figure out what shapes these two arrays should have.
CodePudding user response:
To get two arrays of the same shape, you can use numpy.broadcast_to:
y1 = np.broadcast_to(y[:, None], x.shape)
#alternative solution
y1 = np.repeat(y[:, None], x.shape[1],1)
print (np.corrcoef(x,y1))
Sample:
np.random.seed(1609)
x = np.random.random((5,3))
y = np.random.random((5))
print (x)
[[3.28341891e-01 9.10078695e-01 6.25727436e-01]
[9.52999512e-01 3.54590864e-02 4.19920842e-01]
[2.46229526e-02 3.60903454e-01 9.96143110e-01]
[8.87331773e-01 8.34857105e-04 6.36058323e-01]
[2.91490345e-01 5.01580494e-01 3.23455182e-01]]
print (y)
[0.60437973 0.74687751 0.68819022 0.19104546 0.68420365]
y1 = np.broadcast_to(y[:, None], x.shape)
print(y1)
[[0.60437973 0.60437973 0.60437973]
[0.74687751 0.74687751 0.74687751]
[0.68819022 0.68819022 0.68819022]
[0.19104546 0.19104546 0.19104546]
[0.68420365 0.68420365 0.68420365]]
print (np.corrcoef(x,y1))
[[ 1.00000000e+00 -9.96776982e-01 3.52933703e-01 -9.66910777e-01
9.23044315e-01 nan -3.11624591e-16 nan
nan 3.11624591e-16]
[-9.96776982e-01 1.00000000e+00 -4.26856227e-01 9.43328464e-01
-8.89208247e-01 nan 0.00000000e+00 nan
nan 0.00000000e+00]
[ 3.52933703e-01 -4.26856227e-01 1.00000000e+00 -1.02557684e-01
-3.41645099e-02 nan 9.18680698e-17 nan
nan -9.18680698e-17]
[-9.66910777e-01 9.43328464e-01 -1.02557684e-01 1.00000000e+00
-9.90642527e-01 nan -9.92012638e-17 nan
nan 9.92012638e-17]
[ 9.23044315e-01 -8.89208247e-01 -3.41645099e-02 -9.90642527e-01
1.00000000e+00 nan 6.00580887e-16 nan
nan -6.00580887e-16]
[ nan nan nan nan
nan nan nan nan
nan nan]
[-3.11624591e-16 0.00000000e+00 9.18680698e-17 -9.92012638e-17
6.00580887e-16 nan 1.00000000e+00 nan
nan -1.00000000e+00]
[ nan nan nan nan
nan nan nan nan
nan nan]
[ nan nan nan nan
nan nan nan nan
nan nan]
[ 3.11624591e-16 0.00000000e+00 -9.18680698e-17 9.92012638e-17
-6.00580887e-16 nan -1.00000000e+00 nan
nan 1.00000000e+00]]
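A note on reading this output: by default np.corrcoef uses rowvar=True, so each row is treated as a variable and each column as an observation. Every row of y1 is a single value repeated, so its variance is zero and the correlation becomes 0/0, which is nan. The few ±1 entries mixed in with the nan values are most likely floating-point rounding rather than real correlations. Here is a minimal sketch with made-up data (the values of y are chosen as small integers only so that the arithmetic is exact; they are not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 3))                      # illustrative data, not the asker's
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # exact values so row means are exact

# Repeat y across the columns so that y1 has the same shape as x.
y1 = np.broadcast_to(y[:, None], x.shape)

# Each row of y1 is constant, i.e. a zero-variance "variable".
rows_constant = (y1 == y1[:, :1]).all(axis=1)

# Silence the divide-by-zero warnings that the 0/0 computation triggers.
with np.errstate(all="ignore"):
    c = np.corrcoef(x, y1)                  # rowvar=True: 5 + 5 row-variables

print(c.shape)
print(rows_constant)
print(np.isnan(c[5:, :]).all())
```

The result is a 10 x 10 matrix whose last five rows (the ones that belong to y1) contain no usable correlations, so this call does not give the per-feature correlation with y.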
CodePudding user response:
@Jezrael pretty much answered my question. An alternative approach is to create an array of zeros with 9 entries and fill it with the correlation coefficient of each feature in x with y, computing them one column at a time.
# number of features (columns of x)
n_features = x.shape[1]
coeffs = np.zeros(n_features)
for feature in range(n_features):
    # corrcoef returns a 2-D array of shape (2, 2) with 1s along the diagonal
    # and the coefficient at [0, 1] and [1, 0]
    coeffs[feature] = np.corrcoef(x[:, feature], y)[0, 1]
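The same numbers can probably be obtained without a loop by passing rowvar=False, which tells np.corrcoef to treat the columns of x as variables and to append y as one more variable. The sketch below uses random stand-in data with the shapes from the question; the names loop_coeffs, full and vec_coeffs are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((536, 9))   # stand-in for the real feature matrix
y = rng.random(536)        # stand-in for the real target

# Loop version: one corrcoef call per column of x.
loop_coeffs = np.array(
    [np.corrcoef(x[:, j], y)[0, 1] for j in range(x.shape[1])]
)

# Single call: columns of x are variables, y becomes the 10th variable.
full = np.corrcoef(x, y, rowvar=False)   # shape (10, 10)

# The last row holds y against everything; drop the final entry (y with itself).
vec_coeffs = full[-1, :-1]

print(np.allclose(loop_coeffs, vec_coeffs))
```

This also computes the feature-to-feature correlations, which is extra work here but negligible for 9 columns.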