Does numpy.corrcoef calculate correlation within matrices when two matrices are provided (intra-corr

Time:07-27

Here I'm calculating the Pearson correlation such that I'm accounting for every comparison.

import numpy as np
import pandas as pd

x = pd.DataFrame({'a':[3,6,4,7,9],'b':[6,2,4,1,5],'c':[7,9,1,2,9]},index=['aa','bb','cc','dd','ee']).T
y = pd.DataFrame({'A':[9,4,1,3,5],'B':[9,8,9,5,7],'C':[1,1,3,1,2]},index=['aa','bb','cc','dd','ee']).T

table = pd.DataFrame(columns=['Correlation Coeff'])
# Correlate every row of x with every row of y.
for i in range(0, len(x)):
    for j in range(0, len(y)):
        xf = list(x.iloc[i])
        yf = list(y.iloc[j])
        # corrcoef returns a 2x2 matrix; [0, 1] is the x-vs-y coefficient.
        n = np.corrcoef(xf,yf)[0,1]
        name = x.index[i] + '|' + y.index[j]
        table.at[name, 'Correlation Coeff'] = n
table

This is the result:

       Correlation Coeff
a|A   -0.232973
a|B   -0.713392
a|C   -0.046829
b|A    0.601487
b|B    0.662849
b|C    0.29654
c|A    0.608993
c|B    0.16311
c|C   -0.421398

Now when I pass these tables directly to numpy's function and remove the duplicate values and the ones, it looks like this.

x = pd.DataFrame({'a':[3,6,4,7,9],'b':[6,2,4,1,5],'c':[7,9,1,2,9]},index=['aa','bb','cc','dd','ee']).T.to_numpy()
y = pd.DataFrame({'A':[9,4,1,3,5],'B':[9,8,9,5,7],'C':[1,1,3,1,2]},index=['aa','bb','cc','dd','ee']).T.to_numpy()
n = np.corrcoef(x,y)
n = n.tolist()
n = [element for sub in n for element in sub]
# Rounding to ensure no duplicates are being picked up.
rnd = [round(num, 13) for num in n] 
X = [i for i in rnd if i != 1]
X =  list(dict.fromkeys(X))
X

[-0.3231828652987, 0.3157400783243, -0.232972779074, -0.7133922984085, -0.0468292905791, 0.3196502842345, 0.6014868821052, 0.6628489803599, 0.2965401263095, 0.608993434846, 0.1631095635753, -0.4213976904463, 0.2417468892076, -0.5841782301194, 0.3674842076296]

There are 6 extra values not accounted for (-0.3231828652987, 0.3157400783243, 0.3196502842345, 0.2417468892076, -0.5841782301194 and 0.3674842076296). I'm assuming that they are correlation values calculated within a single matrix and, if so, why? Is there a way to use this function without generating these additional values?

CodePudding user response:

You are right in assuming that those are the correlations from variables within x and y, and so far as I can tell there is no way to turn this behaviour off.

You can see that this is true by looking at the implementation of numpy.corrcoef. As expected, most of the heavy lifting is being done by a separate function that computes covariance - if you look at the implementation of numpy.cov, particularly line 2639, you will see that, if you supply an additional y argument, this is simply concatenated onto x before computing the covariance matrix.
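To illustrate the concatenation described above, here is a small check using the data from the question. It compares np.corrcoef(x, y) with np.corrcoef applied to the two arrays stacked by hand, then slices out the block that pairs rows of x with rows of y. The variable names (full, stacked, cross) are my own, and this is only a sketch of how the output is laid out, not a way to avoid computing the extra values.

```python
import numpy as np

# Rows are the variables a, b, c and A, B, C from the question.
x = np.array([[3, 6, 4, 7, 9],
              [6, 2, 4, 1, 5],
              [7, 9, 1, 2, 9]], dtype=float)
y = np.array([[9, 4, 1, 3, 5],
              [9, 8, 9, 5, 7],
              [1, 1, 3, 1, 2]], dtype=float)

# Passing y gives the same result as stacking x and y into one array first.
full = np.corrcoef(x, y)                  # shape (6, 6)
stacked = np.corrcoef(np.vstack([x, y]))  # shape (6, 6)
same = np.allclose(full, stacked)

# The top-left and bottom-right 3x3 blocks are the within-x and within-y
# correlations; the top-right block holds the x-vs-y correlations only.
n_x = x.shape[0]
cross = full[:n_x, n_x:]                  # shape (3, 3)
print(same)
print(cross)
```

So the six unexpected values are the off-diagonal entries of the within-x and within-y blocks, and slicing the top-right block afterwards discards them.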

If necessary, it wouldn't be too hard to implement your own version of corrcoef that works how you want. Note that you can do this in pure numpy, which in most cases will be faster than the iterative approach from the example code above.
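A minimal sketch of such a custom version, assuming both inputs are 2-D with one variable per row and the same number of observations per row. The function name cross_corrcoef is hypothetical (it is not part of numpy), and the sketch does not handle constant rows, which would cause a division by zero.

```python
import numpy as np

def cross_corrcoef(x, y):
    """Pearson correlation of every row of x with every row of y.

    Hypothetical helper, not a numpy function. Returns an array of
    shape (x.shape[0], y.shape[0]).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre each row on its own mean.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    # Numerator: sums of products of deviations for every row pair.
    num = xc @ yc.T
    # Denominator: product of the row norms of the centred data.
    x_norm = np.sqrt((xc ** 2).sum(axis=1))
    y_norm = np.sqrt((yc ** 2).sum(axis=1))
    return num / np.outer(x_norm, y_norm)

x = np.array([[3, 6, 4, 7, 9],
              [6, 2, 4, 1, 5],
              [7, 9, 1, 2, 9]], dtype=float)
y = np.array([[9, 4, 1, 3, 5],
              [9, 8, 9, 5, 7],
              [1, 1, 3, 1, 2]], dtype=float)

result = cross_corrcoef(x, y)
print(result)
```

Row i, column j of the result corresponds to the pair x[i] | y[j], so it should match the nine values in the question's table without computing any within-x or within-y correlations.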
