I am trying to calculate the covariance matrix of two vectors a and b for which i am using numpys cov implementation R = np.cov(a,b). I got a little confussed when i noticed that np.cov(a,b)[0,0] != np.var(a) however i was able to find that, that had to do with biased vs unbiased estimators and is controlled by ddof.
However, that isn't the end of it. Why is R[0,1] != R[0,0]**0.5 * R[1,1]**0.5. Following my understanding and the definition of the covariance matrix on wikipedia https://en.wikipedia.org/wiki/Covariance_matrix
R[0,1] = R[1,0] = std(a) * std(b)
R[0,1] = R[1,0] = var(a)**0.5 * var(b)**0.5
R[0,1] = R[1,0] = R[0,0]**0.5 * R[1,1]**0.5
Where am i mistaken?
import numpy as np
rng = np.random.default_rng(seed=43)
a = rng.random((1,3))
b = rng.random((1,3))
R = np.cov(a,b,ddof=1)
print(R)
print('var a: ' str(np.var(a,ddof=1)))
print('var b: ' str(np.var(b,ddof=1)))
print('cov a,b: ' str(np.var(a,ddof=1)**0.5*np.var(b,ddof=1)**0.5))
print('cov a,b: ' str(R[0,0]**0.5*R[1,1]**0.5))
print('cov a,b: ' str(np.std(a,ddof=1)*np.std(b,ddof=1)))
I apologist in advance for any spelling or stack overflow ethnic mistakes on my part. Any help is appreciated.
CodePudding user response:
I'm not sure where the formula var(a)**0.5 * var(b)**0.5
is coming from, but it's not the formula I've seen for cross-covariance. I've seen this as the expected value of the products of x - mean_of_x
and y - mean_of_y
.
In a loop style (for clarity) this might look like:
a_mean = np.mean(a)
b_mean = np.mean(b)
s = 0
n = len(a[0])
for i, _ in enumerate(a[0]):
s = (a[0][i] - a_mean) * (b[0][i] - b_mean)
s / (n-1)
# 0.09175517729176114
In Numpy you can also do:
a_mean = np.mean(a)
b_mean = np.mean(b)
(a - a_mean) @ (b - b_mean).T / (n-1)
# array([[0.09175518]])
This corresponds to the values you get in the corners.
If you want to divide by n
rather than n-2
, you can pass in the bias
arg to cov()
np.cov(a, b, bias=True)
# array([[0.08562558, 0.06117012],
[0.06117012, 0.06361328]])
The corners here are the results you will get with the above code by dividing the results by 3
(n)
rather than 2
(n-1)