I am trying to create a random vector whose components are uncorrelated standard normal variables with zero mean and unit variance, using np.random.normal.
Are these random variables actually uncorrelated? Because when I try to find the covariance:
import numpy as np
print(np.random.normal(0,1,size=17))
print(np.cov(np.random.normal(0,1,size=17)))
Python is not giving me a covariance of exactly zero; my result is close to 0.9.
CodePudding user response:
Those are independent and identically distributed variates drawn from a standard normal distribution, so you should expect them to be uncorrelated. Note that the mean and standard deviation default to 0 and 1, so np.random.normal(size=n) gives you the same thing as np.random.normal(0, 1, size=n).
You should not expect to get exactly zero from any correlation test – this is a finite random sample drawn from that distribution, and by chance the values can even look mildly autocorrelated.
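To get a feel for how big "by chance" can be, here is a quick illustrative sketch (the 10,000 repetitions are arbitrary) that draws pairs of genuinely independent samples of your size and looks at the spread of their sample correlations:
import numpy as np

n = 17  # same sample size as in the question
# correlation between pairs of genuinely independent samples
corrs = [np.corrcoef(np.random.normal(size=n), np.random.normal(size=n))[0, 1]
         for _ in range(10_000)]
# the spread is roughly 1/sqrt(n), i.e. about 0.24 for n=17
print(np.std(corrs))
With only 17 values, sample correlations of around ±0.2 are completely unremarkable.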
Update: I've just had a play with Numpy's cov function and think you're using it wrong. At the moment you're passing it a single 1-D array, so you're effectively asking for the covariance of the sample with itself, which is just its sample variance – hence the value close to 1 (your 0.9). I think corrcoef might be a better method to use, as it's normalised to [-1, 1], which makes the values easier to interpret.
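To make that concrete, here is a minimal sketch (the variable name x is just illustrative) showing that np.cov applied to a single 1-D array is nothing more than its sample variance:
import numpy as np

x = np.random.normal(0, 1, size=17)
# for a 1-D input, np.cov returns the sample variance (normalised by n-1),
# which for standard normal data is close to 1 -- hence a value like 0.9
print(np.cov(x))
print(np.var(x, ddof=1))  # the same number
The check below instead uses corrcoef, once against a lagged copy of the sample and once against the index: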
import numpy as np
import scipy.stats as sps

n = 100
t = 0.95  # confidence level

for _ in range(20):
    obs = np.random.normal(size=n)
    # calculate lag-1 autocorrelation
    s1 = np.corrcoef(obs[1:], obs[:-1])[0, 1]
    # calculate correlation with the index
    s2 = np.corrcoef(obs, np.arange(n))[0, 1]
    print(f" {s1: .3f} {s2: .3f}")

# 95% of the values above should be smaller in magnitude than this
ts = sps.norm.ppf((1 - t) / 2) / np.sqrt(n)
print(f"\n{t:.0%} cutoff for n={n} is $abs(s) < {np.abs(ts):.3f}$")
If I run this I get the following output:
0.118 -0.152
-0.005 -0.145
-0.177 0.143
0.091 0.186
-0.183 0.083
-0.009 0.027
-0.124 -0.083
0.111 0.167
-0.207 -0.093
-0.025 -0.191
-0.117 -0.001
-0.019 -0.026
-0.177 -0.057
0.084 -0.123
-0.144 0.088
-0.018 0.245
0.117 0.002
0.069 0.084
0.025 0.043
-0.101 -0.112
95% cutoff for n=100 is $abs(s) < 0.196$
Note that two of the values above are larger in magnitude than the cutoff, but that's expected. Re-running with n=100_000 gives a much tighter bound, but you should still see roughly the same failure rate.
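To see how the cutoff tightens, a small sketch using the same formula (the sample sizes below are chosen arbitrarily):
import numpy as np
import scipy.stats as sps

# the 95% cutoff shrinks like 1/sqrt(n):
# roughly 0.196, 0.0196 and 0.0062 for the sizes below
for n in (100, 10_000, 100_000):
    cutoff = np.abs(sps.norm.ppf((1 - 0.95) / 2)) / np.sqrt(n)
    print(f"n={n:>7}: abs(s) < {cutoff:.4f}")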
Your update says that your n=17, so you might want to switch to Student's t-distribution for the cutoff. Something like:
ts = sps.t(n - 2).ppf((1 - 0.95) / 2) / np.sqrt(n)
i.e. just changing sps.norm to sps.t(n - 2). Note that changing from Gaussian to t only widens the cutoff for n=17 a bit, from about 0.475 to 0.517.
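A quick self-contained sketch to check those two numbers (nothing here beyond the scipy.stats calls already used above):
import numpy as np
import scipy.stats as sps

n = 17
norm_cut = np.abs(sps.norm.ppf((1 - 0.95) / 2)) / np.sqrt(n)
t_cut = np.abs(sps.t(n - 2).ppf((1 - 0.95) / 2)) / np.sqrt(n)
print(f"normal: {norm_cut:.3f}  t: {t_cut:.3f}")  # roughly 0.475 and 0.517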
CodePudding user response:
I agree with Sam Mason that you're calculating correlation incorrectly. You actually want serial correlation, a.k.a. autocorrelation, which measures the correlation between pairs of values at different lags in a sequence. Also note that in general correlation and covariance are not the same thing. Correlation at lag k is covariance at lag k scaled by the covariance at lag 0 (which is generally thought of as the variance in non-series contexts). That scaling results in a unitless value between -1 and 1. Covariance, on the other hand, is neither unitless nor bounded in magnitude.
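As a minimal sketch of that relationship at lag 1 (the series x and its length here are purely illustrative):
import numpy as np

n = 1000
x = np.random.normal(size=n)
xc = x - x.mean()                    # centre the series

cov0 = np.sum(xc * xc) / n           # lag-0 autocovariance, i.e. the variance
cov1 = np.sum(xc[1:] * xc[:-1]) / n  # lag-1 autocovariance
print(cov1 / cov0)                   # lag-1 autocorrelation: unitless, in [-1, 1]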
Since you're working with sample data you need to be aware that the sample mean and sample variance are not going to be the same as the theoretical mean and variance of the population. To get good estimates of autocorrelation you need to "standardize" the data by subtracting the mean and dividing by the standard deviation. If you don't do the standardization, it's possible to get correlation estimates which are greater than 1.
Here's code which will calculate the lag k autocorrelations for k from 0 to min(sample size, 10). The lag 0 correlation is always 1 since each observation is perfectly correlated with itself. It is included as a sanity check.
import numpy as np
n = 100
my_array = np.random.normal(0, 1, size=n)
my_mean = np.mean(my_array) # sample mean of the data
my_var = np.var(my_array) # sample variance of the data
my_array = (my_array - my_mean) / np.sqrt(my_var) # standardized
# Estimate correlations between all pairings of the standardized data, pick off
# the appropriate subset of lag k correlations, and average the results at
# each lag. Scaling by n rather than (n-k) results in a biased estimator, but
# increases stability of the estimator and has relatively low impact for k
# small relative to n.
#
# Ideally the outcome should be a one followed by all zeros if the data are
# independent. In practice sampling variance will yield non-zero results, but
# they should converge towards zero as n is increased.
print(np.correlate(my_array, my_array, mode='full')[n - 1 : min(2*n, n + 10)] / n)
This gives results such as:
[ 1. -0.12001073 0.13843158 -0.0488547 -0.0372176 0.13162176 -0.00577363 -0.04491521 0.00122878 0.16740437 -0.07059456]
for n = 100, and
[ 1. 0.00129255 -0.02647235 -0.01868428 -0.00134764 -0.00246469 0.04393734 0.00669801 -0.04666181 -0.00369125 -0.02117044]
for n = 1000.
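As a cross-check (a sketch assuming the same kind of standardized array as above), the lag-1 value from the np.correlate approach should roughly agree with the direct corrcoef estimate on lag-1 slices used in the first answer:
import numpy as np

n = 1000
my_array = np.random.normal(0, 1, size=n)
z = (my_array - np.mean(my_array)) / np.std(my_array)  # standardized as above

# lag-1 entry of the full autocorrelation, scaled by n as in the code above
acf1 = np.correlate(z, z, mode='full')[n] / n
# direct estimate from the two lag-1 slices, as in the first answer
direct = np.corrcoef(z[1:], z[:-1])[0, 1]
print(acf1, direct)  # both small, and close to each other
The two differ slightly in how they centre and scale the lagged slices, so they won't match exactly, but both shrink towards zero as n grows.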