Rolling correlations across multiple columns, some with NAs?

I have the dataset below and want to compute a rolling 3-day correlation across x, y, z and a, i.e. rolling correlations for every pair: xy, xz, xa, yz, ya and so on. As you can see, the data for y and a are incomplete, so for those columns I would like the rolling correlations to start from the date where they first have values (id 3 for y and id 4 for a).

How should I accomplish this? Don't know where to start...

set.seed(42) 
n <- 10
dat <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-22"), as.Date("2020-12-31"), "day"),
                  x=rnorm(n),
                  y=rnorm(n),
                  z=rnorm(n),
                  a=rnorm(n))
dat$y[1:2] <- NA
dat$a[1:3] <- NA

I found the following code on Stack Overflow, but it only gives the correlations of the first column against the others, not of all pairs of columns:

rollapplyr(x, 5, function(x) cor(x[, 1], x[, -1]), by.column = FALSE)

CodePudding user response：

combn produces all pairwise combinations of the columns.

cols <- c("x", "y", "z", "a")

combn(cols, 2)
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "x"  "x"  "x"  "y"  "y"  "z" 
# [2,] "y"  "z"  "a"  "z"  "a"  "a"

combn also accepts a function argument, which is applied to each combination. In that function, first na.omit all rows with NAs. Then use mapply to shift the index sequence 1:w along the rows and calculate the correlation for each window, until nrow is reached.

w <- 3  ## size of the rolling window
combn(dat[cols], 2, function(x) {
  X <- na.omit(x)
  n <- nrow(X)
  mapply(function(y, z) cor(X[y + z, 1], X[y + z, 2]), list(1:w), 0:(n - w))
}, simplify=FALSE)
# [[1]]
# [1]  0.5307784 -0.9874843 -0.8364802  0.2407730  0.3655328 -0.4458231
# 
# [[2]]
# [1]  0.8121466  0.9652715  0.3304100  0.8278965 -0.1425097  0.5832558  0.9959705
# [8]  0.8696023
# 
# [[3]]
# [1]  0.6733985  0.2194488  0.5593983 -0.6589249 -0.9291184
# 
# [[4]]
# [1]  0.97528684 -0.90599558 -0.42319742  0.92882443  0.28058418  0.05427966
# 
# [[5]]
# [1] -0.7815678 -0.7182037 -0.6698260  0.4592962  0.7452225
# 
# [[6]]
# [1]  0.9721521  0.9343926 -0.3470329 -0.7237291 -0.6253825
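
The list above is unnamed, so it is not obvious which vector belongs to which pair or which dates each value covers. A minimal sketch, assuming the same dat, cols and w as above and that every pair has at least w complete rows, that labels each element with its pair and records the last date of each window. The helper roll_pair, the variable pair_list and the "x.y" naming scheme are illustrative choices, not part of the original answer:

```r
set.seed(42)
n <- 10
dat <- data.frame(id = 1:n,
                  date = seq.Date(as.Date("2020-12-22"), as.Date("2020-12-31"), "day"),
                  x = rnorm(n), y = rnorm(n), z = rnorm(n), a = rnorm(n))
dat$y[1:2] <- NA
dat$a[1:3] <- NA

cols <- c("x", "y", "z", "a")
w <- 3  # size of the rolling window

# hypothetical helper: rolling correlation for one pair of column names
roll_pair <- function(p) {
  X <- na.omit(dat[c("date", p)])       # drop rows where either column is NA
  starts <- seq_len(nrow(X) - w + 1)    # first row of each window
  data.frame(
    end_date = X$date[starts + w - 1],  # last date covered by each window
    cor = sapply(starts, function(s) {
      idx <- s:(s + w - 1)
      cor(X[[p[1]]][idx], X[[p[2]]][idx])
    })
  )
}

pair_list <- combn(cols, 2, simplify = FALSE)
res <- lapply(pair_list, roll_pair)
names(res) <- sapply(pair_list, paste, collapse = ".")
```

res[["x.a"]] is then a data frame with one row per window, starting at the first window in which both x and a have values.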

CodePudding user response：

Create a data frame with only the wanted columns and then use rollapplyr with cor. cor takes a use= argument that specifies how missing values are handled. See ?cor for the values it can take, since you may or may not wish to use the value used below ("pairwise").
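
To make the effect of use= concrete, here is a small sketch on made-up vectors. The names u, v, s and m are illustrative and unrelated to the question's data:

```r
u <- c(1, 2, 3, 4, 5)
v <- c(NA, 1, 3, 2, 5)
s <- c(2, 1, 4, 3, NA)
m <- cbind(u, v, s)

# default use = "everything": a pair involving a column with NAs gives NA
c_all <- cor(m)

# "complete.obs": only rows with no NA in any column (rows 2 to 4 here)
c_complete <- cor(m, use = "complete.obs")

# "pairwise": each pair uses the rows that are complete for that pair
c_pairwise <- cor(m, use = "pairwise")
```

With "pairwise", the u/v entry is computed from rows 2 to 5, while with "complete.obs" it is computed from rows 2 to 4 only, because row 5 has an NA in s.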

The result r is a matrix whose i-th row describes the correlation matrix of the 5 dat2 rows ending in and including row i. That is, matrix(r[i, ], 4, 4) is the correlation matrix of dat2[i-(4:0), ].

We can also create ar, a 3d array such that ar[i,,] is the correlation matrix of the 5 rows of dat2 ending in and including row i.

That is, the following three expressions are equal for each i in 5, ..., nrow(dat2). (The first 4 rows of r are all NA since there do not exist 5 rows leading to those rows.)

1. cor(dat2[i-(4:0), ], use = "pairwise")
2. matrix(r[i, ], 4, 4)
3. ar[i,,]

We run checks for these equivalences for i=5 below.

library(zoo)
w <- 5
dat2 <- dat[c("x", "y", "z", "a")]
nr <- nrow(dat2)
nc <- ncol(dat2)


r <- rollapplyr(dat2, w, cor, use = "pairwise", by.column = FALSE, fill = NA)
colnames(r) <- paste(names(dat2)[c(row(diag(nc)))], 
                     names(dat2)[c(col(diag(nc)))], sep = ".")

ar <- array(r, c(nr, nc, nc), 
  dimnames = list(NULL, names(dat2), names(dat2)))

# run some checks

cor5 <- cor(dat2[1:w, ], use = "pairwise") # cor of 1st w rows

# same except for names
all.equal(unname(cor5), matrix(r[w, ], nc))
## [1] TRUE

all.equal(cor5, ar[w,,])
## [1] TRUE

The above shows a matrix whose rows are strung out correlation matrices and a 3d array whose slices are correlation matrices. Another possibility for output is to create a list of correlation matrices.

lapply(1:nr, function(i) {
  if (i >= w) cor(dat2[i-((w-1):0), ], use = "pairwise")
})
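
In that list, the first w - 1 elements are NULL because no full window exists yet. A minimal sketch, assuming dat2 and w as defined above, of pulling the rolling correlation of a single pair out of such a list. The names cor_list and xz are chosen here for illustration:

```r
set.seed(42)
n <- 10
dat <- data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n), a = rnorm(n))
dat$y[1:2] <- NA
dat$a[1:3] <- NA

dat2 <- dat[c("x", "y", "z", "a")]
w <- 5
nr <- nrow(dat2)

# list of correlation matrices, NULL where fewer than w rows are available
cor_list <- lapply(1:nr, function(i) {
  if (i >= w) cor(dat2[i - ((w - 1):0), ], use = "pairwise")
})

# rolling correlation of x and z, with NA where no full window exists
xz <- sapply(cor_list, function(m) if (is.null(m)) NA_real_ else m["x", "z"])
```

Any other pair can be extracted the same way by changing the row and column names in m["x", "z"].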