Correlations between columns with NAs in them-CodePudding

I am trying to do compute some correlations between the first 3 countries below (Germany, Italy, Japan) vs the 2nd 3 countries (US, Canada, UK).

So for example, it would be Germany vs US, Germany vs Canada, Germany vs UK, then Italy vs US, Italy, vs Canada, Italy vs UK and so on.

However, as my data do have some gaps in the first few rows (like in Germany & Japan), the usual corr function would not work. So in this case, the data should skip the first 2 rows of Germany & first 3 of Japan and start on row 3 & 4 for Germany & Japan, to compare with similar rows for US/Canada/UK, while having the correlation for full data for Italy.

So I was wondering how should I do this?

Thanks

df11 <- 
     tibble(
             date = 2001:2010,
             Germany = runif(10),
             Italy = runif(10),
             Japan = runif(10),
             US = runif(10),
             Canada = runif(10),
             UK = runif(10)
            )
    df11$Germany[1:2] <- NA
    df11$Japan[1:3] <- NA

CodePudding user response：

From mycor function you can find in r-bloggers,

mycor<- function(x,...){
  r<- apply(x, 2, function(j){
    apply(x, 2, function(i){
      as.numeric(cor.test(i,j)$estimate)
    })
  })
  P<- apply(x, 2, function(j){
    apply(x, 2, function(i){
      as.numeric(cor.test(i,j)$p.value)
    })
  })
  out<-c()
  out$P<- P
  out$r<- r
  return(out) 
}
mycor(df11)$r

               date    Germany       Italy       Japan          US      Canada         UK
date     1.00000000  0.3829687 -0.09309048 -0.46562050 -0.44324591  0.41293491  0.7908250
Germany  0.38296868  1.0000000  0.32186956 -0.19135611 -0.49111087 -0.38151625  0.6377928
Italy   -0.09309048  0.3218696  1.00000000  0.04341171  0.09589073 -0.32724552  0.2138135
Japan   -0.46562050 -0.1913561  0.04341171  1.00000000  0.52800797  0.25226383 -0.4802936
US      -0.44324591 -0.4911109  0.09589073  0.52800797  1.00000000  0.08124373 -0.4869257
Canada   0.41293491 -0.3815163 -0.32724552  0.25226383  0.08124373  1.00000000  0.1033160
UK       0.79082499  0.6377928  0.21381349 -0.48029364 -0.48692567  0.10331600  1.0000000

CodePudding user response：

Using cor in combn. Instead of df[-1], which just excludes the date column, you could explicitly define e.g. df11[c("Germany", "Italy", "Japan")] to make a specific selection of countries. use="complete.obs" includes just complete observations, see documentation ?cor for other options.

res <- combn(df11[-1], 2, cor, use="complete.obs", simplify=FALSE)

Result

res
# [[1]]
#             Germany       Italy
# Germany  1.00000000 -0.08586634
# Italy   -0.08586634  1.00000000
# 
# [[2]]
#           Germany     Japan
# Germany 1.0000000 0.3699611
# Japan   0.3699611 1.0000000
# 
# [[3]]
#            Germany         US
# Germany 1.00000000 0.07002937
# US      0.07002937 1.00000000
# 
# [[4]]
#           Germany    Canada
# Germany 1.0000000 0.3949677
# Canada  0.3949677 1.0000000
# 
# [[5]]
#           Germany        UK
# Germany 1.0000000 0.5248062
# UK      0.5248062 1.0000000
# 
# [[6]]
#            Italy      Japan
# Italy 1.00000000 0.09700777
# Japan 0.09700777 1.00000000
# 
# [[7]]
#            Italy         US
# Italy  1.0000000 -0.1351394
# US    -0.1351394  1.0000000
# 
# [[8]]
#             Italy     Canada
# Italy   1.0000000 -0.1587657
# Canada -0.1587657  1.0000000
# 
# [[9]]
#            Italy         UK
# Italy  1.0000000 -0.5379418
# UK    -0.5379418  1.0000000
# 
# [[10]]
#           Japan        US
# Japan  1.000000 -0.744641
# US    -0.744641  1.000000
# 
# [[11]]
#            Japan    Canada
# Japan  1.0000000 0.5813378
# Canada 0.5813378 1.0000000
# 
# [[12]]
#            Japan         UK
# Japan  1.0000000 -0.1877573
# UK    -0.1877573  1.0000000
# 
# [[13]]
#                US     Canada
# US      1.0000000 -0.5947739
# Canada -0.5947739  1.0000000
# 
# [[14]]
#            US         UK
# US 1.00000000 0.01007044
# UK 0.01007044 1.00000000
# 
# [[15]]
#           Canada        UK
# Canada 1.0000000 0.3784406
# UK     0.3784406 1.0000000

If you are interested in the coefficients rather than the matrices, just do.

sapply(res, `[`, 2)
# [1] -0.08586634  0.36996112  0.07002937  0.39496772  0.52480618  0.09700777
# [7] -0.13513936 -0.15876568 -0.53794180 -0.74464099  0.58133779 -0.18775725
# [13] -0.59477393  0.01007044  0.37844058

Or if you want the matrix,

cor(df11[-1], use="complete.obs")
#            Germany       Italy       Japan          US     Canada         UK
# Germany  1.00000000  0.05086447  0.36996112 -0.02672511  0.3531181  0.4261229
# Italy    0.05086447  1.00000000  0.09700777 -0.03657892 -0.1514971 -0.6038602
# Japan    0.36996112  0.09700777  1.00000000 -0.74464099  0.5813378 -0.1877573
# US      -0.02672511 -0.03657892 -0.74464099  1.00000000 -0.6574211  0.2106356
# Canada   0.35311813 -0.15149706  0.58133779 -0.65742113  1.0000000  0.2452463
# UK       0.42612287 -0.60386019 -0.18775725  0.21063562  0.2452463  1.0000000

If you aim to use cor.test you can use the names in combn and paste them as.formula using sprintf. See above how to get just a subset.

res2 <- combn(names(df11[-1]), 2, \(x) 
              cor.test(as.formula(sprintf('~ %s   %s', x[1], x[2])), data=df11), 
              simplify=F)

Result 2

res2[[1]]  ## first result
# Pearson's product-moment correlation
# 
# data:  Germany and Italy
# t = -0.21111, df = 6, p-value = 0.8398
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  -0.7454347  0.6586606
# sample estimates:
#         cor 
# -0.08586634