Number of observations used by cor function in R

I have a large matrix in R (more than 2,000 columns and 10,000 rows) with many missing values. I compute its correlation matrix with this line:

cor(data, use = "complete.obs")

My question is: how can I find the number of observations that have been used to calculate each correlation in the output matrix?

The output should be something like this:

	v1	v2	v3	v4
v1	20	12	15	18
v2	12	15	10	11
v3	15	10	25	20
v4	18	11	20	20

Thanks for any suggestions.

CodePudding user response：

Let's use a sample matrix, data, with NAs at random positions:

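library(dplyr)  # only needed here for the %>% pipe

set.seed(1234)
data <- rnorm(100) %>%
    matrix(nrow = 10) %>%
    {
        m <- .
        # set a cell to NA wherever a second random draw exceeds 0.5
        m[rnorm(100) > .5] <- NA
        m
    }

data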


            [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
 [1,] 0.48522682         NA  0.8951720 -0.32439330 0.05913517  0.4369306
 [2,] 0.69676878 -0.4002352  0.6602126          NA 0.41339889         NA
 [3,] 0.18551392  1.4934931  2.2734835 -0.93350334         NA  0.4521904
 [4,]         NA -1.6070809  1.1734976          NA         NA  0.6631986
 [5,] 0.31168103 -0.4157518  0.2877097  0.31916024 0.71888873 -1.1363736
 [6,] 0.76046236         NA -0.6597701 -1.07754212         NA         NA
 [7,] 1.84246363 -0.1517365         NA -3.23315213 1.35727444         NA
 [8,]         NA         NA  0.6774155          NA 0.40446847 -1.2239038
 [9,] 0.03266396 -0.3047211         NA  0.02951783 0.26436427  0.2580684
[10,]         NA  0.6295361  0.1864921  0.59427377 0.26804390         NA
            [,7]       [,8]       [,9]      [,10]
 [1,]         NA -0.3046139 -1.0118219         NA
 [2,]         NA  1.8250111  0.4701675  0.1832475
 [3,]  0.1586254  0.6705594 -0.7009703 -1.7662292
 [4,] -1.7632551  0.9486326         NA         NA
 [5,]  0.3385960  2.0494030         NA         NA
 [6,]         NA -0.6511136         NA         NA
 [7,] -0.2386466  0.8086193         NA -1.1750368
 [8,] -1.1877653  0.9865806 -0.2457632         NA
 [9,]  0.3849353         NA -1.5528590  0.3536254
[10,]         NA  0.3190524  0.1284340  0.3191562
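
The same kind of matrix can be built in base R without the pipe, which may be easier to read. This is only a sketch: the name mask is introduced here for illustration, and the code simply makes the two rnorm(100) draws one after the other.

```r
set.seed(1234)

# First draw: the values of the 10 x 10 matrix.
data <- matrix(rnorm(100), nrow = 10)

# Second draw: TRUE marks the cells that will be set to NA.
# `mask` is a hypothetical helper name, not part of the original answer.
mask <- matrix(rnorm(100) > 0.5, nrow = 10)

data[mask] <- NA
```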

You can transform it into a logical matrix dna where dna[i,j] == TRUE means that data[i,j] is not NA:

dna <- !is.na(data)

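Then take the matrix product t(dna) %*% dna, which crossprod(dna) computes directly. Entry [i,j] counts the rows in which columns i and j are both non-missing, that is, the number of observations available for the correlation between columns i and j (the count cor uses with use = "pairwise.complete.obs"):

crossprod(dna)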

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    8    7    4    6    4    3    4    8    5     7
 [2,]    7    9    6    6    5    4    6    8    6     8
 [3,]    4    6    6    4    4    3    4    5    4     5
 [4,]    6    6    4    7    3    3    3    6    5     6
 [5,]    4    5    4    3    5    2    4    5    3     4
 [6,]    3    4    3    3    2    5    4    4    3     5
 [7,]    4    6    4    3    4    4    6    5    4     6
 [8,]    8    8    5    6    5    4    5    9    5     8
 [9,]    5    6    4    5    3    3    4    5    6     5
[10,]    7    8    5    6    4    5    6    8    5     9
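
To see what these counts mean, here is a small check on a hand-made matrix (the data and the names x, n_pairwise, n_loop and n_complete are made up for illustration). It compares the cross-product of the non-missing indicator with a brute-force count over column pairs, and also shows the single count that applies when use = "complete.obs" is used, since that option drops every row containing any NA before computing all correlations.

```r
# Small hypothetical matrix so the counts can be checked by eye.
x <- cbind(
  v1 = c(1, 2, NA, 4, 5),
  v2 = c(2, NA, NA, 8, 9),
  v3 = c(5, 3, 1, NA, 2)
)

# Pairwise counts: rows where both columns are observed.
n_pairwise <- crossprod(!is.na(x))
n_pairwise
#    v1 v2 v3
# v1  4  3  3
# v2  3  3  2
# v3  3  2  4

# Brute-force version of the same count, for comparison.
n_loop <- sapply(seq_len(ncol(x)), function(j) {
  sapply(seq_len(ncol(x)), function(i) {
    sum(!is.na(x[, i]) & !is.na(x[, j]))
  })
})

# With use = "complete.obs", every correlation uses the same rows:
# those with no missing value in any column.
n_complete <- sum(complete.cases(x))
n_complete
# 2
```

So if the correlations really are computed with use = "complete.obs", the number of observations is the same for every pair, namely sum(complete.cases(data)); per-pair counts like those in the question only arise with use = "pairwise.complete.obs".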