Correlation Matrix with multiple binary variables

Time:11-10

I have a few dichotomous variables (around 15) that I want to correlate with each other.

For simplicity, I will describe my problem using a similar but simpler data set.

Let's assume we have a data frame containing five variables:

var1 <- c(1,0,0,1,NA,1,0,0,1,NA)
var2 <- c(1,NA,1,1,NA,1,NA,1,1,NA)
var3 <- c(NA,0,0,1,NA,NA,0,0,1,NA)
var4 <- c(0,0,0,NA,1,0,0,0,NA,1)
var5 <- c(1,1,0,1,NA,1,1,0,1,NA)

DF <- data.frame(var1, var2, var3, var4, var5)

Since I only have binary variables, I cannot use a Pearson correlation.

I've read that a chi-square test or a phi correlation would fit my problem, but I've only found instructions for two variables (a 2x2 table), not for multiple variables.

Is there a way to correlate several binary variables with one another and to represent them using a matrix?

Thank you very much in advance for your answer!

CodePudding user response:

For two binary variables, the phi coefficient equals the Pearson correlation, so you can use cor to get it.

cor(var1, var5, use = "pair")
## [1] 0.5773503

library(psych)
phi(table(var1, var5), digits = 7)
## [1] 0.5773503
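
To see why the two calls agree, phi can be computed by hand from the 2x2 table. This is a minimal sketch: `phi_manual` is a hypothetical helper written for illustration and is not part of base R or psych.

```r
# Same example vectors as in the question
var1 <- c(1,0,0,1,NA,1,0,0,1,NA)
var5 <- c(1,1,0,1,NA,1,1,0,1,NA)

# phi_manual is a hypothetical helper, not a function from any package
phi_manual <- function(x, y) {
  # table() drops pairs containing an NA by default (pairwise deletion);
  # fixing the levels keeps the table 2x2 even if a value never occurs
  tbl <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  a <- as.numeric(tbl[1, 1]); b <- as.numeric(tbl[1, 2])
  c <- as.numeric(tbl[2, 1]); d <- as.numeric(tbl[2, 2])
  # phi = (ad - bc) / sqrt(product of the four marginal totals)
  (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
}

phi_manual(var1, var5)
cor(var1, var5, use = "pairwise.complete.obs")
```

Both lines should print the same value. If one of the variables is constant, a marginal total is zero and the helper returns NaN.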

cor(DF, use = "pair")
##           var1 var2 var3 var4      var5
## var1 1.0000000   NA  1.0   NA 0.5773503
## var2        NA   NA   NA   NA        NA
## var3 1.0000000   NA  1.0   NA 0.5000000
## var4        NA   NA   NA    1        NA
## var5 0.5773503   NA  0.5   NA 1.0000000
## Warning message:
## In cor(DF, use = "pair") : the standard deviation is zero
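
The NA entries and the warning appear wherever at least one variable of a pair is constant over the rows both variables share: var2 contains only 1s, and var4 is 0 on every row it shares with the other variables. The sketch below is one way to inspect this, assuming the example data from the question; `ok` and `n_pairs` are names chosen here for illustration.

```r
var1 <- c(1,0,0,1,NA,1,0,0,1,NA)
var2 <- c(1,NA,1,1,NA,1,NA,1,1,NA)
var3 <- c(NA,0,0,1,NA,NA,0,0,1,NA)
var4 <- c(0,0,0,NA,1,0,0,0,NA,1)
var5 <- c(1,1,0,1,NA,1,1,0,1,NA)
DF <- data.frame(var1, var2, var3, var4, var5)

# 1 where a value is observed, 0 where it is NA
ok <- (!is.na(as.matrix(DF))) + 0

# n_pairs[i, j] = number of rows in which both variables are observed,
# i.e. the sample size behind each pairwise correlation
n_pairs <- crossprod(ok)
n_pairs

# var2 has no variation at all, so its correlations are undefined
sd(var2, na.rm = TRUE)

# var4 is constant on the rows it shares with var1
table(var4[!is.na(var1) & !is.na(var4)])
```

With only a handful of complete pairs per cell, the coefficients that are defined rest on very few observations.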

CodePudding user response:

Here are two measures of similarity between binary variables: the Jaccard index and accuracy (the proportion of matching values).

jaccard <- function(x, y){
  x <- factor(x, levels = 0:1)
  y <- factor(y, levels = 0:1)
  tbl <- table(x, y)
  tbl[2, 2]/(tbl[1, 2] + tbl[2, 1] + tbl[2, 2])
}

sapply(DF, \(X) sapply(DF, \(Y) jaccard(X, Y)))
sapply(DF, \(X) sapply(DF, \(Y) mean(X == Y, na.rm = TRUE)))
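
As a spot check, the two matrices can be stored and one cell compared with a hand count. This is a sketch under the assumption that the data are the example vectors from the question; `pairwise_apply` is a hypothetical helper that only wraps the nested sapply calls.

```r
var1 <- c(1,0,0,1,NA,1,0,0,1,NA)
var2 <- c(1,NA,1,1,NA,1,NA,1,1,NA)
var3 <- c(NA,0,0,1,NA,NA,0,0,1,NA)
var4 <- c(0,0,0,NA,1,0,0,0,NA,1)
var5 <- c(1,1,0,1,NA,1,1,0,1,NA)
DF <- data.frame(var1, var2, var3, var4, var5)

jaccard <- function(x, y) {
  tbl <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  # pairs where both are 1, divided by pairs where at least one is 1
  tbl[2, 2] / (tbl[1, 2] + tbl[2, 1] + tbl[2, 2])
}

# pairwise_apply is a hypothetical helper: applies f to every pair of columns
pairwise_apply <- function(df, f) {
  sapply(df, function(x) sapply(df, function(y) f(x, y)))
}

jac <- pairwise_apply(DF, jaccard)
acc <- pairwise_apply(DF, function(x, y) mean(x == y, na.rm = TRUE))

jac["var1", "var5"]  # 4 / (2 + 0 + 4): four 1/1 pairs, two 0/1 pairs
acc["var1", "var5"]  # 6 matching values out of 8 complete pairs
jac["var3", "var4"]  # NaN: no 1s among the rows these two share
```

The Jaccard index ignores 0/0 matches, while accuracy counts them, so the two can differ substantially when 1s are rare.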