I have around 15 dichotomous variables that I want to correlate with each other.
For simplicity, I will describe my problem using a similar but simpler data set.
Let's assume we have a data frame containing five variables:
var1 <- c(1,0,0,1,NA,1,0,0,1,NA)
var2 <- c(1,NA,1,1,NA,1,NA,1,1,NA)
var3 <- c(NA,0,0,1,NA,NA,0,0,1,NA)
var4 <- c(0,0,0,NA,1,0,0,0,NA,1)
var5 <- c(1,1,0,1,NA,1,1,0,1,NA)
DF <- data.frame(var1, var2, var3, var4, var5)
Since I only have binary variables, I cannot use a Pearson correlation.
I've read that a chi-square test or a phi coefficient would fit my problem, but I've only found instructions for two variables (a 2x2 table), not for multiple variables.
Is there a way to correlate several binary variables with one another and present the results as a matrix?
Thank you very much in advance for your answer!
CodePudding user response:
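For two 0/1 variables, the phi coefficient equals the Pearson correlation, so you can use cor to get it. Passing use = "pair" (short for "pairwise.complete.obs") drops the rows where either variable is NA.
cor(var1, var5, use = "pair")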
## [1] 0.5773503
library(psych)
phi(table(var1, var5), digits = 7)
## [1] 0.5773503
cor(DF, use = "pair")
##           var1 var2 var3 var4      var5
## var1 1.0000000   NA  1.0   NA 0.5773503
## var2        NA   NA   NA   NA        NA
## var3 1.0000000   NA  1.0   NA 0.5000000
## var4        NA   NA   NA    1        NA
## var5 0.5773503   NA  0.5   NA 1.0000000
## Warning message:
## In cor(DF, use = "pair") : the standard deviation is zero
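As a check on the claim that phi and the Pearson correlation coincide for 0/1 data, here is a minimal sketch that computes phi directly from the 2x2 table and compares it with cor. The helper name phi_manual is hypothetical (it is not part of base R or psych), and the sketch assumes the variables are coded strictly as 0/1 with NA for missing values.

```r
# Same vectors as in the question
var1 <- c(1, 0, 0, 1, NA, 1, 0, 0, 1, NA)
var5 <- c(1, 1, 0, 1, NA, 1, 1, 0, 1, NA)

# phi_manual is a hypothetical helper written for illustration only
phi_manual <- function(x, y) {
  # Fixing the levels keeps the table 2x2 even if a value never occurs;
  # pairs with an NA in either variable are not counted by table()
  tbl <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  num <- tbl[2, 2] * tbl[1, 1] - tbl[2, 1] * tbl[1, 2]
  den <- sqrt(sum(tbl[1, ]) * sum(tbl[2, ]) * sum(tbl[, 1]) * sum(tbl[, 2]))
  as.numeric(num) / den  # NaN when a margin is zero (constant variable)
}

phi_val <- phi_manual(var1, var5)
cor_val <- cor(var1, var5, use = "pairwise.complete.obs")
all.equal(phi_val, cor_val)
```

A zero margin in the table corresponds to the zero standard deviation that cor warns about, which is why constant columns such as var2 produce NA in the matrix above.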
CodePudding user response:
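Here are two measures of similarity between binary variables: the Jaccard index (shared 1s divided by pairs where at least one variable is 1) and accuracy (the proportion of complete pairs whose values match).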
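jaccard <- function(x, y){
  # fix the levels so the table is always 2x2; pairs containing NA are dropped
  x <- factor(x, levels = 0:1)
  y <- factor(y, levels = 0:1)
  tbl <- table(x, y)
  # both 1, divided by pairs where at least one is 1
  tbl[2, 2]/(tbl[1, 2] + tbl[2, 1] + tbl[2, 2])
}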
sapply(DF, \(X) sapply(DF, \(Y) jaccard(X, Y)))
sapply(DF, \(X) sapply(DF, \(Y) mean(X == Y, na.rm = TRUE)))
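The jaccard function above returns a similarity (1 means the two variables have 1s in exactly the same places). If a distance is wanted instead, a common convention is 1 minus the similarity. The following is a self-contained sketch under that assumption; the names jaccard_sim, sim and jaccard_dist are illustrative, and function(X) is used in place of \(X) so it also runs on R versions before 4.1.

```r
var1 <- c(1, 0, 0, 1, NA, 1, 0, 0, 1, NA)
var2 <- c(1, NA, 1, 1, NA, 1, NA, 1, 1, NA)
var3 <- c(NA, 0, 0, 1, NA, NA, 0, 0, 1, NA)
var4 <- c(0, 0, 0, NA, 1, 0, 0, 0, NA, 1)
var5 <- c(1, 1, 0, 1, NA, 1, 1, 0, 1, NA)
DF <- data.frame(var1, var2, var3, var4, var5)

# Jaccard similarity on pairwise-complete observations
jaccard_sim <- function(x, y) {
  tbl <- table(factor(x, levels = 0:1), factor(y, levels = 0:1))
  # NaN when neither variable has a 1 among the complete pairs
  tbl[2, 2] / (tbl[1, 2] + tbl[2, 1] + tbl[2, 2])
}

# 5x5 similarity matrix with variable names as dimnames
sim <- sapply(DF, function(X) sapply(DF, function(Y) jaccard_sim(X, Y)))

# Assumed convention: distance = 1 - similarity
jaccard_dist <- 1 - sim
round(jaccard_dist, 3)
```

Entries where the denominator is zero (for example var3 against var4, which never take the value 1 on a shared complete row) come out as NaN and would need to be handled before passing the matrix to something like as.dist.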