I have a huge dataset and have computed a large correlation matrix from it. My goal is to clean this up and create a new data frame containing every correlation whose absolute value is greater than 0.25, with the variable names included. For example, given the data set below, how would I use a doubly nested loop over the rows and columns of the correlation table?
a <- rnorm(10, 0 ,1)
b <- rnorm(10,1,1.5)
c <- rnorm(10,1.5,2)
d <- rnorm(10,-0.5,1)
e <- rnorm(10,-2,1)
matrix <- data.frame(a,b,c,d,e)
cor(matrix)
(Notice that there is redundancy in the matrix. You only need to inspect the first 5 columns, and you don't need to inspect all rows. If I'm looking at column 3, for example, I only need to start looking at row 4, after the correlation = 1.) Thank you.
CodePudding user response:
Is your ultimate goal to create a 5x5 matrix in which every value with an absolute value less than 0.25 is set to zero? This can be done on the correlation matrix via m <- cor(matrix); ifelse(abs(m) < 0.25, 0, m). If your goal is simply to loop over the rows and columns, this can be done via:
m <- cor(matrix)
for (row in rownames(m)){
for (col in colnames(m)){
#your code here
#operating on m[row,col]
}
}
To avoid redundancy:
for (row in rownames(m)[1:(length(rownames(m))-1)]){
for (col in colnames(m)[(which(colnames(m) == row) + 1):length(colnames(m))]){
#your code here
#operating on m[row,col]
print(m[row,col])
}
}
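The loop above only prints each value. To end up with the data frame the question asks for, the matching pairs can be collected as the loop runs. The following is a minimal sketch, not a definitive implementation: the names dat, strong, var1, var2 and correlation are made up for illustration, and the seed is added only so the numbers are reproducible.

```r
set.seed(1001)  # assumption: seed chosen only for reproducibility

# Hypothetical example data, same shape as in the question
dat <- data.frame(a = rnorm(10, 0, 1),
                  b = rnorm(10, 1, 1.5),
                  c = rnorm(10, 1.5, 2),
                  d = rnorm(10, -0.5, 1),
                  e = rnorm(10, -2, 1))

m <- cor(dat)
vars <- colnames(m)
n <- length(vars)

rows <- list()
for (j in seq_len(n - 1)) {    # columns 1 .. n-1
  for (i in (j + 1):n) {       # only rows strictly below the diagonal
    if (abs(m[i, j]) > 0.25) {
      rows[[length(rows) + 1]] <- data.frame(var1 = vars[i],
                                             var2 = vars[j],
                                             correlation = m[i, j],
                                             stringsAsFactors = FALSE)
    }
  }
}

# One row per pair that passed the threshold; NULL if no pair passed
strong <- do.call(rbind, rows)
print(strong)
```

Growing a list and binding once at the end avoids repeatedly copying a data frame inside the loop, which matters when the correlation matrix is large.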
CodePudding user response:
I'd suggest using the corrr package, in conjunction with tidyr and dplyr.
This allows you to generate a correlation data frame rather than a matrix and remove the duplicate values (where, for example, a-b is the same as b-a) using the shave function. You can then rearrange by pivoting, remove the NA values (from the diagonal, e.g. a-a, and from the shaved duplicates) and filter for absolute values greater than 0.25.
library(dplyr)
library(tidyr)
library(magrittr) # for the pipe %>% or just use library(tidyverse) instead of all 3
library(corrr)
# for reproducible values
set.seed(1001)
# no need to create separate vectors before building the data frame
# and don't call it matrix, that's a function name
mydata <- data.frame(a = rnorm(10, 0 ,1),
b = rnorm(10, 1, 1.5),
c = rnorm(10, 1.5, 2),
d = rnorm(10, -0.5, 1),
e = rnorm(10, -2, 1))
mydata %>%
correlate() %>%
shave() %>%
pivot_longer(2:6) %>%
na.omit() %>%
filter(abs(value) > 0.25)
Result:
# A tibble: 4 x 3
term name value
<chr> <chr> <dbl>
1 c b -0.296
2 d b 0.357
3 e a -0.440
4 e d -0.280
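If installing extra packages is not an option, a similar long-format table can be sketched in base R with lower.tri() and which(arr.ind = TRUE). This is an illustrative alternative rather than part of the corrr approach; the names dat, idx and strong are hypothetical, and the columns are named term, name and value only to mirror the output above.

```r
set.seed(1001)  # assumption: same seed as above, for reproducibility

dat <- data.frame(a = rnorm(10, 0, 1),
                  b = rnorm(10, 1, 1.5),
                  c = rnorm(10, 1.5, 2),
                  d = rnorm(10, -0.5, 1),
                  e = rnorm(10, -2, 1))

m <- cor(dat)

# Positions below the diagonal whose absolute correlation exceeds 0.25
idx <- which(lower.tri(m) & abs(m) > 0.25, arr.ind = TRUE)

strong <- data.frame(term  = rownames(m)[idx[, "row"]],
                     name  = colnames(m)[idx[, "col"]],
                     value = m[idx],
                     stringsAsFactors = FALSE)
print(strong)
```

Restricting to lower.tri() plays the same role as shave(): each pair appears once and the diagonal is skipped.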