Home > OS >  Finding correlations of columns
Finding correlations of columns

Time:11-30

Preface: I am a beginner to R that is eager to learn. Please don't mistake the simplicity of the question (if it is a simple answer) for lack of research or effort!

I have a data.table named data with columns labelled V1 through V20. I want to take the correlation of certain columns (1,6,7,9,10 and 11) relative to V18 such that I have a new table that looks like the following

Variable     Correlation_to_V18
V1              cor(V1,V18)
V6              cor(V6,V18)
V7              cor(V7,V18)
V9              cor(V9,V18)
V10             cor(V10,V18)
V11             cor(V11,V18)

I tried using a for loop to no avail

column <- c(1,6,7,9,10,11)
for (i in column) {
  correlations<-cor(data[,18], data[,as.numeric(i)])
  cor_table<- data.table(variables = colnames(data[,as.numeric(column)]), correlation_val = correlations)
  return(cor_table)
}

CodePudding user response:

Let's make some data (50 rows x 20 columns) that sounds like your data:

library(dplyr)
library(tidyr)
library(corrr)

set.seed(123)
df1 <- data.frame(matrix(rnorm(1000), 
                  ncol = 20, 
                  nrow = 50, 
                  dimnames = list(c(), paste0("V", 1:20))))

In general when using R, processing dataframes using loops is a bad idea. There are apply functions, tidyverse packages and other packages to make working with dataframes easier.

In this case I would use the corrr library to return a dataframe with the correlation coefficients. You can then filter for the term of interest V18, reshape the output from "wide" to "long", and filter again for the other variables.

correlate(df1) %>% 
  filter(term == "V18") %>% 
  pivot_longer(2:21) %>% 
  filter(name %in% c("V1", "V6", "V7", "V9", "V10", "V11"))

Result:

# A tibble: 6 x 3
  term  name    value
  <chr> <chr>   <dbl>
1 V18   V1     0.125 
2 V18   V6    -0.167 
3 V18   V7     0.345 
4 V18   V9    -0.110 
5 V18   V10    0.0473
6 V18   V11   -0.0347

CodePudding user response:

Here is another potential solution:

library(data.table)
set.seed(123)
mat <- matrix(runif(400), nrow = 20, ncol = 20)
data <- data.table(mat)

columns_of_interest <- c(1,6,7,9,10,11,18)
data_subset <- data[, columns_of_interest, with=FALSE]
data_cor_values <- data_subset[, .(Correlation_to_V18 = cor(.SD, V18))]
data_cor_values[, variable := columns_of_interest]
data_cor_values
#>    Correlation_to_V18.V1 variable
#> 1:          -0.049812188        1
#> 2:          -0.375566877        6
#> 3:           0.089879501        7
#> 4:          -0.022499113        9
#> 5:          -0.007267059       10
#> 6:          -0.178489961       11
#> 7:           1.000000000       18

# Trim the last row (V18)
data_cor_values[1:.N-1,]
#>    Correlation_to_V18.V1 variable
#> 1:          -0.049812188        1
#> 2:          -0.375566877        6
#> 3:           0.089879501        7
#> 4:          -0.022499113        9
#> 5:          -0.007267059       10
#> 6:          -0.178489961       11

# Check the answer is correct:
cor(data_subset$V1, data_subset$V18)
#> [1] -0.04981219
cor(data_subset$V6, data_subset$V18)
#> [1] -0.3755669

Created on 2021-11-30 by the reprex package (v2.0.1)

  • Related