Correlations of a variable with multiple variables

Time:08-21

My input data df is:

Action          Difficulty  strings characters  POS NEG NEU
Field           0.635       7       59          0   0   7
Field or Catch  0.768       28      193         0   0   28
Field or Ball   -0.591      108     713         6   0   101
Ball            -0.717      61      382         3   0   57
Catch           -0.145      89      521         1   0   88
Field           0.28        208     1214        2   3   178
Field and run   1.237       18      138         1   0   17

I am interested in group-based correlations of Difficulty with the remaining variables strings, characters, POS, NEG, NEU. The grouping variable is Action. If I am interested only in the group Field, I can do filter(str_detect(Action, 'Field')).

I can do it one by one between Difficulty and the remaining variables. But is there a faster way to do it in one command with multiple variables? My partial solution is:

df %>%
 filter(str_detect(Action, 'Field')) %>%
 na.omit %>%   # Original data had multiple NA
 group_by(Action) %>%
 summarise_all(funs(cor))

But this results in an error.

Some relevant SO posts that I looked at are:

Find correlation coefficient of two columns in a dataframe by group: quite relevant for generating a correlation matrix, but it does not address my question.
Check the correlation of two columns in a dataframe (in R): useful for computing different types of correlations, and it introduces a different way of ignoring NAs.

Any help or guidance on this would be greatly appreciated!

For reference, this is the sample dput()

structure(list(
Action = c("Field", "Field or Catch", "Field or Ball", "Ball", "Catch", "Field", "Field and run"), Difficulty = c(0.635, 0.768, -0.591, -0.717, -0.145, 0.28, 1.237), 
strings = c(7L, 28L, 108L, 61L, 89L, 208L, 18L), 
characters = c(59L, 193L, 713L, 382L, 521L, 1214L, 138L), 
POS = c(0L, 0L, 6L, 3L, 1L, 2L, 1L), 
NEG = c(0L, 0L, 0L, 0L, 0L, 3L, 0L), 
NEU = c(7L, 28L, 101L, 57L, 88L, 178L, 17L)), 
class = "data.frame", row.names = c(NA, 
-7L))

Answer:

You may try -

library(dplyr)
library(stringr)

df %>%
  filter(str_detect(Action, 'Field')) %>%
  na.omit %>%   # Original data had multiple NA
  group_by(Action) %>%
  summarise(across(-Difficulty, ~cor(.x, Difficulty)))
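
One caveat with the grouped version, sketched here in base R under the assumption that df is the sample from the question's dput(): after filtering, most Action groups contain a single row, where a correlation is undefined, and a group with exactly two rows can only give +1 or -1. Checking the group sizes first shows whether the per-group numbers will be meaningful. The names field, group_sizes, two_rows and r_field are illustrative only.

```r
# Sample data copied from the question's dput()
df <- data.frame(
  Action = c("Field", "Field or Catch", "Field or Ball", "Ball",
             "Catch", "Field", "Field and run"),
  Difficulty = c(0.635, 0.768, -0.591, -0.717, -0.145, 0.28, 1.237),
  strings = c(7L, 28L, 108L, 61L, 89L, 208L, 18L),
  characters = c(59L, 193L, 713L, 382L, 521L, 1214L, 138L),
  POS = c(0L, 0L, 6L, 3L, 1L, 2L, 1L),
  NEG = c(0L, 0L, 0L, 0L, 0L, 3L, 0L),
  NEU = c(7L, 28L, 101L, 57L, 88L, 178L, 17L),
  stringsAsFactors = FALSE
)

# Keep rows whose Action contains "Field" (base R stand-in for str_detect)
field <- df[grepl("Field", df$Action, fixed = TRUE), ]

# Rows per group: a group needs at least 3 rows for an informative correlation
group_sizes <- table(field$Action)
print(group_sizes)

# The "Field" group has exactly two rows, so its correlation is forced to +1 or -1
two_rows <- field[field$Action == "Field", ]
r_field <- cor(two_rows$strings, two_rows$Difficulty)
print(r_field)
```

On the full data, where groups are presumably larger, the grouped summarise() above should give more useful values.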

If you don't want to group_by Action -

df %>%
  filter(str_detect(Action, 'Field')) %>%
  na.omit %>%  
  summarise(across(-c(Difficulty, Action), ~cor(.x, Difficulty)))

#    strings characters        POS        NEG        NEU
#1 -0.557039 -0.5983826 -0.8733465 -0.1520684 -0.5899733
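
As a cross-check that does not need dplyr, the ungrouped result can also be computed with base R. This is a minimal sketch, assuming df is the sample from the question's dput(); the names field, predictors and cors are illustrative. cor() accepts a data frame of numeric columns as x and a vector as y, and use = "complete.obs" is one way to drop rows containing NA in place of na.omit.

```r
# Sample data copied from the question's dput()
df <- data.frame(
  Action = c("Field", "Field or Catch", "Field or Ball", "Ball",
             "Catch", "Field", "Field and run"),
  Difficulty = c(0.635, 0.768, -0.591, -0.717, -0.145, 0.28, 1.237),
  strings = c(7L, 28L, 108L, 61L, 89L, 208L, 18L),
  characters = c(59L, 193L, 713L, 382L, 521L, 1214L, 138L),
  POS = c(0L, 0L, 6L, 3L, 1L, 2L, 1L),
  NEG = c(0L, 0L, 0L, 0L, 0L, 3L, 0L),
  NEU = c(7L, 28L, 101L, 57L, 88L, 178L, 17L),
  stringsAsFactors = FALSE
)

# Keep rows whose Action contains "Field"
field <- df[grepl("Field", df$Action, fixed = TRUE), ]

# Every numeric column except Difficulty
predictors <- field[, c("strings", "characters", "POS", "NEG", "NEU")]

# cor() returns a one-column matrix here; [, 1] turns it into a named vector
cors <- cor(predictors, field$Difficulty, use = "complete.obs")[, 1]
print(round(cors, 4))
```

The printed values should agree with the summarise(across(...)) output shown above.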