dplyr: correlations with NA

Time:05-03

xx <- data.frame(group = rep(1:4, each=100), a = rnorm(100) , b = rnorm(100))
xx[c(1,14,33), 'b'] = NA

I'm trying to calculate correlations by group but I'm getting an error when there are NAs.

library(dplyr)
xx %>% group_by(group) %>% summarize(COR=cor(a,b,na.rm=TRUE))
    
Error: Problem with `summarise()` column `COR`.
    i `COR = cor(a, b, na.rm = TRUE)`.
    x unused argument (na.rm = TRUE)
    i The error occurred in group 1: group = 1.
    Run `rlang::last_error()` to see where the error occurred.

Answer:

cor() has no na.rm argument; missing values are handled through the use argument instead. According to ?cor, the usage is

cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))

use - an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".

library(dplyr)
xx %>%
   group_by(group) %>%
   summarize(COR=cor(a,b, use = "complete.obs"))

Output:

# A tibble: 4 × 2
  group   COR
  <int> <dbl>
1     1 0.166
2     2 0.190
3     3 0.190
4     4 0.190
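
To see what the use argument changes, here is a minimal base R sketch. The vectors x and y are made up for illustration and are not part of the question's data. With the default use = "everything", a single NA makes the result NA, while "complete.obs" drops the incomplete pair first.

```r
# Illustrative vectors (hypothetical, not from the question)
x <- c(1, 2, 3, 4, 5)
y <- c(2, NA, 6, 8, 11)

# Default use = "everything": any NA propagates to the result
r_default <- cor(x, y)

# "complete.obs": pairs containing an NA are dropped before computing
r_complete <- cor(x, y, use = "complete.obs")

# Dropping the incomplete pair by hand gives the same value
r_manual <- cor(x[-2], y[-2])

print(r_default)
print(all.equal(r_complete, r_manual))
```
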

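If some groups contain only NA, "complete.obs" throws an error because those groups have no complete cases. Use "na.or.complete" instead, which returns NA for such groups (the output below uses the updated data from the comments, where some groups have only NA):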

xx %>%
    group_by(group) %>%
    summarize(COR=cor(a,b, use = "na.or.complete"))
# A tibble: 5 × 2
  group     COR
  <int>   <dbl>
1     1  0.0345
2     2 -0.397 
3     3  0.150 
4     4  0.376 
5     5 NA     

The same result can be obtained with an if/else condition and "complete.obs":

xx %>%
    group_by(group) %>%
    summarize(COR= if(any(complete.cases(a, b)))
     cor(a,b, use = "complete.obs") else NA_real_)
# A tibble: 5 × 2
  group     COR
  <int>   <dbl>
1     1  0.0345
2     2 -0.397 
3     3  0.150 
4     4  0.376 
5     5 NA   
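
The updated data from the comments is not shown here, so the following is only a sketch under assumptions: yy is a hypothetical data frame with a fifth group whose b values are all NA, and the seed is arbitrary. The correlations will therefore differ from the output above. It illustrates that "complete.obs" fails on the all-NA group while "na.or.complete" returns NA for it.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data: five groups, group 5 has no complete (a, b) pairs
yy <- data.frame(group = rep(1:5, each = 100), a = rnorm(500), b = rnorm(500))
yy[c(1, 14, 33), "b"] <- NA
yy[yy$group == 5, "b"] <- NA

# "complete.obs" raises an error for group 5; catch it to show the failure
res_err <- tryCatch(
  yy %>% group_by(group) %>% summarize(COR = cor(a, b, use = "complete.obs")),
  error = function(e) "error"
)

# "na.or.complete" returns NA for group 5 instead of failing
res <- yy %>%
  group_by(group) %>%
  summarize(COR = cor(a, b, use = "na.or.complete"))

print(res_err)
print(res)
```
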