Using an if-statement inside purrr::map to avoid missing data errors-CodePudding

I'm using purrr to run a series of single linear regressions across multiple columns of a grouped dataset, but am having trouble excluding groups of variables that have no data without deleting the entire group.

Thanks to andrew_reece here, I got the base code working as:

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtcars %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

which gives a tibble in this format:

cyl iv      r.squared       p.value
4   disp    0.6484051396    0.002782827 
4   hp      0.2740558319    0.098398581 
4   drat    0.180           0.193  
4   wt      0.509           0.0137 
4   qsec    0.0557          0.485  
4   vs      0.00238         0.887  
...
6   disp    0.0106260401    0.825929685 
...

Unfortunately, my real dataset is messy and contains multiple group-variable combinations with only NAs, or with less than two real values, which lm can't handle. To show this, here is mtcars with some data in 'disp' replaced with NA. Run through the above code, mtna throws a NA-error.

#create mtcars dataset that will have a cyl group with entirely NA disp
mtna <- mtcars 
mtna$disp[mtna$disp < 147] <- NA

test <- mtna %>% group_by(cyl) %>% summarize(mean = mean(disp))

I tried to deal with this by making the lm conditional, and first using sum(!is.na) to check if there are enough real values to run lm. This allows the lm to run successfully.

mtna %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      tmpvar <- eval(parse(text = paste0("data$", iv)))
      if(sum(!is.na(tmpvar)) < 3) {return(NA)} else {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }}, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

#which gives:
       cyl    iv        r.squared  p.value
 1     4      NA        NA         NA     
 2     6      disp      0.0115     0.840 
 3     6      hp        0.0161     0.786 
 4     6      drat      0.0132     0.807 
...

However, when you look at the results, you can see that the NA has extended to the whole group, including variables other than disp (which is the only one that had missing values). There is now no data related to cyl = 4 at all, even in groups like hp and drat, which had no missing data.

What I was hoping for was something like:

cyl    iv        r.squared      p.value
4      disp      NA             NA 
4      hp        0.2740558319   0.098398581   # Currently missing
4      drat      0.1799791311   0.193450651   # Currently missing     
4      wt        0.5086325963   0.013742782.  # Currently missing
... 
6      disp      0.0106260401   0.825929685 
6      hp        0.0161462379   0.78602021  
...

I suspect this has something to do with the data format - I guess I'm mapping NA across all the results for that group, instead of just that one variable. But I have no idea how to address this. Any help is greatly appreciated!

CodePudding user response：

I think you are almost there. Take a look at the code below. Changes are commented.

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtna <- mtcars 
mtna$disp[mtna$disp < 147] <- NA

mtna  %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      tmpvar<-eval(parse(text = paste0("data$", iv)))
      if(!is.na(sum(tmpvar))) {                          #only use complete data
        frml <- as.formula(paste("mpg", "~", iv))
        lm(frml, data = data) %>% broom::glance()
      }}, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value) %>% 
  right_join(.,expand.grid(cyl=unique(mtna$cyl),iv=ivs),
             by=c("cyl","iv")) %>%                       #populating with NA for columns lost before
  arrange(., cyl, iv) %>%                                #sort by cyl and iv
  as.data.frame()
  

  #which gives:
       cyl   iv    r.squared    p.value
    1    4   am 0.2872892493 0.08921640
    2    4 carb 0.0378466325 0.56650426
    3    4 disp           NA         NA
    4    4 drat 0.1799791311 0.19345065
    5    4 gear 0.1146552225 0.30840242
    6    4   hp 0.2740558319 0.09839858
    7    4 qsec 0.0556742395 0.48487154
    8    4   vs 0.0023819528 0.88668635
    9    4   wt 0.5086325963 0.01374278
    10   6   am 0.2810551424 0.22094408
    11   6 carb 0.0000661434 0.98619369
    12   6 disp           NA         NA
    13   6 drat 0.0131597553 0.80653099
    14   6 gear 0.0000901510 0.98388187
    15   6   hp 0.0161462379 0.78602021
    16   6 qsec 0.1753245893 0.34980138
    17   6   vs 0.2810551424 0.22094408
    18   6   wt 0.4645101506 0.09175766
    19   8   am 0.0024647887 0.86615546
    20   8 carb 0.1550637156 0.16359436
    21   8 disp 0.2701577717 0.05677488
    22   8 drat 0.0022975223 0.87074078
    23   8 gear 0.0024647887 0.86615546
    24   8   hp 0.0804491933 0.32575378
    25   8 qsec 0.0108860059 0.72261712
    26   8   vs 0.0000000000         NA
    27   8   wt 0.4229655365 0.01179281