Home > Blockchain >  How to efficiently run many logistic regressions in R and skip over equations that throw errors?
How to efficiently run many logistic regressions in R and skip over equations that throw errors?

Time:04-05

As a continuation from this question, I want to run many logistic regression equations at once and then note if a group was significantly different from a reference group. This solution works, but it only works when I'm not missing values. Being that my data has 100 equations, it's bound to have missing values, so rather than this solution failing when it hits an error, how can I program it to skip the instances that throw an error?

Here's a modified dataset that's missing cases:

library(dplyr)
library(lubridate)
library(broom)

test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
               time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
               admit = c(500, 1000, 450, 800, 300, 100, 1000),
               reject = c(1000, 300, 1000, 210, 100, 900, 1500)) %>%
  mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
  mutate(accept_rate = admit/total)

And here's the solution that works when it has all cases (see dataset here), but when it hits the 2019 grouping that's missing cases, it fails:

library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)

test %>% 
  # create year column
  mutate(year = year(time), 
         major = relevel(major, "undeclared")) %>% 
  
  # nest by year
  nest(data = -year) %>% 
  
  # compute regression
  mutate(reg = map(data, ~glm(accept_rate ~ major, data = ., 
                              family = binomial, weights = total, na.action = na.exclude)), 
         
         # use broom::tidy to make a tibble out of model object
         reg_tidy = map(reg, tidy)) %>% 
  
  # get data and regression results back to tibble form
  unnest(c(data, reg_tidy)) %>% 
  filter(term != "(Intercept)") %>%
  
  # create the significant yes/no column
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>% 
  
  # remove the unnecessary columns
  select(-c(term, estimate, std.error, statistic, p.value, reg))

I also tried wrapping the solution using the custom functions here, but I also couldn't get it to work. Last, I'm also open to other ideas for a solution if it produces a similar output and is resistant to these errors.

CodePudding user response:

You can ignore the error, using a function like this:

get_model <- function(df) {
  tryCatch(
    glm(accept_rate ~ major, data = df, family = binomial, weights = total, na.action = na.exclude),
    error = function(e) NULL, warning=function(w) NULL)
}

Then, call it in your mutate(reg=map()...) line like this:

  # compute regression
  mutate(reg = map(data, get_model), reg_tidy = map(reg, tidy)) %>% 

Output:

# A tibble: 4 x 8
   year major            time       admit reject total accept_rate significant
  <dbl> <fct>            <date>     <dbl>  <dbl> <dbl>       <dbl> <chr>      
1  2021 computer science 2021-01-01  1000    300  1300       0.769 Yes        
2  2021 english          2021-01-01   450   1000  1450       0.310 No         
3  2020 computer science 2020-01-01   300    100   400       0.75  No         
4  2020 english          2020-01-01   100    900  1000       0.1   Yes

CodePudding user response:

One option would be purrr::safely which allows to take care of errors. To this end I use a helper function glm_safe which wraps your glm call inside purrr::safely. glm_safe will return a list with two elements, result and error. In case everything works fine result will contain the model object, while element is NULL. In case of an error the error message is stored in error and result will be NULL. To use the results in your pipeline we have to extract the result elements which could be achieved via transpose(reg)$result.

library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)

test <- tibble(
  major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
  time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
  admit = c(500, 1000, 450, 800, 300, 100, 1000),
  reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)

test <- test %>%
  mutate(total = rowSums(test[, c("admit", "reject")], na.rm = TRUE)) %>%
  mutate(accept_rate = admit / total)

glm_safe <- purrr::safely(
  function(x) {
    glm(accept_rate ~ major,
      data = x,
      family = binomial, weights = total, na.action = na.exclude
    )
  }
)

test %>%
  # create year column
  mutate(
    year = year(time),
    major = relevel(major, "undeclared")
  ) %>%
  # nest by year
  nest(data = -year) %>%
  # compute regression
  mutate(reg = map(data, glm_safe),
         reg = transpose(reg)$result) |> 
  mutate(reg_tidy = map(reg, tidy)) %>%
  # get data and regression results back to tibble form
  unnest(c(data, reg_tidy)) %>%
  filter(term != "(Intercept)") %>%
  # create the significant yes/no column
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
  # remove the unnecessary columns
  select(-c(term, estimate, std.error, statistic, p.value, reg))
#> # A tibble: 4 × 8
#>    year major            time       admit reject total accept_rate significant
#>   <dbl> <fct>            <date>     <dbl>  <dbl> <dbl>       <dbl> <chr>      
#> 1  2021 computer science 2021-01-01  1000    300  1300       0.769 Yes        
#> 2  2021 english          2021-01-01   450   1000  1450       0.310 No         
#> 3  2020 computer science 2020-01-01   300    100   400       0.75  No         
#> 4  2020 english          2020-01-01   100    900  1000       0.1   Yes
  • Related