As a continuation from this question, I want to run many logistic regression equations at once and then note if a group was significantly different from a reference group. This solution works, but it only works when I'm not missing values. Being that my data has 100 equations, it's bound to have missing values, so rather than this solution failing when it hits an error, how can I program it to skip the instances that throw an error?
Here's a modified dataset that's missing cases:
library(dplyr)
library(lubridate)
library(broom)
test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)) %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
And here's the solution that works when it has all cases (see dataset here), but when it hits the 2019 grouping that's missing cases, it fails:
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test %>%
# create year column
mutate(year = year(time),
major = relevel(major, "undeclared")) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, ~glm(accept_rate ~ major, data = .,
family = binomial, weights = total, na.action = na.exclude)),
# use broom::tidy to make a tibble out of model object
reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
I also tried wrapping the solution using the custom functions here, but I also couldn't get it to work. Last, I'm also open to other ideas for a solution if it produces a similar output and is resistant to these errors.
CodePudding user response:
You can ignore the error, using a function like this:
get_model <- function(df) {
tryCatch(
glm(accept_rate ~ major, data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
Then, call it in your mutate(reg=map()...)
line like this:
# compute regression
mutate(reg = map(data, get_model), reg_tidy = map(reg, tidy)) %>%
Output:
# A tibble: 4 x 8
year major time admit reject total accept_rate significant
<dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
2 2021 english 2021-01-01 450 1000 1450 0.310 No
3 2020 computer science 2020-01-01 300 100 400 0.75 No
4 2020 english 2020-01-01 100 900 1000 0.1 Yes
CodePudding user response:
One option would be purrr::safely
which allows to take care of errors. To this end I use a helper function glm_safe
which wraps your glm
call inside purrr::safely
. glm_safe
will return a list with two elements, result
and error
. In case everything works fine result
will contain the model object, while element
is NULL
. In case of an error the error message is stored in error
and result
will be NULL
. To use the results in your pipeline we have to extract the result
elements which could be achieved via transpose(reg)$result
.
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test <- test %>%
mutate(total = rowSums(test[, c("admit", "reject")], na.rm = TRUE)) %>%
mutate(accept_rate = admit / total)
glm_safe <- purrr::safely(
function(x) {
glm(accept_rate ~ major,
data = x,
family = binomial, weights = total, na.action = na.exclude
)
}
)
test %>%
# create year column
mutate(
year = year(time),
major = relevel(major, "undeclared")
) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, glm_safe),
reg = transpose(reg)$result) |>
mutate(reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
#> # A tibble: 4 × 8
#> year major time admit reject total accept_rate significant
#> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
#> 2 2021 english 2021-01-01 450 1000 1450 0.310 No
#> 3 2020 computer science 2020-01-01 300 100 400 0.75 No
#> 4 2020 english 2020-01-01 100 900 1000 0.1 Yes