As a continuation from this question, I'm trying to efficiently perform many logistic regression in order to generate a column saying if a group differs significantly from my reference group.
When I try to nest my data by just one column, this solution works beautifully. However, now that I need to group by two columns, the code runs, but I cannot change the reference group. I've tried the following:
- Adding a relevel argument (shown below)
- Adding a relevel argument within the custom function itself (also shown below)
- Renaming the desired reference group to start with 'AAA' to trick R into making it the first option
Here's a sample dataset:
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(broom)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
app_deadline = ymd(c(rep("'2021-04-04", 3), rep("'2020-03-23", 3), rep("'2019-05-23", 1))),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
Here's the code that won't let me change the reference level:
#Custom function --note that english has been set as reference level
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg)) %>%
full_join(test2)
#Note that, based on the significance column, it's clear that 'undeclared' is being used as the reference group
Why is this happening? For a solution, I'd prefer if it could be flexible--i.e., not just work for 'english'
but could also be switched to work for 'computer science'
too.
CodePudding user response:
It does respect the relevel()
function, the problem, such as it is, is that the returned results don't match with the major
column. See what happens if you stop at the unnest()
function:
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
tmp <- test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy))
Now, look at major
and term
tmp %>% select(major, term)
# # A tibble: 6 × 2
# major term
# <fct> <chr>
# 1 undeclared "(Intercept)"
# 2 computer science "relevel(major, ref = \"english\")computer science"
# 3 english "relevel(major, ref = \"english\")undeclared"
# 4 undeclared "(Intercept)"
# 5 computer science "relevel(major, ref = \"english\")computer science"
# 6 english "relevel(major, ref = \"english\")undeclared"
You can see that the rows where major
is "english"
are actually for the "undeclared"
parameter estimate. Taking the above result, I think you can capture what you want with the following:
tmp %>%
filter(term != "(Intercept)") %>%
mutate(major = gsub(".*\\)(.*)", "\\1", term)) %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(year, app_deadline, major, time, significant) %>%
full_join(test2)
# Joining, by = c("app_deadline", "major", "time")
# # A tibble: 7 × 9
# year app_deadline major time significant admit reject total accept_rate
# <dbl> <date> <chr> <date> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 2020-03-23 computer science 2020-01-01 Yes 300 100 400 0.75
# 2 2020 2020-03-23 undeclared 2020-01-01 Yes 800 210 1010 0.792
# 3 2021 2021-04-04 computer science 2021-01-01 Yes 1000 300 1300 0.769
# 4 2021 2021-04-04 undeclared 2021-01-01 No 500 1000 1500 0.333
# 5 NA 2021-04-04 english 2021-01-01 NA 450 1000 1450 0.310
# 6 NA 2020-03-23 english 2020-01-01 NA 100 900 1000 0.1
# 7 NA 2019-05-23 undeclared 2019-01-01 NA 1000 1500 2500 0.4