Home > Software engineering >  Why does R ignore relevel when using group_nest()?
Why does R ignore relevel when using group_nest()?

Time:04-06

As a continuation from this question, I'm trying to efficiently perform many logistic regression in order to generate a column saying if a group differs significantly from my reference group.

When I try to nest my data by just one column, this solution works beautifully. However, now that I need to group by two columns, the code runs, but I cannot change the reference group. I've tried the following:

  1. Adding a relevel argument (shown below)
  2. Adding a relevel argument within the custom function itself (also shown below)
  3. Renaming the desired reference group to start with 'AAA' to trick R into making it the first option

Here's a sample dataset:

library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(broom)

test <- tibble(
  major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
  app_deadline = ymd(c(rep("'2021-04-04", 3), rep("'2020-03-23", 3), rep("'2019-05-23", 1))),
  time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
  admit = c(500, 1000, 450, 800, 300, 100, 1000),
  reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)

test2 <- test %>%
  mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
  mutate(accept_rate = admit/total)

Here's the code that won't let me change the reference level:

#Custom function --note that english has been set as reference level
library(tidyr)
library(dplyr)
library(purrr)
library(broom)


get_model_t <- function(df) {
  tryCatch(
    expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
    error = function(e) NULL, warning=function(w) NULL)
}

#putting it altogether--note again that english has been marked as reference level
test2 %>% 
  # create year column
  mutate(year = year(time), 
         major = relevel(major, "english")) %>%
  
  # nest by year
  group_nest(year, app_deadline) %>% 
  
  # compute regression
  mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>% 
  
  
  # get data and regression results back to tibble form
  unnest(c(data, reg_tidy)) %>% 
  filter(term != "(Intercept)") %>%
  
  # create the significant yes/no column
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>% 
  
  # remove the unnecessary columns
  select(-c(term, estimate, std.error, statistic, p.value, reg)) %>%
  full_join(test2)

#Note that, based on the significance column, it's clear that 'undeclared' is being used as the reference group

Why is this happening? For a solution, I'd prefer if it could be flexible--i.e., not just work for 'english' but could also be switched to work for 'computer science' too.

CodePudding user response:

It does respect the relevel() function, the problem, such as it is, is that the returned results don't match with the major column. See what happens if you stop at the unnest() function:

test2 <- test %>%
  mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
  mutate(accept_rate = admit/total)


get_model_t <- function(df) {
  tryCatch(
    expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
    error = function(e) NULL, warning=function(w) NULL)
}

#putting it altogether--note again that english has been marked as reference level
tmp <- test2 %>% 
  # create year column
  mutate(year = year(time), 
         major = relevel(major, "english")) %>%
  
  # nest by year
  group_nest(year, app_deadline) %>% 
  
  # compute regression
  mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>% 
  
  
  # get data and regression results back to tibble form
  unnest(c(data, reg_tidy)) 

Now, look at major and term

tmp %>% select(major, term)
# # A tibble: 6 × 2
#   major            term                                               
#   <fct>            <chr>                                              
# 1 undeclared       "(Intercept)"                                      
# 2 computer science "relevel(major, ref = \"english\")computer science"
# 3 english          "relevel(major, ref = \"english\")undeclared"      
# 4 undeclared       "(Intercept)"                                      
# 5 computer science "relevel(major, ref = \"english\")computer science"
# 6 english          "relevel(major, ref = \"english\")undeclared"      

You can see that the rows where major is "english" are actually for the "undeclared" parameter estimate. Taking the above result, I think you can capture what you want with the following:

tmp %>% 
  filter(term != "(Intercept)") %>% 
  mutate(major = gsub(".*\\)(.*)", "\\1", term)) %>% 
  # create the significant yes/no column
  mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>% 
  
  # remove the unnecessary columns
  select(year, app_deadline, major, time, significant) %>%
  full_join(test2)

# Joining, by = c("app_deadline", "major", "time")
# # A tibble: 7 × 9
#   year app_deadline major            time       significant admit reject total accept_rate
#   <dbl> <date>       <chr>            <date>     <chr>       <dbl>  <dbl> <dbl>       <dbl>
# 1  2020 2020-03-23   computer science 2020-01-01 Yes           300    100   400       0.75 
# 2  2020 2020-03-23   undeclared       2020-01-01 Yes           800    210  1010       0.792
# 3  2021 2021-04-04   computer science 2021-01-01 Yes          1000    300  1300       0.769
# 4  2021 2021-04-04   undeclared       2021-01-01 No            500   1000  1500       0.333
# 5    NA 2021-04-04   english          2021-01-01 NA            450   1000  1450       0.310
# 6    NA 2020-03-23   english          2020-01-01 NA            100    900  1000       0.1  
# 7    NA 2019-05-23   undeclared       2019-01-01 NA           1000   1500  2500       0.4  

  • Related