Home > Blockchain >  How to efficiently run many logistic regressions in R and determine which groups are significantly d
How to efficiently run many logistic regressions in R and determine which groups are significantly d

Time:04-05

I have a longitudinal dataset with acceptance data of people of different declared majors. At each time point (2021, 2020, etc.), I want to see if the acceptance rate of undeclared individuals differ significantly (in either direction) from declared individuals.

I'm eventually going to pipe these results into a plot that puts an asterik if groups differ significantly, but I wanted to know if there's an efficient way to perform these logisitic regressions so I get a column for each in my dataset that says if the group was significantly different than the undeclared students at the same time point.

To illustrate, here's a test dataset:

library(dplyr)
library(lubridate)

test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 3))),
               time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 3))),
               admit = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
               reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)) %>%
  mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE),
         accept_rate = admit/total)

And here's how I would manually perform each regression (but don't want this)

test$major <- relevel(test$major , ref = "undeclared")

just_2021 <- test %>%
  filter(time == '2021-01-01')

m_2021 <- glm(accept_rate ~ major, data = just_2021, weights = total, family = binomial)
summary(m_2021) #english not sig diff from undeclared; CS is sig diff from undeclared

And finally, this is what I'm hoping my dataset looks like:

library(dplyr)
library(lubridate)

answer <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 3))),
               time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 3))),
               admit = c(500, 1000, 450, 800, 300, 100, 1000, 400, 150),
               reject = c(1000, 300, 1000, 210, 100, 900, 1500, 350, 1200)) %>%
  mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE),
         accept_rate = admit/total) %>%
  mutate(dif_than_undeclared_2021 = c(NA_character_, "Yes", "No", rep(NA_character_, 6)),
         dif_than_undeclared_2020 = c(rep(NA_character_, 4), "Yes", "Yes", rep(NA_character_, 3)),
         dif_than_undeclared_2019 = c(rep(NA_character_, 7), "Yes", "Yes"))

answer

I know that purrr can help with iteration, but I don't know if it applies in this case. Any help would be gladly appreciated!

CodePudding user response:

library(broom)
library(tidyr)
library(dplyr)

test %>% 
    # create year column
    mutate(year = year(time), 
           major = relevel(major, "undeclared")) %>% 

    # nest by year
    nest(data = -year) %>% 

    # compute regression
    mutate(reg = map(data, ~glm(accept_rate ~ major, data = ., 
                     family = binomial, weights = total)), 

           # use broom::tidy to make a tibble out of model object
           reg_tidy = map(reg, tidy)) %>% 

    # get data and regression results back to tibble form
    unnest(c(data, reg_tidy)) %>% 
    filter(term != "(Intercept)") %>%

    # create the significant yes/no column
    mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>% 

    # remove the unnecessary columns
    select(-c(term, estimate, std.error, statistic, p.value, reg))
  • Related