Use purrr to run multiple regression models with changing outcomes and then extract residuals-CodePudding

I would like to run some regression models with different y (so the independent variables stay the same for all the models), and then extract the residuals from each of these models and add them to the original data set.

I will use diamonds to show what I came up with:

# In my example, the models are: x or y or z = carat   cut   color   clarity   price

dependent = c("x", "y", "z")

model = function(y, dataset) {
 a = map(
   setNames(y, y), ~ glm(reformulate(termlabels = c("carat", "cut", "color", "clarity", "price"),
                                     response = y),
                         family = gaussian(link = "identity"),
                         data = dataset
   )) 
 
 resids = map_dfr(a, residuals)
 
 df = bind_cols(dataset, resids)

 print(df)

}

model(y = dependent, dataset = diamonds)

But this code doesn't work. I would also like to have sensible names for the residuals that are added as new columns, otherwise it is difficult to differentiate the residuals when the number of models is big.

CodePudding user response：

generate example data

library(tidyverse)

set.seed(101)
dd <- diamonds
dependent <- c("x", "y", "z")
for (d in dependent) {
  dd[[d]] <- rnorm(nrow(diamonds))
}

process

library(broom)
res <- (dependent
  ## set names so .id = argument works downstream
  %>% setNames(dependent)
  ## construct list of formulas
  %>% map(reformulate, termlabels = c("carat", "cut", "color", "clarity", "price"))
  ## fit glmes
  %>% map(glm, family = gaussian(link = "identity"), dd,
          na.action = na.exclude)
  ## compute resids (add observation number) and collapse to tibble
  %>% map_dfr(~tibble(.obs=seq(nrow(dd)), .resid = residuals(.)), .id = "response")
  ## widen data → residuals from each response variable as a column
  %>% pivot_wider(names_from = "response", values_from = ".resid", 
                  names_prefix ="res_")
  %>% select(-.obs)
)
## combine with original data
res2 <- bind_cols(dd, res)

Notes:

it's not obvious to me why you're using glm(., family = gaussian(link = "identity)) here, unless it's as a placeholder to something more complicated you're doing with your real data. (If this is your actual model then using lm() will be simpler and faster.)
adding na.action = na.exclude to the [g]lm() call will include NA values in the predictions, residuals, etc., which will help your residuals line up better with the original data.