I would like to run some regression models with different y
(so the independent variables stay the same for all the models), and then extract the residuals from each of these models and add them to the original data set.
I will use diamonds
to show what I came up with:
# In my example, the models are: x or y or z = carat cut color clarity price
dependent = c("x", "y", "z")
model = function(y, dataset) {
a = map(
setNames(y, y), ~ glm(reformulate(termlabels = c("carat", "cut", "color", "clarity", "price"),
response = y),
family = gaussian(link = "identity"),
data = dataset
))
resids = map_dfr(a, residuals)
df = bind_cols(dataset, resids)
print(df)
}
model(y = dependent, dataset = diamonds)
But this code doesn't work. I would also like to have sensible names for the residuals that are added as new columns, otherwise it is difficult to differentiate the residuals when the number of models is big.
CodePudding user response:
generate example data
library(tidyverse)
set.seed(101)
dd <- diamonds
dependent <- c("x", "y", "z")
for (d in dependent) {
dd[[d]] <- rnorm(nrow(diamonds))
}
process
library(broom)
res <- (dependent
## set names so .id = argument works downstream
%>% setNames(dependent)
## construct list of formulas
%>% map(reformulate, termlabels = c("carat", "cut", "color", "clarity", "price"))
## fit glmes
%>% map(glm, family = gaussian(link = "identity"), dd,
na.action = na.exclude)
## compute resids (add observation number) and collapse to tibble
%>% map_dfr(~tibble(.obs=seq(nrow(dd)), .resid = residuals(.)), .id = "response")
## widen data → residuals from each response variable as a column
%>% pivot_wider(names_from = "response", values_from = ".resid",
names_prefix ="res_")
%>% select(-.obs)
)
## combine with original data
res2 <- bind_cols(dd, res)
Notes:
- it's not obvious to me why you're using
glm(., family = gaussian(link = "identity))
here, unless it's as a placeholder to something more complicated you're doing with your real data. (If this is your actual model then usinglm()
will be simpler and faster.) - adding
na.action = na.exclude
to the[g]lm()
call will includeNA
values in the predictions, residuals, etc., which will help your residuals line up better with the original data.