I want to tag the samples used in multiple regression models with a purrr
function. Drawing on this Q&A, I can accomplish this goal on an ad hoc basis as follows:
library(dplyr)
df <- mtcars %>%
mutate(disp = replace(hp, c(2, 3), NA)) %>%
mutate(wt = replace(wt, c(3, 4, 5), NA))
s1 <- lm(mpg ~ disp, data = df)
df$samp1 <- TRUE
df$samp1[na.action(s1)] <- FALSE
s2 <- lm(mpg ~ wt, data = df)
df$samp2 <- TRUE
df$samp2[na.action(s2)] <- FALSE
How can I add samp1
and samp2
to df
using purrr
?
CodePudding user response:
There should be a more tidyeval way to do this with across
, but it might end up uglier or more convoluted than it's worth. A simple enough way is to make a list of the models with the names you want for your new columns, create a samp*
column for each, and reduce-join into one data frame. This last bit works because you know you have all the same columns to join on.
library(dplyr)
mods <- list(samp1 = s1, samp2 = s2)
df_out <- purrr::imap(mods, function(mod, col) {
df %>%
tibble::rownames_to_column("id") %>%
mutate({{ col }} := id %in% names(na.action(mod)))
}) %>%
purrr::reduce(inner_join)
#> Joining, by = c("id", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
head(df_out)
#> id mpg cyl disp hp drat wt qsec vs am gear carb samp1
#> 1 Mazda RX4 21.0 6 110 110 3.90 2.620 16.46 0 1 4 4 FALSE
#> 2 Mazda RX4 Wag 21.0 6 NA 110 3.90 2.875 17.02 0 1 4 4 TRUE
#> 3 Datsun 710 22.8 4 NA 93 3.85 NA 18.61 1 1 4 1 TRUE
#> 4 Hornet 4 Drive 21.4 6 110 110 3.08 NA 19.44 1 0 3 1 FALSE
#> 5 Hornet Sportabout 18.7 8 175 175 3.15 NA 17.02 0 0 3 2 FALSE
#> 6 Valiant 18.1 6 105 105 2.76 3.460 20.22 1 0 3 1 FALSE
#> samp2
#> 1 FALSE
#> 2 FALSE
#> 3 TRUE
#> 4 TRUE
#> 5 TRUE
#> 6 FALSE
If you wanted to go a heavier tidyeval route, some posts where you might find some leads are How can I use map* and mutate to convert a list into a set of additional columns? and Using mutate(across(...)) with purrr::map
CodePudding user response:
I'm not quite there yet but here's a tidy approach using a custom function:
flag_use <- function(df, model, name) {
mutate(df, {{name}} := !row_number() %in% na.action( {{model}} ))
}
df %>%
flag_use(s1, "samp1") %>%
flag_use(s2, "samp2")
CodePudding user response:
This seems too complicated but it's what I could come up with. (It would be much more efficient to do this without running the linear model itself as part of the pipeline, i.e. just identifying which samples were used -- this might be do-able with model.frame()
and some appropriate joining ...
library(dplyr)
library(purrr)
library(broom)
library(tibble)
## same as before, but also convert rownames to a column
df <- mtcars %>%
mutate(disp = replace(hp, c(2, 3), NA),
wt = replace(wt, c(3, 4, 5), NA)) %>%
rownames_to_column("model")
## (1) set up vector of vars and give it names (for later .id=)
dd <- c("disp", "wt") %>%
setNames(c("samp1", "samp2")) %>%
## (2) construct formulas for lm
map(reformulate, response = "mpg") %>%
## (3) fit the lm
map(lm, data = df) %>%
## (4) generate fitted values
map_dfr(augment, newdata=df, .id="samp") %>%
select(samp, model, .fitted) %>%
## (5) identify which observations were *not* used
mutate(val = !is.na(.fitted)) %>%
## (6) pivot from one long column to two half-length columns
pivot_wider(names_from=samp, values_from=val, id_cols= model) %>%
## (7) add to df
full_join(df, by = "model")
This version does it without running the models.
## helper function: returns logical vector of whether observation
## was included in model frame or not
drop_vec <- function(mf) {
nn <- attr(mf, "na.action")
incl <- rep(TRUE, nrow(mf) length(nn))
incl[nn] <- FALSE
incl
}
## first few bits are the same as above
dd <- c("disp", "wt") %>%
setNames(c("samp1", "samp2")) %>%
map(reformulate, response = "mpg") %>%
## only construct model frames - don't run lm()
map(model.frame, data = df) %>%
## apply helper function
map(drop_vec) %>%
## stick them together
bind_cols(df)
The only thing I don't like about this solution is that the samp
columns end up at the beginning; would have to fuss a bit more to get them as the last columns in the data frame.