Tag regression samples for many models using purrr-CodePudding

I want to tag the samples used in multiple regression models with a purrr function. Drawing on this Q&A, I can accomplish this goal on an ad hoc basis as follows:

library(dplyr)

df <- mtcars %>% 
  mutate(disp = replace(hp, c(2, 3), NA)) %>% 
  mutate(wt = replace(wt, c(3, 4, 5), NA))

s1 <- lm(mpg ~ disp, data = df)
df$samp1 <- TRUE
df$samp1[na.action(s1)] <- FALSE         

s2 <- lm(mpg ~ wt, data = df)
df$samp2 <- TRUE
df$samp2[na.action(s2)] <- FALSE

How can I add samp1 and samp2 to df using purrr?

CodePudding user response：

There should be a more tidyeval way to do this with across, but it might end up uglier or more convoluted than it's worth. A simple enough way is to make a list of the models with the names you want for your new columns, create a samp* column for each, and reduce-join into one data frame. This last bit works because you know you have all the same columns to join on.

library(dplyr)

mods <- list(samp1 = s1, samp2 = s2)

df_out <- purrr::imap(mods, function(mod, col) {
  df %>%
    tibble::rownames_to_column("id") %>%
    mutate({{ col }} := id %in% names(na.action(mod)))
}) %>%
  purrr::reduce(inner_join)
#> Joining, by = c("id", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")

head(df_out)
#>                  id  mpg cyl disp  hp drat    wt  qsec vs am gear carb samp1
#> 1         Mazda RX4 21.0   6  110 110 3.90 2.620 16.46  0  1    4    4 FALSE
#> 2     Mazda RX4 Wag 21.0   6   NA 110 3.90 2.875 17.02  0  1    4    4  TRUE
#> 3        Datsun 710 22.8   4   NA  93 3.85    NA 18.61  1  1    4    1  TRUE
#> 4    Hornet 4 Drive 21.4   6  110 110 3.08    NA 19.44  1  0    3    1 FALSE
#> 5 Hornet Sportabout 18.7   8  175 175 3.15    NA 17.02  0  0    3    2 FALSE
#> 6           Valiant 18.1   6  105 105 2.76 3.460 20.22  1  0    3    1 FALSE
#>   samp2
#> 1 FALSE
#> 2 FALSE
#> 3  TRUE
#> 4  TRUE
#> 5  TRUE
#> 6 FALSE

If you wanted to go a heavier tidyeval route, some posts where you might find some leads are How can I use map* and mutate to convert a list into a set of additional columns? and Using mutate(across(...)) with purrr::map

CodePudding user response：

I'm not quite there yet but here's a tidy approach using a custom function:

flag_use <- function(df, model, name) {
  mutate(df, {{name}} := !row_number() %in% na.action( {{model}} ))
}

df %>%
  flag_use(s1, "samp1") %>%
  flag_use(s2, "samp2")

CodePudding user response：

This seems too complicated but it's what I could come up with. (It would be much more efficient to do this without running the linear model itself as part of the pipeline, i.e. just identifying which samples were used -- this might be do-able with model.frame() and some appropriate joining ...

library(dplyr)
library(purrr)
library(broom)
library(tibble)

## same as before, but also convert rownames to a column
df <- mtcars %>%
  mutate(disp = replace(hp, c(2, 3), NA),
         wt = replace(wt, c(3, 4, 5), NA)) %>%
  rownames_to_column("model")

## (1) set up vector of vars and give it names (for later .id=)
dd <- c("disp", "wt") %>%
  setNames(c("samp1", "samp2")) %>%
## (2) construct formulas for lm
  map(reformulate, response = "mpg") %>%
## (3) fit the lm
  map(lm, data = df) %>%
## (4) generate fitted values
  map_dfr(augment, newdata=df, .id="samp") %>%
  select(samp, model, .fitted) %>%
## (5) identify which observations were *not* used
  mutate(val = !is.na(.fitted)) %>%
## (6) pivot from one long column to two half-length columns
  pivot_wider(names_from=samp, values_from=val, id_cols= model) %>%
## (7) add to df
  full_join(df, by = "model")

This version does it without running the models.

## helper function: returns logical vector of whether observation
## was included in model frame or not
drop_vec <- function(mf) {
  nn <- attr(mf, "na.action")
  incl <- rep(TRUE, nrow(mf)   length(nn))
  incl[nn] <- FALSE
  incl
}

## first few bits are the same as above
dd <- c("disp", "wt") %>%
  setNames(c("samp1", "samp2")) %>%
  map(reformulate, response = "mpg") %>%
## only construct model frames - don't run lm()
  map(model.frame, data = df) %>%
## apply helper function
  map(drop_vec) %>%
## stick them together
  bind_cols(df)

The only thing I don't like about this solution is that the samp columns end up at the beginning; would have to fuss a bit more to get them as the last columns in the data frame.