Using a reference table of variable names in a function and working row-wise

I have a long list of time diff calculations I want to make. Given the names of the columns are long and contain spaces (pulled from an API) I've created a table of the variable names to be used for each operation:

stage_refs <- structure(list(
  new.diff.var = c("time diff 1", "time diff 2", "time diff 3", "time diff 4"),
  var.1 = c("time value 2", "time value 3", "time value 4", "time value 5"),
  var.2 = c("time value 1", "time value 2", "time value 3", "time value 4")
), row.names = c(NA, -4L), spec = structure(list(
  cols = list(
    new.diff.var = structure(list(), class = c("collector_character", "collector")),
    var.1 = structure(list(), class = c("collector_character", "collector")),
    var.2 = structure(list(), class = c("collector_character", "collector"))
  ),
  default = structure(list(), class = c("collector_guess", "collector")),
  delim = ","
), class = "col_spec"), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))

And a data frame of dates, with missing values where no date has been recorded yet:

date_values <- structure(list(`time value 1` = structure(c(18993, 18993, 18993, 
NA), class = "Date"), `time value 2` = structure(c(19024, NA, 
19026, 19027), class = "Date"), `time value 3` = structure(c(NA, 
19084, 19085, 19086), class = "Date"), `time value 4` = structure(c(19113, 
19114, NA, 19116), class = "Date"), `time value 5` = structure(c(19174, 
19175, 19176, 19177), class = "Date")), row.names = c(NA, -4L
), class = c("tbl_df", "tbl", "data.frame"))

In the production data set there are up to 20 time diff calculations to make, which is why I've put the names in a table, so that I can use a function such as:

library(tidyverse)

difftime_fun <- function(x, y, z) {
  date_values |>
    mutate(!!x = difftime(y, z))
}

... to create a new column from x by the calculation between y & z and then apply this function row-wise over stage_refs using a loop:

for (i in 1:nrow(stage_refs)) {
  difftime_fun(
    stage_refs$new.diff.var[[i]],
    stage_refs$var.1[[i]],
    stage_refs$var.2[[i]]
  )
}

What I want is as many timediff columns added to date_values as rows in stage_refs using the variable names from stage_refs for assignment of the cols and for the calculations.

There are 2 places I'm stuck:

1. I'm running into non-standard-evaluation problems with using the variable names in the function. I've tried combinations of !!, {{ }}, eval etc. and can't make sense of it.
2. I suspect there's a better way than the loop to go row-wise, using apply or some such, but not having solved (1) makes it hard to trial-and-error with apply.

I have looked at this solution, but can't make sense as to how to use for this problem.

Thanks.

For clarification the final df would have the following columns:

[1] "time value 1" "time value 2" "time value 3" "time value 4" "time value 5" "time diff 1" 
[7] "time diff 2"  "time diff 3"  "time diff 4"

CodePudding user response：

To achieve your desired result you could first rewrite your function by adding a data argument and using the .data pronoun, which lets you access data columns whose names are passed as character strings. After that you could, for example, use purrr::reduce or base Reduce to loop over the rows of your stage_refs data frame and apply difftime_fun to create each diff time column in turn.

Note: In row 4 of your stage_refs a column "time value 7" was specified which was not part of your example data. I simply replaced it by "time value 4".

library(tidyverse)

difftime_fun <- function(.data, x, y, z) {
  .data |>
    mutate("{x}" := difftime(.data[[y]], .data[[z]]))
}

purrr::reduce(seq(nrow(stage_refs)), function(.x, i) {
  difftime_fun(.x,
    stage_refs$new.diff.var[[i]],
    stage_refs$var.1[[i]],
    stage_refs$var.2[[i]]
  )
}, .init = date_values)
#> # A tibble: 4 × 9
#>   time val…¹ time val…² time val…³ time val…⁴ time val…⁵ time …⁶ time …⁷ time …⁸
#>   <date>     <date>     <date>     <date>     <date>     <drtn>  <drtn>  <drtn> 
#> 1 2022-01-01 2022-02-01 NA         2022-05-01 2022-07-01 31 days NA days NA days
#> 2 2022-01-01 NA         2022-04-02 2022-05-02 2022-07-02 NA days NA days 30 days
#> 3 2022-01-01 2022-02-03 2022-04-03 NA         2022-07-03 33 days 59 days NA days
#> 4 NA         2022-02-04 2022-04-04 2022-05-04 2022-07-04 NA days 59 days 30 days
#> # … with 1 more variable: `time diff 4` <drtn>, and abbreviated variable names
#> #   ¹`time value 1`, ²`time value 2`, ³`time value 3`, ⁴`time value 4`,
#> #   ⁵`time value 5`, ⁶`time diff 1`, ⁷`time diff 2`, ⁸`time diff 3`
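
For comparison, the same idea can be sketched with a plain base R loop, since `[[` accepts column names given as strings and so sidesteps non-standard evaluation entirely. The small stand-in data below is an assumption made for illustration (it only mirrors the naming scheme from the question), and `result` is a hypothetical name:

```r
# Stand-in data (assumed for illustration); names mirror the question's scheme.
date_values <- data.frame(
  `time value 1` = as.Date(c("2022-01-01", NA)),
  `time value 2` = as.Date(c("2022-02-01", "2022-02-04")),
  `time value 3` = as.Date(c(NA, "2022-04-04")),
  check.names = FALSE  # keep the spaces in the column names
)
stage_refs <- data.frame(
  new.diff.var = c("time diff 1", "time diff 2"),
  var.1 = c("time value 2", "time value 3"),
  var.2 = c("time value 1", "time value 2")
)

# Add one diff column per row of stage_refs, looking columns up by name.
result <- date_values
for (i in seq_len(nrow(stage_refs))) {
  new_col <- stage_refs$new.diff.var[[i]]
  result[[new_col]] <- difftime(
    result[[stage_refs$var.1[[i]]]],
    result[[stage_refs$var.2[[i]]]],
    units = "days"
  )
}

names(result)
```

Unlike the loop in the question, this one assigns each new column back into `result`, so the columns accumulate across iterations.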

CodePudding user response：

Another option is to build the difftime() calls from your reference data frame with pmap() and splice them into a single mutate() using !!!. This should be more efficient than calling mutate() once per row of stage_refs.

library(purrr)
library(dplyr)

date_values %>%
  mutate(!!!setNames(pmap(stage_refs, ~ call("difftime", sym(..2), sym(..3))), stage_refs$new.diff.var))

# A tibble: 4 × 9
  `time value 1` `time value 2` `time value 3` `time value 4` `time value 5` `time diff 1` `time diff 2` `time diff 3` `time diff 4`
  <date>         <date>         <date>         <date>         <date>         <drtn>        <drtn>        <drtn>        <drtn>       
1 2022-01-01     2022-02-01     NA             2022-05-01     2022-07-01     31 days       NA days       NA days       61 days      
2 2022-01-01     NA             2022-04-02     2022-05-02     2022-07-02     NA days       NA days       30 days       61 days      
3 2022-01-01     2022-02-03     2022-04-03     NA             2022-07-03     33 days       59 days       NA days       NA days      
4 NA             2022-02-04     2022-04-04     2022-05-04     2022-07-04     NA days       59 days       30 days       61 days
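
To see what the splice hands to mutate(), the calls can be built and inspected on their own before they are used. This is a minimal sketch on assumed stand-in data (not the full example from the question); it uses base `Map()` in place of `pmap()`, and `diff_calls` and `result` are hypothetical names:

```r
library(rlang)
library(dplyr)

# Stand-in data (assumed for illustration); names mirror the question's scheme.
date_values <- data.frame(
  `time value 1` = as.Date(c("2022-01-01", NA)),
  `time value 2` = as.Date(c("2022-02-01", "2022-02-04")),
  `time value 3` = as.Date(c(NA, "2022-04-04")),
  check.names = FALSE  # keep the spaces in the column names
)
stage_refs <- data.frame(
  new.diff.var = c("time diff 1", "time diff 2"),
  var.1 = c("time value 2", "time value 3"),
  var.2 = c("time value 1", "time value 2")
)

# One unevaluated difftime() call per row; sym() turns each string into a
# symbol, so column names containing spaces are handled safely.
diff_calls <- Map(
  function(v1, v2) call("difftime", sym(v1), sym(v2)),
  stage_refs$var.1,
  stage_refs$var.2
)
# The list names become the new column names once spliced into mutate().
names(diff_calls) <- stage_refs$new.diff.var

# Printing an element shows the expression that mutate() will evaluate.
print(diff_calls[["time diff 1"]])

# Splice the named calls in as if they had been typed out as arguments.
result <- mutate(date_values, !!!diff_calls)
names(result)
```

Keeping the calls in a named list makes it easy to check a single expression when one of the 20 production columns misbehaves.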