Home > database >  How to store both function and its input data inside designated tibble columns, then iterate over ro
How to store both function and its input data inside designated tibble columns, then iterate over ro

Time:12-02

I'm trying to run a data wrangling procedure inside a tibble using tools from {purrr} package. My method is to organize everything I need inside a tibble:

  • the input data inside a column
  • the function to apply upon the input data gets its own column too

My problem: how can I use purrr's mapping functions to say "take the function stored in column x and apply it over the data in column y"?

Below is a minimal example, based on mtcars and iris. I want to summarise each data set, in the same workflow: first subset columns, then do some aggregation. For the aggregation part, I preemptively set up 2 functions, one for each data.

  • summarise_iris()
  • summarise_mtcars()

Then I organize all I need inside a tibble (see trb object below).

The first part, the subsetting, works well. As can be seen in trb_1 below, dat_selected is a new column I mutated, which stores the output of the subset step.

However, the second part is not working. I want to take the function in column summarise_func and apply it over the data stored in column dat_selected. But it's not working. Why not? I purposely used map() because it maps only 1 input to the function.

library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)

summarise_iris <- function(.dat) {
  .dat %>%
    group_by(Species) %>%
    summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}
# to test: iris %>% summarise_iris()

summarise_mtcars <- function(.dat) {
  .dat %>%
    group_by(am) %>%
    summarise(mpg_median = median(mpg))
}
# to test: mtcars %>% summarise_mtcars()

trb <- 
  tribble(~original_data, ~cols_to_select,                               ~summarise_func,
          mtcars,         c("am", "disp", "mpg"),                        ~summarise_mtcars(.),
          iris,           c("Species", "Sepal.Length", "Sepal.Width"),   ~summarise_iris(.)
  )

trb_1 <- 
  trb %>%
  mutate(dat_selected   = map2(.x = original_data, .y = cols_to_select, .f = ~select(.x, all_of(.y)))) 

trb_1
#> # A tibble: 2 x 4
#>   original_data  cols_to_select summarise_func dat_selected  
#>   <list>         <list>         <list>         <list>        
#> 1 <df [32 x 11]> <chr [3]>      <formula>      <df [32 x 3]> 
#> 2 <df [150 x 5]> <chr [3]>      <formula>      <df [150 x 3]>

trb_1 %>%
  mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))
#> Error: Problem with `mutate()` column `dat_summarised`.
#> i `dat_summarised = map(.x = dat_selected, .f = summarise_func)`.
#> x Index 1 must have length 1, not 2

Created on 2021-12-02 by the reprex package (v2.0.1.9000)


How can I achieve the desired output (see below) using the in-table method I'm trying to incorporate? I.e.:

trb_1 %>%
  mutate(dat_summarised = map(.x = dat_selected, .f = summarise_func))

## to give the desired output that's equivalent to what we get if we run:
summar_mtcars <- mtcars %>% summarise_mtcars()
summar_iris   <- iris %>% summarise_iris()

trb_1 %>%
  tibble::add_column(dat_summarised = list(summar_mtcars, summar_iris))

## # A tibble: 2 x 5
##   original_data  cols_to_select summarise_func dat_selected   dat_summarised  
##   <list>         <list>         <list>         <list>         <list>          
## 1 <df [32 x 11]> <chr [3]>      <formula>      <df [32 x 3]>  <tibble [2 x 2]>
## 2 <df [150 x 5]> <chr [3]>      <formula>      <df [150 x 3]> <tibble [3 x 3]>

UPDATE


I don't know if the following is in the right direction, but based on this answer, I thought to utilize rlang::as_function() such that:

trb_1 %>%
  mutate(dat_summarised = map(.x = dat_selected, .f = ~rlang::as_function(summarise_func)))

But it gives a different error now:

x Can't convert a list to function

CodePudding user response:

I would store the functions as strings:

trb <- 
  tribble(~original_data, ~cols_to_select,                               ~summarise_func,
          mtcars,         c("am", "disp", "mpg"),                        "summarise_mtcars",
          iris,           c("Species", "Sepal.Length", "Sepal.Width"),   "summarise_iris"
  )

Then you can simply use do.call in your map call. Or you convert your functions to strings on the fly with mutate:

trb_2 <- trb_1 %>%
  mutate(summarise_func = as.character(summarise_func)) %>% 
  mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x, args = list(.dat = .y))))

trb_2
#> # A tibble: 2 × 5
#>   original_data  cols_to_select summarise_func   dat_selected   dat_summarised  
#>   <list>         <list>         <chr>            <list>         <list>          
#> 1 <df [32 × 11]> <chr [3]>      summarise_mtcars <df [32 × 3]>  <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]>      summarise_iris   <df [150 × 3]> <tibble [3 × 3]>

Created on 2021-12-02 by the reprex package (v2.0.1)

Update: Storing functions or rather function names as strings can be problematic if the underlying function changes (I get that now). The problem is getting the function into the tibble in the first place. What you do in the question is storing it as a formula. A better way is (imo) to store it in a list column:

trb <- 
  tribble(~original_data, ~cols_to_select,                               ~summarise_func,
          mtcars,         c("am", "disp", "mpg"),                        list(fun = summarise_mtcars),
          iris,           c("Species", "Sepal.Length", "Sepal.Width"),   list(fun = summarise_iris)
  )

With a slight adaptation, this original answer then works like this:

trb_3 <- trb_1 %>%
  mutate(dat_summarised = map2(summarise_func, original_data, ~ do.call(what = .x$fun, args = list(.dat = .y))))


trb_3
#> # A tibble: 2 × 5
#>   original_data  cols_to_select summarise_func   dat_selected   dat_summarised  
#>   <list>         <list>         <list>           <list>         <list>          
#> 1 <df [32 × 11]> <chr [3]>      <named list [1]> <df [32 × 3]>  <tibble [2 × 2]>
#> 2 <df [150 × 5]> <chr [3]>      <named list [1]> <df [150 × 3]> <tibble [3 × 3]>

Created on 2021-12-02 by the reprex package (v2.0.1)

CodePudding user response:

I think you can take a simpler approach. First, we don't need to select columns, that's inherent to summarize anyway. Let's create columns that define the columns to group by, the columns to summarize, and functions to use.

library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)


trb <- 
  tribble(~original_data, ~cols_to_group, ~cols_to_summarize,    ~summarise_func,
          mtcars,         "am",           "mpg",                 \(x) mean(x, na.rm = T),
          iris,           "Species",      ~starts_with("Sepal"), median
  )

The \(x) mean(x, na.rm = TRUE) syntax is the new anonymous function syntax in R 4.1. If using an earlier version, just change to function(x) mean(...)

Now we can define a function (to eventually use in pmap that accepts the data, grouping columns, columns to analyse, and the summarize functions.

summarize_fun <- function(
  .dat, .group_cols, .summ_cols, .funs
) {
  .dat %>%
    group_by(across(!!.group_cols)) %>%
    summarize(across(!!.summ_cols, .funs))

}

And now we can just use these within mutate(pmap(...)) to get the result we want. I rely on !! for unquoting expressions because that works for passing in things like ~starts_with("Sepal"), which don't work with {{ }} to my knowledge.

trb_final <- trb %>%
  mutate(dat_summarized = pmap(
    list(
      .dat=original_data,
      .group_cols=cols_to_group,
      .summ_cols=cols_to_summarize,
      .funs=summarise_func
    ),
    summarize_fun
  ))

trb_final
#> # A tibble: 2 × 5
#>   original_data  cols_to_group cols_to_summarize summarise_func dat_summarized  
#>   <list>         <chr>         <list>            <list>         <list>          
#> 1 <df [32 × 11]> am            <chr [1]>         <fn>           <tibble [2 × 2]>
#> 2 <df [150 × 5]> Species       <formula>         <fn>           <tibble [3 × 3]>

trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#>      am   mpg
#>   <dbl> <dbl>
#> 1     0  17.1
#> 2     1  24.4
#> 
#> [[2]]
#> # A tibble: 3 × 3
#>   Species    Sepal.Length Sepal.Width
#>   <fct>             <dbl>       <dbl>
#> 1 setosa              5           3.4
#> 2 versicolor          5.9         2.8
#> 3 virginica           6.5         3

General functions

If instead as in the comments, we want just to apply generic functions to summarize, then just rely on pmap with 2 arguments, the data and the summarizing function.

my_summarize <- function(
  .dat, .func
) {
  .func(.dat)
}


summarize_mtcars <- function(.dat) {
  .dat %>%
    group_by(am) %>%
    summarise(mpg_median = median(mpg))
}

summarize_iris <- function(.dat) {
  .dat %>%
    group_by(Species) %>%
    summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
}

Now we can just define our data frame to analyze using the original data and the two summarize_... functions we defined for the datasets.

trb <- 
  tribble(~original_data, ~summarize_func,
          mtcars,         summarize_mtcars,
          iris,           summarize_iris
  )

And then just use pmap as before (can also use map2 of course).

trb_final <- trb %>%
  mutate(dat_summarized = pmap(
    list(
      .dat=original_data,
      .func=summarize_func
    ),
    my_summarize
  ))

trb_final
#> # A tibble: 2 × 3
#>   original_data  summarize_func dat_summarized  
#>   <list>         <list>         <list>          
#> 1 <df [32 × 11]> <fn>           <tibble [2 × 2]>
#> 2 <df [150 × 5]> <fn>           <tibble [3 × 3]>

trb_final$dat_summarized
#> [[1]]
#> # A tibble: 2 × 2
#>      am mpg_median
#>   <dbl>      <dbl>
#> 1     0       17.3
#> 2     1       22.8
#> 
#> [[2]]
#> # A tibble: 3 × 3
#>   Species    Sepal.Length Sepal.Width
#>   <fct>             <dbl>       <dbl>
#> 1 setosa             5.01        3.43
#> 2 versicolor         5.94        2.77
#> 3 virginica          6.59        2.97
  • Related