Create a function with dplyr with two env-variables

I am trying to write functions that take two env-variables (two data frames). The "Programming with dplyr" vignette (https://dplyr.tidyverse.org/articles/programming.html) has multiple examples with one env-variable and multiple data-variables, but none with two env-variables. I could not find a solution in Advanced R (https://adv-r.hadley.nz/) either.

As an example, I start with two data frames. First, I want to join them. Then I want to compute some summary statistics. I want to create a function that can do the work. Note that the number of grouping variables (such as state and people) may change depending on the example. Additionally, the variables that are being summed (such as sales and profit) may also change.

# I need a function
Compute = function(df1, df2, grp_vars, compute_vars) {code}


# An interactive solution: 
library(dplyr)


sales_data = data.frame(staffID = rep(1:5, each = 5),
                 state = c(rep('Cal', 13), rep('Wash', 12)),
                 sales = 101:125,
                 profit = 11:35
                 )

sales_data

staff = data.frame(staffID = 1:5,
                   people = c('Al', 'Barb', 'Carol', 'Dave', 'Ellen'))

staff

res1 = sales_data %>% inner_join(staff, by = 'staffID')
res1

res2 = res1 %>% 
  group_by(state, people) %>%
  summarize(total_sales = sum(sales), total_profit = sum(profit))
res2

If I only needed to summarize the data, this would work:

# From Programming with dplyr
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>% 
    summarise(across({{ summarise_var }}, sum, .names = "sum_{.col}"))
}

my_summarise(res1, c(state, people), c(sales, profit))

Summary: I need a function, Compute = function(df1, df2, grp_vars, compute_vars) {code}, that first joins two data frames, where both the joining/grouping variables and the computed variables are selected by the user, and then computes the totals and returns the results.

Answer:

You could add an argument, by, to your function definition and perform the join inside the function:

library(dplyr)

compute <- function(df1, df2, by, grp_vars, compute_vars) {
  res1 <- df1 %>% 
    inner_join(df2, by = by)  
  
  res1 %>%
    group_by(across({{ grp_vars }})) %>% 
    summarise(across({{ compute_vars }}, sum, .names = "sum_{.col}"), .groups = "drop")
}

compute(sales_data, staff, 'staffID', c(state, people), c(sales, profit))
#> # A tibble: 6 × 4
#>   state people sum_sales sum_profit
#>   <chr> <chr>      <int>      <int>
#> 1 Cal   Al           515         65
#> 2 Cal   Barb         540         90
#> 3 Cal   Carol        336         66
#> 4 Wash  Carol        229         49
#> 5 Wash  Dave         590        140
#> 6 Wash  Ellen        615        165
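
Since the question notes that the number of grouping and summed variables may change, here is a minimal sketch of how the same function might be called with a single variable of each kind. It repeats the example data and compute() so that it runs on its own. The second function, compute_chr, is a hypothetical variant (not part of the original answer) for the case where the column names arrive as character vectors; it assumes tidyselect's all_of() is an acceptable replacement for the embracing operator in that situation.

```r
library(dplyr)

# Same example data as in the question
sales_data <- data.frame(
  staffID = rep(1:5, each = 5),
  state   = c(rep("Cal", 13), rep("Wash", 12)),
  sales   = 101:125,
  profit  = 11:35
)
staff <- data.frame(
  staffID = 1:5,
  people  = c("Al", "Barb", "Carol", "Dave", "Ellen")
)

# compute() as in the answer, repeated so this block is self-contained
compute <- function(df1, df2, by, grp_vars, compute_vars) {
  df1 %>%
    inner_join(df2, by = by) %>%
    group_by(across({{ grp_vars }})) %>%
    summarise(across({{ compute_vars }}, sum, .names = "sum_{.col}"),
              .groups = "drop")
}

# One grouping variable and one summed variable: no c() needed
by_state <- compute(sales_data, staff, "staffID", state, sales)
by_state
# sum_sales is 1391 for Cal and 1434 for Wash

# Hypothetical variant: column names are passed as character vectors,
# so all_of() selects them instead of {{ }}
compute_chr <- function(df1, df2, by, grp_vars, compute_vars) {
  df1 %>%
    inner_join(df2, by = by) %>%
    group_by(across(all_of(grp_vars))) %>%
    summarise(across(all_of(compute_vars), sum, .names = "sum_{.col}"),
              .groups = "drop")
}

by_state_people <- compute_chr(sales_data, staff, "staffID",
                               c("state", "people"), c("sales", "profit"))
by_state_people
```

The character-vector form can be convenient when the variable names come from a configuration file or another function, while the embraced form keeps the interactive, unquoted style of the original call.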