Writing an R function which only subsets when stated

I am trying to write a function that pulls out the mean, min and max of a specific column (depth) from a data frame. The data can be classed by two categorical variables: in the function the data are grouped by the type variable, and the other variable is the year of collection, either 2020 or 2021. By default I want the function to use the data for all years, and to subset by year only when a year is given as an argument. It would also be nice if I could change the summarised variable (e.g. length instead of depth). Here is my code:

analysis <- function(data = measurements, yearX = 2020){
  data %>%
    subset(year == yearX) %>% # Subsets the dataset by a specific year
    group_by(type) %>% # Groups the data by type
    summarise(mBD = mean(depth), sdBD = sd(depth), minBD = min(depth), maxBD = max(depth), median = median(depth), range = (max(depth) - min(depth)))
}

CodePudding user response:

One option to achieve your desired result may look like so:

set.seed(123)

measurements <- data.frame(
  year = rep(2020:2021, each = 10),
  type = rep(c("A", "B")),
  length = runif(20),
  depth = runif(20)
)

library(dplyr)

analysis <- function(data = measurements, x, yearX = NULL) {
  # Subset by year if given
  if (!is.null(yearX)) data <- filter(data, year %in% yearX) 
  data %>%
    group_by(type) %>%
    summarise(across({{x}}, .fns = list(
      mBD = mean, 
      sdBD = sd, 
      minBD = min, 
      maxBD = max, 
      median = median, 
      range = ~ diff(range(.x))), .names = "{.fn}"
      ))
}

analysis(x = depth)
#> # A tibble: 2 × 7
#>   type    mBD  sdBD  minBD maxBD median range
#>   <chr> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
#> 1 A     0.577 0.290 0.0246 0.963  0.648 0.938
#> 2 B     0.576 0.299 0.147  0.994  0.643 0.847

analysis(measurements, depth, 2020)
#> # A tibble: 2 × 7
#>   type    mBD  sdBD minBD maxBD median range
#>   <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#> 1 A     0.604 0.217 0.289 0.890  0.641 0.600
#> 2 B     0.627 0.307 0.147 0.994  0.693 0.847

analysis(measurements, length, 2021)
#> # A tibble: 2 × 7
#>   type    mBD  sdBD  minBD maxBD median range
#>   <chr> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
#> 1 A     0.462 0.348 0.103  0.957  0.328 0.854
#> 2 B     0.584 0.370 0.0421 0.955  0.573 0.912
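
The part that makes the year filter optional is the `NULL` default combined with the `is.null()` check. As a minimal base R sketch of just that idiom (the helper `subset_year` and the data frame `toy` are made-up names for illustration, not part of the answer above):

```r
# Hypothetical helper: return every row unless `yearX` is supplied
subset_year <- function(data, yearX = NULL) {
  # No year given: leave the data untouched
  if (is.null(yearX)) return(data)
  # Year(s) given: keep only the matching rows
  data[data$year %in% yearX, , drop = FALSE]
}

# Made-up toy data, only to exercise the helper
toy <- data.frame(year = c(2020, 2020, 2021), depth = c(1.5, 2.5, 4))

n_all  <- nrow(subset_year(toy))        # no year given, all 3 rows kept
n_2021 <- nrow(subset_year(toy, 2021))  # only the single 2021 row kept
```

Because the check uses `%in%`, the same argument should also accept several years at once, e.g. `subset_year(toy, c(2020, 2021))`.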

CodePudding user response:

You can refer to a flexible variable within a function in dplyr by wrapping it in double curly brackets ({{ }}). And rather than thinking about whether or not to filter by year, it's probably easier to always filter by year but set the default to all years (i.e. nothing is actually filtered out). I've also changed your == to %in% to allow more flexibility in the year input.
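
To illustrate why `%in%` is the more flexible choice for the year input, here is a small base R sketch. The `year` vector is made up purely for illustration.

```r
# Made-up year values, only for illustration
year <- c(2019, 2020, 2021, 2020)

# With a single year, `==` and `%in%` agree
single_eq <- year == 2020
single_in <- year %in% 2020
identical(single_eq, single_in)
#> [1] TRUE

# With several years, `%in%` tests membership in the set
multi_in <- year %in% c(2020, 2021)
multi_in
#> [1] FALSE  TRUE  TRUE  TRUE

# `==` instead recycles the right-hand side element by element,
# so here it silently matches nothing
multi_eq <- year == c(2020, 2021)
multi_eq
#> [1] FALSE FALSE FALSE FALSE
```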

library(dplyr)
set.seed(1)
measurements <-
  tibble(
    year = rep(2012:2020, each = 5),
    depth = rnorm(45),
    length = rnorm(45),
    type = rep(c("A", "B", "C"), 15)
  )
all_years <- unique(measurements$year)

analysis <- function(data = measurements, summary_var = depth, yearX = all_years){
  df <-
    data %>%
      filter(year %in% yearX) %>% 
      group_by(type)%>% 
      summarise(mBD = mean({{summary_var}}), 
                sdBD = sd({{summary_var}}), 
                minBD = min({{summary_var}}), 
                maxBD = max({{summary_var}}), 
                median = median({{summary_var}}), 
                range = (max({{summary_var}}) - min({{summary_var}})))
  return(df)
}

Some examples of output:

analysis()
#> # A tibble: 3 x 7
#>   type      mBD  sdBD minBD maxBD  median range
#>   <chr>   <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
#> 1 A      0.241  0.832 -1.47  1.60  0.487   3.07
#> 2 B     -0.0320 0.874 -2.21  1.51 -0.0162  3.73
#> 3 C      0.0467 0.888 -1.99  1.12  0.388   3.11

analysis(summary_var = length, yearX = 2018)
#> # A tibble: 3 x 7
#>   type       mBD   sdBD    minBD    maxBD   median range
#>   <chr>    <dbl>  <dbl>    <dbl>    <dbl>    <dbl> <dbl>
#> 1 A      0.183    0.154  0.0743   0.291    0.183   0.217
#> 2 B     -0.516    0.103 -0.590   -0.443   -0.516   0.146
#> 3 C      0.00111 NA      0.00111  0.00111  0.00111 0

analysis(summary_var = length, yearX = c(2018:2020))
#> # A tibble: 3 x 7
#>   type     mBD  sdBD  minBD maxBD  median range
#>   <chr>  <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl>
#> 1 A      0.104 0.354 -0.304 0.594 0.0743  0.898
#> 2 B      0.170 0.713 -0.590 1.18  0.333   1.77 
#> 3 C     -0.152 0.966 -1.52  1.06  0.00111 2.59
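
One thing to be aware of: the `all_years` default is computed from the global `measurements` object, so it may not match if a different data frame is passed as `data`. A possible variation, sketched here under the assumption that the data always has `year` and `type` columns, derives the default from the `data` argument itself. This relies on R evaluating default arguments lazily inside the function. The names `analysis2` and `toy` are hypothetical, and the summary is shortened to two columns to keep the sketch small.

```r
library(dplyr)

# Hypothetical variant: the default years come from whatever `data` is passed in
analysis2 <- function(data, summary_var, yearX = unique(data$year)) {
  data %>%
    filter(year %in% yearX) %>%   # with the default, no rows are dropped
    group_by(type) %>%
    summarise(mean = mean({{ summary_var }}), n = n())
}

# Made-up toy data, only to exercise the function
toy <- tibble(
  year  = c(2020, 2020, 2021, 2021),
  type  = c("A", "B", "A", "B"),
  depth = c(1, 2, 3, 4)
)

res_all  <- analysis2(toy, depth)        # all years: means 2 (A) and 3 (B)
res_2021 <- analysis2(toy, depth, 2021)  # 2021 only: means 3 (A) and 4 (B)
```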