I am trying to write a function that pulls out the mean, min, max and similar summaries from a data frame for a specific column (depth). The data can be classed by two categorical variables. One is type, which the function groups by. The other is the year the data were collected, either 2020 or 2021. By default I want the function to use the data for all years, and to subset by year only when a year is given as an argument. It would also be nice if I could change the summarised variable (e.g. length instead of depth). Here is my code:
analysis <- function(data = measurements, yearX = 2020){
  data %>%
    subset(year == yearX) %>% # Subsets the dataset by specific year
    group_by(type) %>% # Groups the data by type
    summarise(mBD = mean(depth), sdBD = sd(depth), minBD = min(depth), maxBD = max(depth), median = median(depth), range = (max(depth) - min(depth)))
}
CodePudding user response:
One option to achieve your desired result may look like so:
set.seed(123)
measurements <- data.frame(
  year = rep(2020:2021, each = 10),
  type = rep(c("A", "B")),
  length = runif(20),
  depth = runif(20)
)
library(dplyr)
analysis <- function(data = measurements, x, yearX = NULL) {
  # Subset by year if given
  if (!is.null(yearX)) data <- filter(data, year %in% yearX)
  data %>%
    group_by(type) %>%
    summarise(across({{x}}, .fns = list(
      mBD = mean,
      sdBD = sd,
      minBD = min,
      maxBD = max,
      median = median,
      range = ~ diff(range(.x))), .names = "{.fn}"
    ))
}
analysis(x = depth)
#> # A tibble: 2 × 7
#> type mBD sdBD minBD maxBD median range
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.577 0.290 0.0246 0.963 0.648 0.938
#> 2 B 0.576 0.299 0.147 0.994 0.643 0.847
analysis(measurements, depth, 2020)
#> # A tibble: 2 × 7
#> type mBD sdBD minBD maxBD median range
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.604 0.217 0.289 0.890 0.641 0.600
#> 2 B 0.627 0.307 0.147 0.994 0.693 0.847
analysis(measurements, length, 2021)
#> # A tibble: 2 × 7
#> type mBD sdBD minBD maxBD median range
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.462 0.348 0.103 0.957 0.328 0.854
#> 2 B 0.584 0.370 0.0421 0.955 0.573 0.912
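Because across() takes a tidyselect expression, the same pattern could in principle summarise several columns in one call, provided the output names include the column. The following is only a sketch under that assumption: analysis_multi and the "{.col}_{.fn}" naming scheme are hypothetical and not part of the answer above, and the toy data simply mirrors the shape of the example data.

```r
library(dplyr)

# Toy data in the same shape as the example above (values are arbitrary)
set.seed(123)
measurements <- data.frame(
  year = rep(2020:2021, each = 10),
  type = rep(c("A", "B"), times = 10),
  length = runif(20),
  depth = runif(20)
)

# Hypothetical variant of analysis(): same logic, but each output name
# carries the source column so that several columns do not clash
analysis_multi <- function(data, x, yearX = NULL) {
  # Subset by year only if one or more years were supplied
  if (!is.null(yearX)) data <- filter(data, year %in% yearX)
  data %>%
    group_by(type) %>%
    summarise(across({{ x }}, .fns = list(
      mean = mean,
      sd = sd,
      min = min,
      max = max,
      median = median,
      range = ~ diff(range(.x))
    ), .names = "{.col}_{.fn}"))
}

# Two columns at once, all years
res <- analysis_multi(measurements, c(depth, length))
names(res)
```

With the original .names = "{.fn}" the two columns would produce identical output names, which is why the column name is added to the glue specification here.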
CodePudding user response:
You can refer to a flexible variable within a function in dplyr with curly-curly brackets ({{ }}). And rather than thinking about whether to filter by year or not, it's probably easier to always filter by year, but set the default to be all years (i.e. by default we don't actually filter anything out). I've also changed your == to %in% to allow more flexibility in the year input.
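To illustrate why %in% is the more forgiving choice once more than one year may be passed, here is a small base R sketch. The years vector is made up for illustration and is not the data used in this answer.

```r
# Made-up year column, for illustration only
years <- c(2018, 2019, 2020, 2018, 2019, 2020)

# With a single year, == and %in% select the same rows
identical(years == 2018, years %in% 2018)

# With several years, == recycles the right-hand side and compares
# position by position instead of testing membership
eq_mask <- years == c(2018, 2020)
in_mask <- years %in% c(2018, 2020)

sum(eq_mask)  # 2 rows kept: only positions that happen to line up
sum(in_mask)  # 4 rows kept: every 2018 and 2020 entry
```

So a filter written with == can silently drop rows when a vector of years is supplied, whereas %in% keeps every row whose year is in the set.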
library(dplyr)
set.seed(1)
measurements <-
  tibble(
    year = rep(2012:2020, each = 5),
    depth = rnorm(45),
    length = rnorm(45),
    type = rep(c("A", "B", "C"), 15)
  )
all_years <- unique(measurements$year)
analysis <- function(data = measurements, summary_var = depth, yearX = all_years){
  df <-
    data %>%
    filter(year %in% yearX) %>%
    group_by(type) %>%
    summarise(mBD = mean({{summary_var}}),
              sdBD = sd({{summary_var}}),
              minBD = min({{summary_var}}),
              maxBD = max({{summary_var}}),
              median = median({{summary_var}}),
              range = (max({{summary_var}}) - min({{summary_var}})))
  return(df)
}
Some examples of output
analysis()
#> # A tibble: 3 x 7
#> type mBD sdBD minBD maxBD median range
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.241 0.832 -1.47 1.60 0.487 3.07
#> 2 B -0.0320 0.874 -2.21 1.51 -0.0162 3.73
#> 3 C 0.0467 0.888 -1.99 1.12 0.388 3.11
analysis(summary_var = length, yearX = 2018)
#> # A tibble: 3 x 7
#> type mBD sdBD minBD maxBD median range
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.183 0.154 0.0743 0.291 0.183 0.217
#> 2 B -0.516 0.103 -0.590 -0.443 -0.516 0.146
#> 3 C 0.00111 NA 0.00111 0.00111 0.00111 0
analysis(summary_var = length, yearX = c(2018:2020))
#> # A tibble: 3 x 7
#> type mBD sdBD minBD maxBD median range
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.104 0.354 -0.304 0.594 0.0743 0.898
#> 2 B 0.170 0.713 -0.590 1.18 0.333 1.77
#> 3 C -0.152 0.966 -1.52 1.06 0.00111 2.59
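One possible refinement, sketched here as an assumption rather than as part of the answer above: since R evaluates default arguments lazily inside the function, the default for yearX can be computed from whatever data is passed in, which removes the dependence on the global all_years object. The name analysis2 is hypothetical.

```r
library(dplyr)

set.seed(1)
measurements <- tibble(
  year = rep(2012:2020, each = 5),
  depth = rnorm(45),
  length = rnorm(45),
  type = rep(c("A", "B", "C"), 15)
)

# Hypothetical variant: the default for yearX is derived from `data`
# itself, so the function also works for data frames with other years
analysis2 <- function(data, summary_var = depth, yearX = unique(data$year)) {
  data %>%
    filter(year %in% yearX) %>%
    group_by(type) %>%
    summarise(
      mBD = mean({{ summary_var }}),
      sdBD = sd({{ summary_var }}),
      minBD = min({{ summary_var }}),
      maxBD = max({{ summary_var }}),
      median = median({{ summary_var }}),
      range = max({{ summary_var }}) - min({{ summary_var }})
    )
}

all_res <- analysis2(measurements)                  # all years, depth
one_res <- analysis2(measurements, length, 2018)    # one year, length
```

The trade-off is that data no longer has a default here, so it must always be passed explicitly.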