I'm sure I'm missing something about how grouping works. When I use my own function within a summarize statement (after grouping) I get the same result for each group, which is wrong. Also I don't get any errors or warnings, it's just silently giving me the wrong answer.
My goal is to get this custom function to play nice with group_by.
Here is the code:
library(dplyr)
#data
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)
#scores
c_scores <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
  c = c(5, 6, 6, 10)
)
#custom fun
my_fun <- function(data, scores){
  join <- left_join(data, scores, by = "acronym")
  mean <- mean(join$c)
  return(mean)
}
#this works
my_fun(transect, c_scores)
#this also works
transect %>% my_fun(., c_scores)
#this doesn't...
transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(., scores = c_scores))
This is my result:
quad_id | mean_c |
---|---|
1 | 6.29 |
2 | 6.29 |
This is what I want:
quad_id | mean_c |
---|---|
1 | 6.75 |
2 | 5.67 |
CodePudding user response:
We can pass cur_data() to the function instead of . because, inside summarise, . is bound by %>% to the whole grouped data frame rather than to the subset of rows in the current group:
library(dplyr)
transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(cur_data(), scores = c_scores))
-output
# A tibble: 2 × 2
quad_id mean_c
<dbl> <dbl>
1 1 6.75
2 2 5.67
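To see why . gives the same answer for every group, a small check like the one below may help. It is only a sketch built on the question's transect data: it compares the number of rows that . carries inside summarise with the number of rows in the current group, using n(). The column names rows_in_dot and rows_in_group are made up for this illustration.

```r
library(dplyr)

# Same example data as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)

counts <- transect %>%
  group_by(quad_id) %>%
  summarise(
    rows_in_dot   = nrow(.),  # . is the whole grouped data frame piped in
    rows_in_group = n()       # n() counts rows in the current group only
  )

counts
```

Here rows_in_dot should be 7 for both groups, while rows_in_group is 4 and 3, which is why my_fun(., ...) computes the overall mean twice.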
If we want a message when the data passed in is grouped, we can check with is_grouped_df:
my_fun2 <- function(data, scores) {
  # Tell the user when a grouped data frame (not one group's rows) was passed
  if (dplyr::is_grouped_df(data)) {
    message("data is grouped, so use cur_data() as data")
  }
  left_join(data, scores, by = "acronym") %>%
    pull(c) %>%
    mean()
}
-testing
> transect %>%
group_by(quad_id) %>%
summarise(mean_c = my_fun2(., scores = c_scores))
data is grouped, so use cur_data() as data
data is grouped, so use cur_data() as data
# A tibble: 2 × 2
quad_id mean_c
<dbl> <dbl>
1 1 6.29
2 2 6.29
> transect %>%
group_by(quad_id) %>%
summarise(mean_c = my_fun2(cur_data(), scores = c_scores))
# A tibble: 2 × 2
quad_id mean_c
<dbl> <dbl>
1 1 6.75
2 2 5.67
Note that the message is repeated because, inside summarise, the function is called once per group (here, two groups). If we call it outside summarise, the message is printed only once:
> transect %>%
group_by(quad_id) %>%
my_fun2(., c_scores)
data is grouped, so use cur_data() as data
[1] 6.285714
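A note for newer dplyr versions: cur_data() is deprecated as of dplyr 1.1.0 in favor of pick(). A minimal sketch of the same per-group call, assuming dplyr 1.1.0 or later and reusing the question's data and my_fun (the result name by_quad is just for illustration):

```r
library(dplyr)

# Same example data and function as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)
c_scores <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
  c = c(5, 6, 6, 10)
)
my_fun <- function(data, scores) {
  joined <- left_join(data, scores, by = "acronym")
  mean(joined$c)
}

# pick(everything()) returns the current group's rows (without the grouping
# column), which is what cur_data() returned in older dplyr versions
by_quad <- transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(pick(everything()), scores = c_scores))

by_quad
```

This should give the same per-group means as the cur_data() version, without the deprecation warning.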
If we want a single function that handles the grouping itself, we may also do:
my_fun3 <- function(data, scores, grps = NULL) {
  data <- left_join(data, scores, by = "acronym")
  if (!missing(grps)) {
    data <- data %>%
      group_by(across(all_of(grps)))
  }
  data %>%
    summarise(mean_c = mean(c, na.rm = TRUE))
}
-testing
> my_fun3(transect, c_scores, "quad_id")
# A tibble: 2 × 2
quad_id mean_c
<dbl> <dbl>
1 1 6.75
2 2 5.67
>
> my_fun3(transect, c_scores)
mean_c
1 6.285714
Or simplify it by dropping the if/missing check and using any_of in group_by, which selects no columns when grps is NULL:
my_fun3 <- function(data, scores, grps = NULL) {
  left_join(data, scores, by = "acronym") %>%
    group_by(across(any_of(grps))) %>%
    summarise(mean_c = mean(c, na.rm = TRUE))
}
-testing
> my_fun3(transect, c_scores, "quad_id")
# A tibble: 2 × 2
quad_id mean_c
<dbl> <dbl>
1 1 6.75
2 2 5.67
> my_fun3(transect, c_scores)
# A tibble: 1 × 1
mean_c
<dbl>
1 6.29
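If the goal is to keep my_fun exactly as written in the question, another option might be group_modify, which calls a function once per group and passes that group's rows as .x. This is only a sketch of that alternative, not something the question asked for; by_quad is an illustrative name.

```r
library(dplyr)

# Same example data and function as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)
c_scores <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
  c = c(5, 6, 6, 10)
)
my_fun <- function(data, scores) {
  joined <- left_join(data, scores, by = "acronym")
  mean(joined$c)
}

# group_modify() expects a data frame back from each call, so the scalar
# returned by my_fun is wrapped in a one-row tibble
by_quad <- transect %>%
  group_by(quad_id) %>%
  group_modify(~ tibble(mean_c = my_fun(.x, c_scores))) %>%
  ungroup()

by_quad
```

The grouping column quad_id is added back to the result automatically, so the output has one row per group with quad_id and mean_c.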