When I use my own function in a group_by() and summarize() chain, it incorrectly returns the same re

Time:10-28

I'm sure I'm missing something about how grouping works. When I use my own function inside summarise() after group_by(), I get the same result for every group, which is wrong. I don't get any errors or warnings either; it just silently gives me the wrong answer.

My goal is to get this custom function to play nice with group_by.

Here is the code:

library(dplyr)

#data
transect <- data.frame(acronym  = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
                       quad_id = c(1, 1, 1, 1, 2, 2, 2))
#scores
c_scores <- data.frame(acronym  = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
                       c = c(5, 6, 6, 10))

#custom fun
my_fun <- function(data, scores){
  join <- left_join(data, scores, by = "acronym")
  mean <- mean(join$c)
  return(mean)
}

#this works
my_fun(transect, c_scores)

#this also works
transect %>% my_fun(., c_scores)

#this doesn't...
transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(., scores = c_scores))

this is my result:

quad_id mean_c
1 6.29
2 6.29

this is what I want:

quad_id mean_c
1 6.75
2 5.67

CodePudding user response:

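We can pass cur_data() to the function instead of the dot (.). In a magrittr pipe, . refers to the whole object on the left-hand side of %>% (here, all seven rows of the grouped transect), not to the subset of rows in the current group, so every group gets the overall mean. cur_data() returns only the current group's rows.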

library(dplyr)
transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(cur_data(), scores = c_scores))

-output

# A tibble: 2 × 2
  quad_id mean_c
    <dbl>  <dbl>
1       1   6.75
2       2   5.67
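
A minimal sketch of why the dot misbehaves here, using the question's transect data: inside summarise(), nrow(.) reports the size of the whole piped-in data frame, while n() counts only the rows of the current group. The column names rows_in_dot and rows_in_group are made up for illustration.

```r
library(dplyr)

# Same data as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)

row_counts <- transect %>%
  group_by(quad_id) %>%
  summarise(
    rows_in_dot   = nrow(.),  # the dot is the full grouped data frame (7 rows)
    rows_in_group = n()       # n() counts rows in the current group only
  )

print(row_counts)
```

rows_in_dot is 7 for both groups, whereas rows_in_group is 4 and 3, which is why my_fun(., ...) computes the same overall mean twice.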

If we want a message whenever the function receives grouped data, we can check for it with is_grouped_df()

my_fun2 <- function(data, scores) {
  if (dplyr::is_grouped_df(data)) {
    message("data is grouped, so use cur_data() as data")
  }

  left_join(data, scores, by = "acronym") %>%
    pull(c) %>%
    mean()
}

-testing

 > transect %>%
     group_by(quad_id) %>%
     summarise(mean_c = my_fun2(., scores = c_scores))
 data is grouped, so use cur_data() as data
 data is grouped, so use cur_data() as data
 # A tibble: 2 × 2
   quad_id mean_c
     <dbl>  <dbl>
 1       1   6.29
 2       2   6.29
 > transect %>%
     group_by(quad_id) %>%
     summarise(mean_c = my_fun2(cur_data(), scores = c_scores))
 # A tibble: 2 × 2
   quad_id mean_c
     <dbl>  <dbl>
 1       1   6.75
 2       2   5.67
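
As a side note, cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). Assuming dplyr >= 1.1.0 is installed, the same call can be sketched with pick(everything()), which also returns the current group's rows without the grouping columns. The result name mean_by_quad is just for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() is available

# Data and function as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)
c_scores <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
  c = c(5, 6, 6, 10)
)

my_fun <- function(data, scores) {
  joined <- left_join(data, scores, by = "acronym")
  mean(joined$c)
}

# pick(everything()) hands my_fun only the rows of the current group
mean_by_quad <- transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(pick(everything()), scores = c_scores))

print(mean_by_quad)
```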

Note that the message is repeated because, inside summarise, the function is called once per group (here, twice), and each call receives the full grouped data frame through the dot. If we call the function outside summarise, the message is printed only once

> transect %>% 
    group_by(quad_id) %>% 
    my_fun2(., c_scores)
data is grouped, so use cur_data() as data
[1] 6.285714

If we want a single function that handles the grouping itself, we may also do

my_fun3 <- function(data, scores, grps = NULL) {
  data <- left_join(data, scores, by = "acronym")
  if (!missing(grps)) {
    data <- data %>%
      group_by(across(all_of(grps)))
  }
  data %>%
    summarise(mean_c = mean(c, na.rm = TRUE))
}

-testing

>  my_fun3(transect, c_scores, "quad_id")
# A tibble: 2 × 2
  quad_id mean_c
    <dbl>  <dbl>
1       1   6.75
2       2   5.67
> 
> my_fun3(transect, c_scores)
    mean_c
1 6.285714

Or drop the if/missing() check altogether by using any_of() in group_by(): with the default grps = NULL it selects no columns, so the data is summarised as a whole

my_fun3 <- function(data, scores, grps = NULL) {
  left_join(data, scores, by = "acronym") %>%
    group_by(across(any_of(grps))) %>%
    summarise(mean_c = mean(c, na.rm = TRUE))
}

-testing

> my_fun3(transect, c_scores, "quad_id")
# A tibble: 2 × 2
  quad_id mean_c
    <dbl>  <dbl>
1       1   6.75
2       2   5.67
> my_fun3(transect, c_scores)
# A tibble: 1 × 1
  mean_c
   <dbl>
1   6.29
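
One design consequence of any_of() worth keeping in mind: names that do not exist in the data are silently ignored, whereas all_of() would raise an error. A small sketch of that behaviour, where "not_a_column" is a made-up name that is deliberately absent from transect:

```r
library(dplyr)

# Data and function as above
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)
c_scores <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
  c = c(5, 6, 6, 10)
)

my_fun3 <- function(data, scores, grps = NULL) {
  left_join(data, scores, by = "acronym") %>%
    group_by(across(any_of(grps))) %>%
    summarise(mean_c = mean(c, na.rm = TRUE))
}

# "not_a_column" is hypothetical; any_of() drops it and groups by quad_id only
res <- my_fun3(transect, c_scores, c("quad_id", "not_a_column"))
print(res)
```

So a misspelled grouping column will not fail loudly with this version; if that matters, the all_of() variant is the stricter choice.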