`dplyr` group_by column not found

Time:11-04

I'm working on an R Shiny app and am having trouble with dplyr's group_by() function. I have two functions:

  • gather_info: finds the category with the highest/lowest mean value
  • paste_info: calls gather_info and returns the corresponding category and value

The purpose is to return a string that - given a data frame and categorical variable - states the highest- and lowest-performing category and value of said category.

Calling gather_info with the appropriate arguments works as expected. However, paste_info consistently returns:

Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `grp.col` is not found.

Here's a reproducible example, where the desired output of paste_info is "Given your data, your best performing group is Cat1 scoring 90% and your worst performing group is Cat2 scoring 20%.":

gather_info <- function(df, grp.col, maxm) {
    df |> 
        mutate_if(
            .predicate = function(x) is.character(x),
            .funs = function(x) str_to_title(x)
        ) |> 
        group_by({{ grp.col }}) |> 
        summarize(percentage = round(mean(value, na.rm=TRUE) * 100, 2)) |> 
        arrange(desc(percentage)) %>%  # magrittr pipe: the braces below rely on its `.` placeholder
        {if (maxm) head(., 1) else tail(., 1)}
}

paste_info <- function(df, grp.col) {
    
    high_df <- gather_info(df, grp.col, maxm=TRUE)
    low_df <- gather_info(df, grp.col, maxm=FALSE)

    paste0("Given your data, your best performing group is ",
          high_df |> pull(grp.col), " scoring ", high_df$percentage, "%",
          " and your worst performing group is ",
          low_df |> pull(grp.col), " scoring ", low_df$percentage, "%.")
}


df <- data.frame(
    category=c('cat1', 'cat1', 'cat2', 'cat2', 'cat2', 'cat3', 'cat3'),
    value=c(1,0.8,0.2,0.3,0.1,0.5,0.5)
)

# returns category, value with highest mean value
gather_info(df, category, maxm=TRUE)

# returns category, value with lowest mean value
gather_info(df, category, maxm=FALSE)

# does not work
paste_info(df, category)

Any help is much appreciated. Thank you!

CodePudding user response:

The issue is that inside paste_info you have to use {{ both when you pass the grouping column grp.col on to gather_info and when you call pull. This is the same reason you have to use {{ in group_by inside gather_info.

In some sense, {{ translates gather_info(df, {{ grp.col }}, maxm = TRUE) into gather_info(df, category, maxm = TRUE), i.e. you pass category to gather_info. Without {{, the column name supplied for grp.col is not "injected" into the function call. Hence gather_info receives the bare symbol grp.col as is and interprets it as the name of the grouping column. But as there is no column named grp.col in your data, you get an error.
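A minimal sketch of the difference, separate from the code in the question. The helpers count_by, wrap_ok and wrap_bad and the data frame toy are hypothetical names used only for this illustration:

```r
library(dplyr)

# Hypothetical inner function: groups by whatever column the caller supplies.
count_by <- function(df, col) {
  df |>
    group_by({{ col }}) |>
    summarize(n = n())
}

# Forwards the caller's column with {{ }}, so count_by() sees `g`.
wrap_ok <- function(df, col) count_by(df, {{ col }})

# Forwards the bare symbol, so count_by() looks for a column literally named `col`.
wrap_bad <- function(df, col) count_by(df, col)

toy <- data.frame(g = c("a", "a", "b"))

ok <- wrap_ok(toy, g)

# Capture the error instead of stopping the script.
bad <- tryCatch(wrap_bad(toy, g), error = function(e) e)
```

Here ok holds one row per value of g, while bad holds the same kind of "column not found" error as in the question, only with col in place of grp.col.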

For more info on why {{ is needed, see "What is data-masking and why do I need {{?".

library(dplyr)
library(stringr)  # provides str_to_title(), which gather_info uses

paste_info <- function(df, grp.col) {
  high_df <- gather_info(df, {{ grp.col }}, maxm = TRUE)
  low_df <- gather_info(df, {{ grp.col }}, maxm = FALSE)

  paste0(
    "Given your data, your best performing group is ",
    high_df |> pull({{ grp.col }}), " scoring ", high_df$percentage, "%",
    " and your worst performing group is ",
    low_df |> pull({{ grp.col }}), " scoring ", low_df$percentage, "%."
  )
}

paste_info(df, category)
#> [1] "Given your data, your best performing group is Cat1 scoring 90% and your worst performing group is Cat2 scoring 20%."
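If the column name reaches the function as a string instead (an assumption about the app, since Shiny inputs usually deliver strings), {{ is not the right tool. One possible sketch selects the column by name with across(all_of()). The functions gather_info_chr and paste_info_chr are hypothetical variants written for this illustration, not part of the original code:

```r
library(dplyr)
library(stringr)

# Hypothetical variant of gather_info(): `grp_name` is a character scalar.
gather_info_chr <- function(df, grp_name, maxm) {
  ranked <- df |>
    mutate(across(where(is.character), str_to_title)) |>
    group_by(across(all_of(grp_name))) |>
    summarize(percentage = round(mean(value, na.rm = TRUE) * 100, 2)) |>
    arrange(desc(percentage))
  # First row is the best group, last row the worst.
  if (maxm) head(ranked, 1) else tail(ranked, 1)
}

# Hypothetical variant of paste_info(): the string is passed along unchanged.
paste_info_chr <- function(df, grp_name) {
  high_df <- gather_info_chr(df, grp_name, maxm = TRUE)
  low_df <- gather_info_chr(df, grp_name, maxm = FALSE)
  paste0(
    "Given your data, your best performing group is ",
    high_df[[grp_name]], " scoring ", high_df$percentage, "%",
    " and your worst performing group is ",
    low_df[[grp_name]], " scoring ", low_df$percentage, "%."
  )
}

df <- data.frame(
  category = c("cat1", "cat1", "cat2", "cat2", "cat2", "cat3", "cat3"),
  value = c(1, 0.8, 0.2, 0.3, 0.1, 0.5, 0.5)
)

msg <- paste_info_chr(df, "category")
```

Because grp_name is an ordinary string, it can be forwarded between functions like any other argument, and [[ pulls the column from the result.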