dplyr: mutate not applying function to grouped data

I'm looking for some help with a problem where the mutate function within a function I'm writing doesn't seem to be applying by group as I need it to. I'm new to R, so expect this is a basic problem, but I haven't managed to find an answer by searching.

I'm trying to write a function to remove outliers from my dataset. The threshold for an outlier is defined individually for each participant as the participant's upper quartile plus 1.5 times the participant's interquartile range. The function I have written is below. I want it to add a column of Booleans indicating whether each observation is an outlier, and a column showing the outlier threshold used for that calculation.

# Split into distance groups and mark outliers. Takes a data frame or tibble
# as input. Column determines which column to check for outliers, participant
# says which column to group observations by. Outliers are defined by Tukey's
# definition (Q3 + 1.5*IQR)

mark_outliers <- function(data, column, participant){
        library (dplyr)
        define_outlier <- function(column){
                out_define <- quantile(column, probs = 0.75) + 1.5*IQR(column)
                out_define
        }
        
        as_tibble(data) %>% group_by(participant) %>%
        mutate(is_outlier = (column > define_outlier(column)), 
               outlier_threshhold = define_outlier(column))
        
        }

When I run the function, however, I get this error:

Error in `mutate()`:
! Problem while computing `is_outlier = (column > define_outlier(column))`.
✖ `is_outlier` must be size 160 or 1, not 3520.
ℹ The error occurred in group 1: participant = "ANNI".
Run `rlang::last_error()` to see where the error occurred.

My entire dataset has 3520 rows and each group has 160, which indicates that mutate is trying to apply the function to the entire dataset, not just the group it has been handed.

What have I done wrong? My problem seems similar to the one in the question "dplyr mutate not applying to individual element of field", but I've tried using the Vectorize function on both define_outlier and mark_outliers with no change in output. I wonder if the problem is with how the define_outlier function is written, but I can't work out what I should do differently.

Update: If I run the function with the actual name of the column, rather than the variable column, I get the correct answer.

mark_outliers <- function(data, column, participant){
        library (dplyr)
        #add a column with the outlier definition for each participant
        as_tibble(data) %>% group_by(participant) %>%
                mutate(outlier_threshhold =
                               quantile(answer_response.rt, probs = 0.75) +
                               1.5*IQR(answer_response.rt))
        }

It seems that when the variable column is passed, it passes the global version rather than the grouped one. Is there a way around that?
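One way to see the difference, assuming the function is called with a whole vector such as data$answer_response.rt (the actual call is not shown here): inside a grouped dplyr verb, a bare column name is looked up within the current group, while a variable defined outside the data frame keeps its full length. The toy data and names below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two participants, three observations each
toy <- tibble(id = rep(c("a", "b"), each = 3),
              rt = c(1, 2, 3, 10, 20, 30))

# A plain vector extracted beforehand; it knows nothing about the groups
whole_vector <- toy$rt

per_group <- toy %>%
  group_by(id) %>%
  summarise(n_from_column = length(rt),            # sees only the current group
            n_from_vector = length(whole_vector))  # always the full vector

per_group
```

The first count is the group size and the second is the size of the whole dataset, which matches the two sizes named in the error message.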

CodePudding user response：

Welcome to R! Perhaps go with group_modify() rather than mutate(). It applies a function to each group of a grouped tibble and returns the combined result. It is documented together with group_map(), which takes the same arguments. See:

https://dplyr.tidyverse.org/reference/group_map.html

dplyr::group_map(.data, .f, ..., .keep = FALSE)
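A minimal sketch of that suggestion, using mtcars as stand-in data (grouped by gear, checking mpg). The helper name flag_group is hypothetical, and the column names are hard-coded here for brevity rather than passed as arguments.

```r
library(dplyr)

# Hypothetical helper: receives one group's rows (and its key) and returns
# those rows with the threshold and the outlier flag added.
flag_group <- function(rows, key) {
  threshold <- unname(quantile(rows$mpg, probs = 0.75) + 1.5 * IQR(rows$mpg))
  mutate(rows,
         outlier_threshold = threshold,
         is_outlier = mpg > threshold)
}

result <- mtcars %>%
  group_by(gear) %>%
  group_modify(flag_group) %>%  # runs flag_group once per gear
  ungroup()

result
```

Because flag_group only ever sees one group's rows, the quantile and IQR are computed per group without relying on mutate's grouping behaviour.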

CodePudding user response：

Here I add in the use of the {{ }} or "embrace" operator to control the context in which dplyr interprets the column names in the function.

library(dplyr)
define_outlier <- function(column){
  # column is an ordinary numeric vector here, so no {{ }} is needed
  quantile(column, probs = 0.75) + 1.5*IQR(column)
}

mark_outliers <- function(data, column, participant){
  as_tibble(data) %>% 
    group_by( {{participant}} ) %>%
    mutate(outlier_threshold = define_outlier( {{column}} )) %>%
    ungroup() %>%
    mutate(is_outlier = {{column}} > outlier_threshold)
}

Now let's test on some altered data:

mtcars_alt <- mtcars
mtcars_alt[1,1] = 50 # Making first car's mpg 50, an outlier
mark_outliers(mtcars_alt, mpg, gear)

I'm not sure if it's working as you want, but I am getting different values of outlier threshold for each gear.

# A tibble: 32 × 13
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb outlier_threshold is_outlier
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>             <dbl> <lgl>     
 1  50       6  160    110  3.9   2.62  16.5     0     1     4     4              45.3 TRUE      
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4              45.3 FALSE     
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1              45.3 FALSE     
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1              24.2 FALSE     
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2              24.2 FALSE     
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1              24.2 FALSE     
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4              24.2 FALSE     
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2              45.3 FALSE     
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2              45.3 FALSE     
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4              45.3 FALSE     
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
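Since the stated goal is to remove the outliers, a possible follow-up step is to filter on the new flag. This is only a sketch: the functions are repeated so the block runs on its own, and the names marked and cleaned are hypothetical.

```r
library(dplyr)

# Helper functions repeated from this answer so the block is self-contained
define_outlier <- function(column) {
  quantile(column, probs = 0.75) + 1.5 * IQR(column)
}

mark_outliers <- function(data, column, participant) {
  as_tibble(data) %>%
    group_by({{ participant }}) %>%
    mutate(outlier_threshold = define_outlier({{ column }})) %>%
    ungroup() %>%
    mutate(is_outlier = {{ column }} > outlier_threshold)
}

mtcars_alt <- mtcars
mtcars_alt[1, 1] <- 50  # make the first car's mpg an outlier

marked  <- mark_outliers(mtcars_alt, mpg, gear)
cleaned <- marked %>% filter(!is_outlier)  # keep only non-outlier rows

nrow(marked) - nrow(cleaned)  # number of rows dropped
```

Keeping the marking and the filtering as separate steps makes it easy to inspect which rows were flagged before discarding them.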