I'm looking for help with a problem: the mutate function inside a function I'm writing doesn't seem to apply by group as I need it to. I'm new to R, so I expect this is a basic problem, but I haven't managed to find an answer by searching.
I'm trying to write a function to remove outliers from my dataset. The threshold for an outlier is defined individually for each participant as the participant's upper quartile plus 1.5 times the participant's interquartile range. The function I have written is below. I'm trying to get it to add a column of Booleans indicating whether each observation is an outlier, and a column showing the outlier threshold used for that calculation.
# Split into distance groups and mark outliers. Takes a data frame or tibble
# as input. Column determines which column to check for outliers, participant
# says which column to group observations by. Outliers are defined by Tukey's
# definition (Q3 + 1.5*IQR)
mark_outliers <- function(data, column, participant){
  library(dplyr)
  define_outlier <- function(column){
    out_define <- quantile(column, probs = 0.75) + 1.5*IQR(column)
    out_define
  }
  as_tibble(data) %>% group_by(participant) %>%
    mutate(is_outlier = (column > define_outlier(column)),
           outlier_threshhold = define_outlier(column))
}
When I run the function however, I get the error
Error in `mutate()`:
! Problem while computing `is_outlier = (column > define_outlier(column))`.
✖ `is_outlier` must be size 160 or 1, not 3520.
ℹ The error occurred in group 1: participant = "ANNI".
Run `rlang::last_error()` to see where the error occurred.
My entire dataset size is 3520, and the group size is 160, which indicates that mutate is trying to apply the function to the entire dataset, not just the group it has been handed.
What have I done wrong? My problem seems similar to the one in the question "dplyr mutate not applying to individual element of field", but I've tried using the Vectorize function on both define_outlier and mark_outliers with no change in output. I wonder if the problem is with how the define_outlier function is written, but I can't work out what I should do differently.
Update
If I run the function giving it the actual name of the column, rather than the variable column, I get the correct answer.
mark_outliers <- function(data, column, participant){
  library(dplyr)
  # add a column with the outlier definition for each participant
  as_tibble(data) %>% group_by(participant) %>%
    mutate(outlier_threshhold =
             quantile(answer_response.rt, probs = 0.75) +
             1.5*IQR(answer_response.rt))
}
It seems that when the variable column is passed, it passes the global version rather than the grouped one. Is there a way around that?
CodePudding user response:
Welcome to R! Perhaps go with group_modify rather than mutate. This will apply the function by factor level (or group) to a grouped tibble. See:
https://dplyr.tidyverse.org/reference/group_map.html
dplyr::group_map(.data, .f, ..., .keep = FALSE)
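As a rough illustration of that suggestion, here is a minimal sketch using group_modify. The names upper_fence and mark_outliers_gm are hypothetical, and the sketch assumes the column names are passed as strings, which differs from how the function in the question is called.

```r
library(dplyr)

# Hypothetical helper: Tukey's upper fence (Q3 + 1.5 * IQR) of a numeric vector.
upper_fence <- function(x) {
  unname(quantile(x, probs = 0.75) + 1.5 * IQR(x))
}

# Assumption: `column` and `participant` are column names given as strings.
mark_outliers_gm <- function(data, column, participant) {
  as_tibble(data) %>%
    group_by(across(all_of(participant))) %>%
    # group_modify() calls the function once per group; .x holds only that
    # group's rows (without the grouping column), .y holds the group key.
    group_modify(function(.x, .y) {
      threshold <- upper_fence(.x[[column]])
      mutate(.x,
             outlier_threshold = threshold,
             is_outlier = .x[[column]] > threshold)
    }) %>%
    ungroup()
}

# Example call on a built-in dataset
mark_outliers_gm(mtcars, "mpg", "gear")
```

Note that group_modify returns the rows ordered by group, so the row order may differ from the input.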
CodePudding user response:
Here I add in the use of the {{ }} or "embrace" operator to control the context in which dplyr interprets the column names in the function.
library(dplyr)
# define_outlier receives an ordinary vector, so it needs no {{ }} itself
define_outlier <- function(column){
  quantile(column, probs = 0.75) + 1.5*IQR(column)
}
mark_outliers <- function(data, column, participant){
  as_tibble(data) %>%
    group_by( {{participant}} ) %>%
    # {{ }} forwards the column name so mutate evaluates it within each group
    mutate(outlier_threshold = define_outlier( {{column}} )) %>%
    ungroup() %>%
    mutate(is_outlier = {{column}} > outlier_threshold)
}
Now let's test on some altered data:
mtcars_alt <- mtcars
mtcars_alt[1,1] = 50 # Making first car's mpg 50, an outlier
mark_outliers(mtcars_alt, mpg, gear)
I'm not sure if it's working as you want, but I am getting different values of outlier threshold for each gear.
# A tibble: 32 × 13
mpg cyl disp hp drat wt qsec vs am gear carb outlier_threshold is_outlier
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 50 6 160 110 3.9 2.62 16.5 0 1 4 4 45.3 TRUE
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 45.3 FALSE
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 45.3 FALSE
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 24.2 FALSE
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 24.2 FALSE
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 24.2 FALSE
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 24.2 FALSE
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 45.3 FALSE
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 45.3 FALSE
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 45.3 FALSE
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
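To see why the embrace operator makes the difference, the following small sketch may help. It compares a function that uses its argument directly with one that embraces it. The names group_len_plain and group_len_embraced are hypothetical and exist only for this comparison; the plain version is assumed to be called with a full vector such as mtcars$mpg, which is one way to end up with the size mismatch reported in the question.

```r
library(dplyr)

# Without {{ }}: `column` is an ordinary R value. If the caller passes a whole
# vector, every group sees that same full-length vector.
group_len_plain <- function(data, column, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(n_values = length(column))
}

# With {{ }}: the column name is forwarded to summarise(), which looks it up
# in the data separately for each group.
group_len_embraced <- function(data, column, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(n_values = length({{ column }}))
}

group_len_plain(mtcars, mtcars$mpg, gear)   # n_values is 32 for every gear
group_len_embraced(mtcars, mpg, gear)       # n_values is 15, 12, 5
```

In the plain version each group receives all 32 values, which is the same kind of mismatch as "must be size 160 or 1, not 3520" in the original error.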