Unable to write a correct user-defined function in dplyr for outlier treatment in R

Time:08-20

I am trying to write a function that fixes outliers in a variable, but I get errors when I write it in dplyr style.

fn_outlier_fix <- function(x, df){
  x = enquo(x)
  
  Q1 = df %>% pull(!!x) %>% quantile(0.25) %>% unname()
  Q3 = df %>% pull(!!x) %>% quantile(0.75) %>% unname()
  IQR = Q3 - Q1
  UC = Q3 + (1.5 * IQR)
  LC = Q3 - (1.5 * IQR)
  
  df <- df %>% 
    mutate(!!x := if_else(x > UC,UC,!!x),
           !!x := if_else(x < LC,LC,!!x))
}
library(dplyr)

df_test <- tribble(
  ~sales, ~var1, ~var2,
  22, 230.1,  37.8,
  10, 44.5,  39.3,
  9,  17.2,  45.9,
  19, 151.5,  41.3,
  13, 180.8,  10.8,
  7,  8.7,    48.9,
  12, 57.5,   32.8,
  13, 120.2,  19.6,
  5,  8.6,    2.1,
  11, 199.8,  2.6)
fn_outlier_fix(x = var1, df = df_test)

Error:

Error in `mutate()`:
! Problem while computing `var1 = if_else(x > UC, UC, var1)`.
Caused by error in `if_else()`:
! Base operators are not defined for quosures. Do you need to unquote the quosure?

# Bad: myquosure > rhs

# Good: !!myquosure > rhs
Backtrace:
 1. global fn_outlier_fix(x = var1, df = df_test)
 9. rlang:::Ops.quosure(x, UC)

I don't know why it's so complicated to write functions with dplyr in R compared to Python. I managed to write the function in the form below, which works, but I would still like to get the code above working for my own understanding. I'd appreciate any help.

The base R version below works:

fn_outlier_fix <- function(x){
  
  Q1 = quantile(x, 0.25)
  Q3 = quantile(x, 0.75)
  IQR = Q3 - Q1
  UC = Q3 + (1.5 * IQR)
  LC = Q3 - (1.5 * IQR)
  
  x[x > UC] <- UC
  x[x < LC] <- LC
  
  x <- x
}

CodePudding user response:

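You were nearly there; you just forgot to unquote x inside the if_else() calls. After x = enquo(x), x is a quosure (a captured expression), so x > UC compares the quosure itself with a number, whereas !!x > UC compares the column's values. This function works: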

fn_outlier_fix <- function(x, df){
  x = enquo(x)
  
  Q1 = df %>% pull(!!x) %>% quantile(0.25) %>% unname()
  Q3 = df %>% pull(!!x) %>% quantile(0.75) %>% unname()
  IQR = Q3 - Q1
  UC = Q3 + (1.5 * IQR)
  # Lower cutoff is measured from Q1 (the question's code subtracted from Q3)
  LC = Q1 - (1.5 * IQR)
  
  df <- df %>% 
    mutate(!!x := if_else(!!x > UC,UC,!!x),
           !!x := if_else(!!x < LC,LC,!!x))
  
  df
}

Writing functions around dplyr is more complicated because dplyr uses non-standard evaluation (NSE) to look up column names inside the data frame. The "Programming with dplyr" vignette covers this in detail.
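To make the quosure idea concrete, here is a minimal sketch using rlang directly. The helper name capture_col and the one-column data frame are made up for illustration; they are not part of the original code. It shows that enquo() hands back an expression rather than data, which is why it has to be unquoted or evaluated against a data frame before it can be compared with a number.

```r
library(rlang)

# Hypothetical helper: returns the quosure that enquo() captures
capture_col <- function(x) enquo(x)

q <- capture_col(var1)

is_quosure(q)   # TRUE: q is a captured expression, not a vector
as_label(q)     # "var1"

# The quosure only yields values once it is evaluated against some data
toy <- data.frame(var1 = 1:3)
vals <- eval_tidy(q, data = toy)
vals > 2        # FALSE FALSE TRUE
```

Inside dplyr verbs, !!x does this unquoting for you, so the comparison runs on the column rather than on the quosure.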

The recommended way to handle NSE in dplyr has since changed again: the embrace operator {{ }} replaces the enquo() / !! pair, so best practice now looks like this:

fn_outlier_fix_2 <- function(x, df){
  
  Q1 = df %>% pull({{x}}) %>% quantile(0.25) %>% unname()
  Q3 = df %>% pull({{x}}) %>% quantile(0.75) %>% unname()
  IQR = Q3 - Q1
  UC = Q3 + (1.5 * IQR)
  # Lower cutoff is measured from Q1 (the question's code subtracted from Q3)
  LC = Q1 - (1.5 * IQR)
  
  df <- df %>% 
    mutate({{x}} := if_else({{x}} > UC,UC,{{x}}),
           {{x}} := if_else({{x}} < LC,LC,{{x}}))
  
  df
}
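As a quick sanity check, the same idea can be sketched in a more compact form. This is only an illustration under stated assumptions: the name cap_outliers, the argument order, the na.rm = TRUE choice, and the toy tibble are all made up here, and pmin()/pmax() are swapped in for the two if_else() calls. It assumes the usual fences of Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.

```r
library(dplyr)

# Hypothetical helper, not from the original post: caps a column at the
# 1.5 * IQR fences using the {{ }} (embrace) operator.
cap_outliers <- function(df, col) {
  vals <- pull(df, {{ col }})
  q1 <- unname(quantile(vals, 0.25, na.rm = TRUE))
  q3 <- unname(quantile(vals, 0.75, na.rm = TRUE))
  iqr <- q3 - q1
  upper <- q3 + 1.5 * iqr
  lower <- q1 - 1.5 * iqr

  # pmin() caps values at the upper fence, pmax() raises them to the lower one
  mutate(df, {{ col }} := pmax(pmin({{ col }}, upper), lower))
}

# Made-up data with one obvious outlier
toy <- tibble(id = 1:6, value = c(1, 2, 3, 4, 5, 100))
capped <- cap_outliers(toy, value)
capped$value   # 1.0 2.0 3.0 4.0 5.0 8.5
```

Taking the data frame as the first argument is a design choice that lets the helper sit in a pipe, for example toy %>% cap_outliers(value).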