How to remove outliers by columns in R

Time:10-30

I have this data frame.

IQ sleep GRE happiness
105 70 200 15
40 50 150 15
70 20 70 10
150 150 80 6
148 60 900 7
115 10 1200 40
110 90 15 5
120 40 60 12
99 30 70 15
1000 15 30 68
70 60 12 70
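
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the object name `df` is an assumption (the question does not name it), and `read.table()` will store the columns as integers.

```r
# Rebuild the question's table as a data frame.
# The name `df` is an assumption; the question does not name the object.
df <- read.table(header = TRUE, text = "
IQ sleep GRE happiness
105 70 200 15
40 50 150 15
70 20 70 10
150 150 80 6
148 60 900 7
115 10 1200 40
110 90 15 5
120 40 60 12
99 30 70 15
1000 15 30 68
70 60 12 70
")

# Inspect column names and types
str(df)
```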

I would like to remove the outliers for each variable. I do not want to delete a whole row if one value is identified as an outlier. For example, if the outlier for IQ is 40, I just want to delete that 40; I don't want the whole row deleted.

If I define any values that are > mean + 3sd or < mean - 3sd as outliers, what code can I use to do this? If I can achieve this using dplyr and subset, that would be great.

I would expect something like this

IQ sleep GRE happiness
105 70 200 15
50 150 15
70 20 70 10
150 80 6
148 60 900 7
115 40
110 90 5
120 40 60 12
99 30 70 15
15 30 68
70 60 12 70

I have tried the remove_sd_outlier function (from the dataPreparation package), but it deleted an entire row of data. I do not want this.

CodePudding user response:

I think you could rephrase the nested ifelse() as case_when() for something easier to read, but hopefully this works for you.

library(dplyr)

df %>%
  mutate(across(everything(),
                # NA (rather than "") keeps the columns numeric
                ~ ifelse(. > (mean(.) + 3*sd(.)),
                         NA,
                         ifelse(. < (mean(.) - 3*sd(.)),
                                NA, 1*(.)))))

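The case_when() rephrasing mentioned above could look something like this. It is a sketch rather than a tested drop-in: the names `k` and `cleaned` are assumptions, the data frame is rebuilt from the question's values so the example runs on its own, and across() needs dplyr 1.0 or later.

```r
library(dplyr)

# Same values as the question's table; `df` is an assumed name
df <- data.frame(
  IQ        = c(105, 40, 70, 150, 148, 115, 110, 120, 99, 1000, 70),
  sleep     = c(70, 50, 20, 150, 60, 10, 90, 40, 30, 15, 60),
  GRE       = c(200, 150, 70, 80, 900, 1200, 15, 60, 70, 30, 12),
  happiness = c(15, 15, 10, 6, 7, 40, 5, 12, 15, 68, 70)
)

k <- 3  # assumed name for the SD multiplier

cleaned <- df %>%
  mutate(across(
    everything(),
    ~ case_when(
      .x > mean(.x) + k * sd(.x) ~ NA_real_,  # above the upper bound
      .x < mean(.x) - k * sd(.x) ~ NA_real_,  # below the lower bound
      TRUE ~ as.numeric(.x)                   # otherwise keep the value
    )
  ))

cleaned
```

With k set to 3, nothing in the question's data is changed, because no value lies more than 3 SD from its column mean. Lowering k makes the rule stricter.
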
CodePudding user response:

You can use scale() to compute z-scores and across() to apply across all numeric variables. Note none of your example values are > 3 SD from the mean, so I used 2 SD as the threshold for demonstration.

library(dplyr)

df1 %>% 
  mutate(across(
    where(is.numeric),
    ~ ifelse(
      abs(as.numeric(scale(.x))) > 2,
      NA, 
      .x
    )
  ))
# A tibble: 11 × 4
      IQ sleep   GRE happiness
   <dbl> <dbl> <dbl>     <dbl>
 1   105    70   200        15
 2    40    50   150        15
 3    70    20    70        10
 4   150    NA    80         6
 5   148    60   900         7
 6   115    10    NA        40
 7   110    90    15         5
 8   120    40    60        12
 9    99    30    70        15
10    NA    15    30        68
11    70    60    12        70
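
If you want the threshold to be adjustable, the same rule can be wrapped in a small helper. This is a sketch in base R; `na_outliers` and its argument `k` are hypothetical names, not part of any package. It also checks the largest z-score in the IQ column, which is why the 3 SD rule leaves that column untouched.

```r
# Hypothetical helper: set values more than k SDs from the mean to NA.
# Existing NAs are ignored when computing the mean and SD.
na_outliers <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  x[!is.na(x) & abs(x - m) > k * s] <- NA
  x
}

# IQ column from the question
iq <- c(105, 40, 70, 150, 148, 115, 110, 120, 99, 1000, 70)

# Largest absolute z-score in the column (just under 3 for these values)
max_z <- max(abs(iq - mean(iq)) / sd(iq))
max_z

na_outliers(iq, k = 3)  # unchanged: nothing is beyond 3 SD
na_outliers(iq, k = 2)  # the value 1000 becomes NA
```

Inside a dplyr pipeline, a helper like this could be applied per column with something along the lines of mutate(across(where(is.numeric), ~ na_outliers(.x, k = 2))).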