I have this data frame.
IQ | sleep | GRE | happiness |
---|---|---|---|
105 | 70 | 200 | 15 |
40 | 50 | 150 | 15 |
70 | 20 | 70 | 10 |
150 | 150 | 80 | 6 |
148 | 60 | 900 | 7 |
115 | 10 | 1200 | 40 |
110 | 90 | 15 | 5 |
120 | 40 | 60 | 12 |
99 | 30 | 70 | 15 |
1000 | 15 | 30 | 68 |
70 | 60 | 12 | 70 |
I would like to remove the outliers for each variable. I do not want to delete a whole row if one value is identified as an outlier. For example, let's say the outlier for IQ is 40: I just want to delete 40, not the whole row.
If I define outliers as any values that are > mean + 3*sd or < mean - 3*sd, what code can I use to do this? If I can achieve this using dplyr and subset, that would be great.
I would expect something like this
| IQ | sleep | GRE | happiness |
|---|---|---|---|
| 105 | 70 | 200 | 15 |
|  | 50 | 150 | 15 |
| 70 | 20 | 70 | 10 |
| 150 |  | 80 | 6 |
| 148 | 60 | 900 | 7 |
| 115 |  |  | 40 |
| 110 | 90 |  | 5 |
| 120 | 40 | 60 | 12 |
| 99 | 30 | 70 | 15 |
|  | 15 | 30 | 68 |
| 70 | 60 | 12 | 70 |
I have tried the remove_sd_outlier function (from the dataPreparation package), but it deleted entire rows of data, which I do not want.
CodePudding user response:
I think you could rephrase the nested ifelse() as case_when() for something easier to read, but hopefully this works for you.
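library(dplyr)

df %>%
  mutate(across(everything(),
    # values above mean + 3 SD or below mean - 3 SD become NA;
    # NA (rather than "") keeps the columns numeric
    ~ ifelse(. > (mean(.) + 3*sd(.)),
             NA,
             ifelse(. < (mean(.) - 3*sd(.)),
                    NA, .))))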
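The case_when() rewrite mentioned above could look roughly like the sketch below. It is only an illustration: the data frame is rebuilt from the values in the question, and `k` and `cleaned` are names made up for this example. Outliers are set to NA so the columns stay numeric.

```r
library(dplyr)

# Example data copied from the question
df <- data.frame(
  IQ        = c(105, 40, 70, 150, 148, 115, 110, 120, 99, 1000, 70),
  sleep     = c(70, 50, 20, 150, 60, 10, 90, 40, 30, 15, 60),
  GRE       = c(200, 150, 70, 80, 900, 1200, 15, 60, 70, 30, 12),
  happiness = c(15, 15, 10, 6, 7, 40, 5, 12, 15, 68, 70)
)

k <- 3  # number of standard deviations (hypothetical name, adjust as needed)

cleaned <- df %>%
  mutate(across(
    where(is.numeric),
    ~ case_when(
      .x > mean(.x) + k * sd(.x) ~ NA_real_,  # too high: blank out
      .x < mean(.x) - k * sd(.x) ~ NA_real_,  # too low: blank out
      TRUE ~ .x                               # otherwise keep the value
    )
  ))
```

Only individual cells are replaced, so `cleaned` keeps all 11 rows. With this particular sample and `k <- 3`, no value falls outside the bounds; lowering `k` makes the rule stricter.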
CodePudding user response:
You can use scale() to compute z-scores and across() to apply the check across all numeric variables. Note that none of your example values are > 3 SD from the mean, so I used 2 SD as the threshold for demonstration.
library(dplyr)

df1 %>%
  mutate(across(
    where(is.numeric),
    ~ ifelse(
      abs(as.numeric(scale(.x))) > 2,
      NA,
      .x
    )
  ))
# A tibble: 11 × 4
      IQ sleep   GRE happiness
   <dbl> <dbl> <dbl>     <dbl>
 1   105    70   200        15
 2    40    50   150        15
 3    70    20    70        10
 4   150    NA    80         6
 5   148    60   900         7
 6   115    10    NA        40
 7   110    90    15         5
 8   120    40    60        12
 9    99    30    70        15
10    NA    15    30        68
11    70    60    12        70
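If the threshold needs to change, or the columns may already contain missing values, the same idea can be wrapped in a small helper. This is a sketch under assumptions rather than a definitive version: `mask_sd_outliers` and its `k` argument are hypothetical names, and the data frame is rebuilt from the question's values.

```r
library(dplyr)

# Hypothetical helper: set values more than k SDs from the mean to NA.
# na.rm = TRUE lets the mean and SD be computed when x already has NAs.
mask_sd_outliers <- function(x, k = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  replace(x, !is.na(z) & abs(z) > k, NA)
}

# Example data copied from the question
df1 <- data.frame(
  IQ        = c(105, 40, 70, 150, 148, 115, 110, 120, 99, 1000, 70),
  sleep     = c(70, 50, 20, 150, 60, 10, 90, 40, 30, 15, 60),
  GRE       = c(200, 150, 70, 80, 900, 1200, 15, 60, 70, 30, 12),
  happiness = c(15, 15, 10, 6, 7, 40, 5, 12, 15, 68, 70)
)

# Apply the helper to every numeric column, here with a 2 SD cutoff
out <- df1 %>%
  mutate(across(where(is.numeric), ~ mask_sd_outliers(.x, k = 2)))
```

With `k = 2` this flags the same three cells as the output above (IQ 1000, sleep 150, GRE 1200). Calling it with the default `k = 3` gives the rule from the question.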