I want to automatically create an outlier column for each numeric variable. Each outlier column must sit immediately after the variable it describes, and its value must be either yes or no. Is it possible to automate this?
ID<-1:10
Weight<-c(65.1,70.3, 22, 45, 150,68.5,87.2,66.4,59.2,72.3)
Sex<-c("M","F","F","F","M","F","M","M","F","F")
Height<-c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
City<-head(LETTERS,10)
Income<- c(1200,2000,2100,2550,12000,800,3000,2400,1895,2300)
mydata2<-data.frame(ID,Weight,Sex,Height,City,Income)
I use the Outlier() function from the DescTools package to identify the outliers:
Outlier(mydata2$Weight)
[1] 22 150
Outlier(mydata2$Height)
[1] 1.30 1.10 2.65
Outlier(mydata2$Income)
[1] 12000 800
This is the expected dataset: Weight_outlier comes just after Weight, Height_outlier comes just after Height, and so on. I have dozens of numeric variables in my real dataset.
ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1 1 65.1 no M 1.30 yes A 1200 no
2 2 70.3 no F 1.65 no B 2000 no
3 3 22.0 yes F 1.75 no C 2100 no
4 4 45.0 no F 1.86 no D 2550 no
5 5 150.0 yes M 1.79 no E 12000 yes
6 6 68.5 no F 1.76 no F 800 yes
7 7 87.2 no M 1.10 yes G 3000 no
8 8 66.4 no M 2.65 yes H 2400 no
9 9 59.2 no F 1.75 no I 1895 no
10 10 72.3 no F 1.65 no J 2300 no
CodePudding user response:
You can use a for loop to build a copy of your dataframe column by column, inserting an _outlier column after each original column that is numeric.
library(DescTools)
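# Start from the ID column so that it does not get an outlier flag
mydata3 <- mydata2[, "ID", drop = FALSE]
for (cname in names(mydata2[, -1])) {
  # Copy the original column
  mydata3[[cname]] <- mydata2[[cname]]
  if (is.numeric(mydata2[[cname]])) {
    # Default every row to "no", then mark the outlier positions
    outliers <- rep("no", nrow(mydata2))
    # value = FALSE returns the positions of the outliers, not the values
    outliers[Outlier(mydata2[[cname]], value = FALSE)] <- "yes"
    # Keep missing values as NA rather than "no"
    outliers[is.na(mydata2[[cname]])] <- NA
    mydata3[[paste0(cname, "_outlier")]] <- outliers
  }
}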
mydata3
ID Weight Weight_outlier Sex Height Height_outlier City Income
1 1 65.1 no M 1.30 yes A 1200
2 2 70.3 no F 1.65 no B 2000
3 3 22.0 yes F 1.75 no C 2100
4 4 45.0 no F 1.86 no D 2550
5 5 150.0 yes M 1.79 no E 12000
6 6 68.5 no F 1.76 no F 800
7 7 87.2 no M 1.10 yes G 3000
8 8 66.4 no M 2.65 yes H 2400
9 9 59.2 no F 1.75 no I 1895
10 10 72.3 no F 1.65 no J 2300
Income_outlier
1 no
2 no
3 no
4 no
5 yes
6 yes
7 no
8 no
9 no
10 no
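The same idea can also be sketched without an explicit loop and without DescTools. This is only a sketch under stated assumptions: is_outlier_iqr() and add_outlier_flags() are hypothetical helper names, not part of any package, and the 1.5 * IQR rule used here is assumed to correspond to the default boxplot method of Outlier(). On the sample data it flags the same values as the Outlier() calls shown in the question.

```r
# Sample data from the question
mydata2 <- data.frame(
  ID     = 1:10,
  Weight = c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3),
  Sex    = c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F"),
  Height = c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65),
  City   = head(LETTERS, 10),
  Income = c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
)

# Hypothetical helper: TRUE where x lies more than 1.5 * IQR below the first
# quartile or above the third quartile (assumed equivalent to the boxplot rule)
is_outlier_iqr <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical helper: add a yes/no flag column right after each numeric column,
# leaving the columns named in `skip` unflagged
add_outlier_flags <- function(df, skip = "ID") {
  num_cols <- setdiff(names(df)[vapply(df, is.numeric, logical(1))], skip)

  # One character vector of "yes"/"no" per numeric column (NA stays NA)
  flags <- lapply(df[num_cols], function(x) ifelse(is_outlier_iqr(x), "yes", "no"))
  names(flags) <- paste0(num_cols, "_outlier")

  # Append all flag columns, then reorder so each flag follows its source column
  out <- cbind(df, as.data.frame(flags, stringsAsFactors = FALSE))
  ord <- unlist(lapply(names(df), function(n) {
    if (n %in% num_cols) c(n, paste0(n, "_outlier")) else n
  }))
  out[ord]
}

result <- add_outlier_flags(mydata2)
print(result)
```

Computing the column order once and indexing with it avoids growing the data frame inside a loop, which may matter with dozens of numeric variables.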
CodePudding user response:
mutate(across(...)) flags, as TRUE or FALSE, the values that match the outliers in each column; relocate() then moves each new column next to its source column, and reduce2() from purrr applies that relocation once per variable.
library(tidyverse)
# Get the names of the columns you wish to test
outlier_vars <- mydata2 %>% dplyr::select(Weight, Height, Income) %>% names()
mydata2 %>%
  # Flag the values that match the outliers in each column, naming the new columns <col>_outlier
  mutate(across(all_of(outlier_vars), ~ .x %in% DescTools::Outlier(.x), .names = "{.col}_outlier")) %>%
  # Use reduce2 to relocate each _outlier column right after its source column
  reduce2(
    .x = outlier_vars,
    .y = paste0(outlier_vars, "_outlier"),
    .f = ~ relocate(..1, ..3, .after = ..2),
    .init = .
  )
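Since the question asks for yes or no rather than TRUE or FALSE, here is a hedged variant of the same pipeline. It is a sketch, assuming dplyr, purrr and DescTools are installed; the names flagged and result are only illustrative. It wraps the test in if_else() and uses reduce() with an ordinary function instead of reduce2().

```r
library(dplyr)
library(purrr)

# Sample data from the question
mydata2 <- data.frame(
  ID     = 1:10,
  Weight = c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3),
  Sex    = c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F"),
  Height = c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65),
  City   = head(LETTERS, 10),
  Income = c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
)

# Columns to test for outliers
outlier_vars <- c("Weight", "Height", "Income")

# Add one "yes"/"no" column per tested variable (appended at the end for now)
flagged <- mydata2 %>%
  mutate(across(
    all_of(outlier_vars),
    ~ if_else(.x %in% DescTools::Outlier(.x), "yes", "no"),
    .names = "{.col}_outlier"
  ))

# Move each flag column so that it sits right after its source column
result <- reduce(
  outlier_vars,
  function(df, v) relocate(df, all_of(paste0(v, "_outlier")), .after = all_of(v)),
  .init = flagged
)

print(result)
```

To cover dozens of numeric variables without listing them, outlier_vars could instead be built from the numeric column names, for example with names(select(mydata2, where(is.numeric), -ID)).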