How to automatically create columns to identify outliers for each numeric variable

I want to automatically create a column that flags the outliers of each numeric variable. Each outlier column must be placed immediately after the variable it refers to, and its values must be either yes or no. Is it possible to automate this?

    ID <- 1:10
    Weight <- c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3)
    Sex <- c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F")
    Height <- c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65)
    City <- head(LETTERS, 10)
    Income <- c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)

    mydata2 <- data.frame(ID, Weight, Sex, Height, City, Income)

I use the Outlier function from {DescTools} to identify the outliers:

    Outlier(mydata2$Weight)
    [1]  22 150

    Outlier(mydata2$Height)
    [1] 1.30 1.10 2.65

    Outlier(mydata2$Income)
    [1] 12000   800

Each _outlier column should come just after its variable: Weight_outlier after Weight, Height_outlier after Height, and so on. I have dozens of numeric variables in my real dataset.

This is the expected dataset:

   ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1   1   65.1             no   M   1.30            yes    A   1200             no
2   2   70.3             no   F   1.65             no    B   2000             no
3   3   22.0            yes   F   1.75             no    C   2100             no
4   4   45.0             no   F   1.86             no    D   2550             no
5   5  150.0            yes   M   1.79             no    E  12000            yes
6   6   68.5             no   F   1.76             no    F    800            yes
7   7   87.2             no   M   1.10            yes    G   3000             no
8   8   66.4             no   M   2.65            yes    H   2400             no
9   9   59.2             no   F   1.75             no    I   1895             no
10 10   72.3             no   F   1.65             no    J   2300             no

CodePudding user response:

You can use a for loop to build a copy of your data frame column by column, inserting an _outlier column right after each original column that is numeric.

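library(DescTools)

# Start the copy with the ID column only
mydata3 <- mydata2[, "ID", drop = FALSE]

# Walk through the remaining columns in their original order
for (cname in names(mydata2[, -1])) {
  mydata3[[cname]] <- mydata2[[cname]]
  if (is.numeric(mydata2[[cname]])) {
    # Default every row to "no", then mark the outlier positions as "yes";
    # value = FALSE makes Outlier() return indices instead of values
    outliers <- rep("no", nrow(mydata2))
    outliers[Outlier(mydata2[[cname]], value = FALSE)] <- "yes"
    # Keep missing values as NA rather than labelling them "no"
    outliers[is.na(mydata2[[cname]])] <- NA
    mydata3[[paste0(cname, "_outlier")]] <- outliers
  }
}

mydata3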


   ID Weight Weight_outlier Sex Height Height_outlier City Income Income_outlier
1   1   65.1             no   M   1.30            yes    A   1200             no
2   2   70.3             no   F   1.65             no    B   2000             no
3   3   22.0            yes   F   1.75             no    C   2100             no
4   4   45.0             no   F   1.86             no    D   2550             no
5   5  150.0            yes   M   1.79             no    E  12000            yes
6   6   68.5             no   F   1.76             no    F    800            yes
7   7   87.2             no   M   1.10            yes    G   3000             no
8   8   66.4             no   M   2.65            yes    H   2400             no
9   9   59.2             no   F   1.75             no    I   1895             no
10 10   72.3             no   F   1.65             no    J   2300             no
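
Since the real dataset has dozens of numeric variables, the loop could also be wrapped in a function. The following is only a sketch: `is_boxplot_outlier()` and `add_outlier_flags()` are hypothetical names, and the 1.5 * IQR boxplot rule is used as a base-R stand-in on the assumption that it matches the default method of Outlier(). Swap the Outlier() call back in if you need its exact behaviour.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR boxplot fences.
# Assumption: stands in for DescTools::Outlier() so the sketch runs in base R.
is_boxplot_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * diff(q)
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical wrapper around the loop: returns a copy of `df` with a
# "<name>_outlier" column ("yes"/"no", NA where the value is missing)
# right after every numeric column that is not listed in `skip`.
add_outlier_flags <- function(df, skip = "ID") {
  out <- df[, 0, drop = FALSE]  # zero-column frame with the same rows
  for (cname in names(df)) {
    out[[cname]] <- df[[cname]]
    if (is.numeric(df[[cname]]) && !(cname %in% skip)) {
      flag <- ifelse(is_boxplot_outlier(df[[cname]]), "yes", "no")
      out[[paste0(cname, "_outlier")]] <- flag
    }
  }
  out
}

# Sample data from the question
mydata2 <- data.frame(
  ID = 1:10,
  Weight = c(65.1, 70.3, 22, 45, 150, 68.5, 87.2, 66.4, 59.2, 72.3),
  Sex = c("M", "F", "F", "F", "M", "F", "M", "M", "F", "F"),
  Height = c(1.3, 1.65, 1.75, 1.86, 1.79, 1.76, 1.1, 2.65, 1.75, 1.65),
  City = head(LETTERS, 10),
  Income = c(1200, 2000, 2100, 2550, 12000, 800, 3000, 2400, 1895, 2300)
)

flagged <- add_outlier_flags(mydata2)
names(flagged)
```

The `skip` argument keeps identifier columns such as ID from being tested, which the loop above achieves by starting the copy from the ID column.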

CodePudding user response:

mutate(across(...)) flags the values that match the outliers in each column. relocate(), applied once per variable through reduce2() from purrr, then moves each new column to directly after its source column.

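library(tidyverse)

# Get the columns you wish to test: here, every numeric column except ID
outlier_vars <- mydata2 %>% select(where(is.numeric), -ID) %>% names()

mydata2 %>%
  # Flag values that match the outliers in each column as "yes"/"no",
  # naming the new columns <name>_outlier
  mutate(across(all_of(outlier_vars),
                ~ if_else(.x %in% DescTools::Outlier(.x), "yes", "no"),
                .names = "{.col}_outlier")) %>%
  # Use reduce2 with relocate to move each flag column to just after its
  # source column; ..1 is the data frame, ..2 the variable, ..3 its flag
  reduce2(
    .x = outlier_vars,
    .y = paste0(outlier_vars, "_outlier"),
    .f = ~ relocate(..1, all_of(..3), .after = all_of(..2)),
    .init = .
  )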
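
The reordering step can also be done without reduce2() by building the target column order up front and indexing the data frame with it. This is a minimal base-R sketch, assuming the flag columns have already been appended at the end; the three-row frame below reuses rows 1, 3 and 5 of the question's data, and `flag_vars` and `new_order` are illustrative names.

```r
# Small frame with the flag columns already appended at the end,
# which is how mutate(across(...)) leaves them.
df <- data.frame(
  ID = c(1L, 3L, 5L),
  Weight = c(65.1, 22, 150),
  Sex = c("M", "F", "M"),
  Height = c(1.30, 1.75, 1.79),
  Weight_outlier = c("no", "yes", "yes"),
  Height_outlier = c("yes", "no", "no")
)

flag_vars <- c("Weight", "Height")
flag_cols <- paste0(flag_vars, "_outlier")

# Walk over the original columns and emit each flagged variable
# followed immediately by its "_outlier" companion.
new_order <- unlist(lapply(
  setdiff(names(df), flag_cols),
  function(n) if (n %in% flag_vars) c(n, paste0(n, "_outlier")) else n
))

df <- df[new_order]
names(df)
```

Indexing by a precomputed name vector reorders all columns in one step, which may be easier to follow than repeated relocate() calls when there are many variables.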