Can I remove outliers from all columns in my dataframe R?

Time:09-30

I have a data frame with 431 variables and 140 observations, and I need to remove outliers. However, the dataset has several NA values, and I do not want to drop every row that contains an NA. I am trying to remove the outliers with the IQR method, and so far I have been able to obtain the quartiles and the IQR as follows:

data <- df2[,4:434]
apply(data,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(data,IQR, na.rm=TRUE) -> iqr

I've also calculated the lower and upper values for each of my columns:

Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,] + 1.5*iqr

However, when I tried to replace the outliers with NA, my data frame did not change:

data_no_outlier <- replace(data, data[1:431] < Lower  & data[1:431] > Upper, NA)

I also tried this script on the iris data, with the same unsuccessful result:

data(iris, package = "datasets")
completeData <- iris[-5]
apply(completeData,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(completeData,IQR, na.rm=TRUE) -> iqr

Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,] + 1.5*iqr

data_no_outlier <- replace(completeData, completeData < Lower & completeData > Upper, NA)

Is there any way to filter outliers out of my data without manually selecting all the columns by name?

CodePudding user response:

Here's one method:

fun <- function(z, fac = 1.5, na.rm = TRUE) {
  Q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)
  R <- IQR(z, na.rm = na.rm)
  z[z < Q[1] - fac * R | z > Q[2] + fac * R] <- NA
  z
}
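
Since the question is specifically about data that already contains NAs, here is a small sketch of how a function of this shape treats them. The helper name `iqr_na` and the toy vector are made up for illustration; the helper is meant to mirror the function above, not replace it. Existing NAs stay where they are, no element is dropped, and `fac` controls how wide the fences are.

```r
# Hypothetical stand-alone copy of the idea above, for illustration only.
iqr_na <- function(z, fac = 1.5, na.rm = TRUE) {
  q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)  # quartiles, ignoring NAs
  r <- q[2] - q[1]                                # interquartile range
  z[z < q[1] - fac * r | z > q[2] + fac * r] <- NA
  z
}

x <- c(1, 2, NA, 3, 4, 100)   # toy data with one NA and one extreme value

res_default <- iqr_na(x)            # fences at 1.5 * IQR
res_wide    <- iqr_na(x, fac = 50)  # very wide fences, nothing is flagged

res_default
# [1]  1  2 NA  3  4 NA
res_wide
# [1]   1   2  NA   3   4 100
```

The result always has the same length as the input, which is what lets it be assigned back into a data frame column without touching the other columns.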

Sample data:

set.seed(42)
quux <- data.frame(ltr = letters[1:10], num1 = c(99, runif(9)), num2 = c(runif(9), 99))
quux
#    ltr       num1       num2
# 1    a 99.0000000  0.7050648
# 2    b  0.9148060  0.4577418
# 3    c  0.9370754  0.7191123
# 4    d  0.2861395  0.9346722
# 5    e  0.8304476  0.2554288
# 6    f  0.6417455  0.4622928
# 7    g  0.5190959  0.9400145
# 8    h  0.7365883  0.9782264
# 9    i  0.1346666  0.1174874
# 10   j  0.6569923 99.0000000

dplyr

library(dplyr)
quux %>%
  mutate(across(where(is.numeric), fun))
#    ltr      num1      num2
# 1    a        NA 0.7050648
# 2    b 0.9148060 0.4577418
# 3    c 0.9370754 0.7191123
# 4    d 0.2861395 0.9346722
# 5    e 0.8304476 0.2554288
# 6    f 0.6417455 0.4622928
# 7    g 0.5190959 0.9400145
# 8    h 0.7365883 0.9782264
# 9    i 0.1346666 0.1174874
# 10   j 0.6569923        NA

base R

isnum <- sapply(quux, is.numeric)
quux[isnum] <- lapply(quux[isnum], fun)
quux
#    ltr      num1      num2
# 1    a        NA 0.7050648
# 2    b 0.9148060 0.4577418
# 3    c 0.9370754 0.7191123
# 4    d 0.2861395 0.9346722
# 5    e 0.8304476 0.2554288
# 6    f 0.6417455 0.4622928
# 7    g 0.5190959 0.9400145
# 8    h 0.7365883 0.9782264
# 9    i 0.1346666 0.1174874
# 10   j 0.6569923        NA
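
A note on why the attempt in the question leaves the data unchanged: a value cannot be below the lower bound and above the upper bound at the same time, so a condition joined with `&` is never TRUE and `replace()` has nothing to replace. In addition, as far as I can tell, comparing a data frame with a plain vector recycles the vector down the rows instead of matching one bound to each column. The following is a minimal sketch of a vectorised variant of the question's approach, assuming the iris data used there; `sweep()` lines each column up with its own bound, and the names `m`, `is_out` and `dat_no_outlier` are illustrative only.

```r
# Sketch: a vectorised variant of the approach in the question, on iris.
dat <- iris[-5]                 # keep the four numeric columns
m <- as.matrix(dat)

q <- apply(m, 2, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2, ] - q[1, ]          # per-column IQR (default quantile type)
lower <- q[1, ] - 1.5 * iqr
upper <- q[2, ] + 1.5 * iqr

# sweep() compares every column against its own bound;
# `|` flags values below the lower OR above the upper fence.
is_out <- sweep(m, 2, lower, "<") | sweep(m, 2, upper, ">")
is_out[is.na(is_out)] <- FALSE  # existing NAs are simply left alone

dat_no_outlier <- dat
dat_no_outlier[is_out] <- NA    # logical-matrix assignment, no rows dropped

colSums(is.na(dat_no_outlier))  # number of values replaced per column
```

This works on the whole numeric block at once, so no column has to be selected by name; the per-column function above is arguably easier to read and to reuse.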