I have a data frame with 431 variables and 140 observations, and I need to remove outliers. However, the dataset has several NA values, and I do not want to drop every row that contains an NA. I am trying to remove the outliers with the IQR method, and so far I have been able to obtain the quartiles and the IQR with the following:
data <- df2[,4:434]
apply(data,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(data,IQR, na.rm=TRUE) -> iqr
I've also calculated the lower and upper values for each of my columns:
Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,] + 1.5*iqr
However, when I tried to replace the outliers with NAs, nothing changed in my data frame:
data_no_outlier <- replace(data, data[1:431] < Lower & data[1:431] > Upper, NA)
I also tried this script on the iris data, with the same unsuccessful result:
data(iris, package = "datasets")
completeData <- iris[-5]
apply(completeData,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(completeData,IQR, na.rm=TRUE) -> iqr
Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,] + 1.5*iqr
data_no_outlier <- replace(completeData, completeData < Lower & completeData > Upper, NA)
Is there any way to filter the outliers out of my data without having to select all the columns manually by name?
Answer:
Here's one method:
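fun <- function(z, fac = 1.5, na.rm = TRUE) {
  # quartiles and interquartile range of one column, ignoring NAs
  Q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)
  R <- IQR(z, na.rm = na.rm)
  # set anything below the lower fence OR above the upper fence to NA
  z[z < Q[1] - fac * R | z > Q[2] + fac * R] <- NA
  z
}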
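As for why the attempt in the question changed nothing: a value cannot be below `Lower` and above `Upper` at the same time, so a condition joined with `&` never matches. There is also a second pitfall: comparing a data frame with a plain vector of bounds recycles that vector down the rows rather than pairing each bound with its column. A minimal sketch of one way to repair the vectorised version on iris, assuming a numeric matrix is acceptable (`too_low` and `too_high` are names made up for this illustration):

```r
# Sketch: the vectorised attempt from the question, repaired on iris.
m <- as.matrix(iris[-5])                       # numeric columns only
Quartiles <- apply(m, 2, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- apply(m, 2, IQR, na.rm = TRUE)
Lower <- Quartiles[1, ] - 1.5 * iqr
Upper <- Quartiles[2, ] + 1.5 * iqr

# sweep() compares every column with its own bound instead of
# recycling the bounds down the rows.
too_low  <- sweep(m, 2, Lower, "<")
too_high <- sweep(m, 2, Upper, ">")

sum(too_low & too_high, na.rm = TRUE)  # "&" can never match, so this is 0
m[which(too_low | too_high)] <- NA     # "|" flags values outside either fence
colSums(is.na(m))                      # number of values replaced per column
```

The per-column function above avoids both problems, because each column is compared only with its own quartiles.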
Sample data:
set.seed(42)
quux <- data.frame(ltr = letters[1:10], num1 = c(99, runif(9)), num2 = c(runif(9), 99))
quux
#    ltr       num1       num2
# 1    a 99.0000000  0.7050648
# 2    b  0.9148060  0.4577418
# 3    c  0.9370754  0.7191123
# 4    d  0.2861395  0.9346722
# 5    e  0.8304476  0.2554288
# 6    f  0.6417455  0.4622928
# 7    g  0.5190959  0.9400145
# 8    h  0.7365883  0.9782264
# 9    i  0.1346666  0.1174874
# 10   j  0.6569923 99.0000000
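To see why only the 99 gets flagged, the fences can be computed by hand for one column. This is just a sanity-check sketch; the values are copied from the `num1` column printed above, and `fences` and `flagged` are names chosen for this illustration.

```r
# num1 values as printed in the sample data above
num1 <- c(99, 0.9148060, 0.9370754, 0.2861395, 0.8304476,
          0.6417455, 0.5190959, 0.7365883, 0.1346666, 0.6569923)

Q <- quantile(num1, c(0.25, 0.75))   # default quantile algorithm (type 7)
R <- IQR(num1)

# lower and upper fences at 1.5 * IQR beyond the quartiles
fences <- c(lower = unname(Q[1] - 1.5 * R),
            upper = unname(Q[2] + 1.5 * R))
fences

# values outside either fence
flagged <- num1[num1 < fences["lower"] | num1 > fences["upper"]]
flagged
```

All the `runif()` values sit between the two fences, so the 99 is the only value that `fun` would turn into NA.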
dplyr
library(dplyr)
quux %>%
  mutate(across(where(is.numeric), fun))
#    ltr      num1      num2
# 1    a        NA 0.7050648
# 2    b 0.9148060 0.4577418
# 3    c 0.9370754 0.7191123
# 4    d 0.2861395 0.9346722
# 5    e 0.8304476 0.2554288
# 6    f 0.6417455 0.4622928
# 7    g 0.5190959 0.9400145
# 8    h 0.7365883 0.9782264
# 9    i 0.1346666 0.1174874
# 10   j 0.6569923        NA
base R
isnum <- sapply(quux, is.numeric)
quux[isnum] <- lapply(quux[isnum], fun)
quux
#    ltr      num1      num2
# 1    a        NA 0.7050648
# 2    b 0.9148060 0.4577418
# 3    c 0.9370754 0.7191123
# 4    d 0.2861395 0.9346722
# 5    e 0.8304476 0.2554288
# 6    f 0.6417455 0.4622928
# 7    g 0.5190959 0.9400145
# 8    h 0.7365883 0.9782264
# 9    i 0.1346666 0.1174874
# 10   j 0.6569923        NA
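If you would rather pick the columns by position, as the question does with `df2[,4:434]`, the same base R idea works with an index vector. The sketch below is only an illustration: the toy data frame `df`, its column names, and the helper name `na_outliers` are made up here, and the helper simply restates the function above so that the block runs on its own.

```r
# Hypothetical helper, equivalent in spirit to fun() above.
na_outliers <- function(z, fac = 1.5) {
  Q <- quantile(z, c(0.25, 0.75), na.rm = TRUE)
  R <- Q[2] - Q[1]                      # interquartile range
  # which() skips comparisons that are NA, so existing NAs stay as they are
  z[which(z < Q[1] - fac * R | z > Q[2] + fac * R)] <- NA
  z
}

# Toy data: three leading non-measurement columns, then the measurements,
# which already contain some NAs.
df <- data.frame(id = 1:6, grp = letters[1:6], site = "x",
                 v1 = c(1, NA, 2, 3, 4, 100),
                 v2 = c(10, 11, NA, 12, 13, 14))

cols <- 4:5   # for the data in the question this would be 4:434 on df2
df[cols] <- lapply(df[cols], na_outliers)
df
```

No rows are dropped: the pre-existing NAs are left in place, and only the 100 in `v1` is replaced.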