I am a new R user, and need to use the software for my first job. I tried looking for a similar issue to mine on the website, but haven't found one. Apologies if my question is redundant.
The problem I have is that I need to replace outliers in every column. A reproducible example is below:
data_X <- matrix(data = rep(1,100), nrow = 10, ncol = 10)
for (i in 1:nrow(data_X)) {
  for (j in 1:ncol(data_X)) {
    if (is.na(data_X[i,j])) {
      data_X[i,j] <- NA
    } else if (data_X[i,j]>(quantile(data_X[,j], 0.75, na.rm=T)+1.5*(quantile(data_X[,j], 0.75,na.rm=T)-quantile(data_X[,j], 0.25,na.rm=T)))) {
      data_X[i,j]=(quantile(data_X[,j], 0.5, na.rm=T))
    } else if (data_X[i,j]<(quantile(data_X[,j], 0.25, na.rm=T)-1.5*(quantile(data_X[,j], 0.75, na.rm=T)-quantile(data_X[,j], 0.25, na.rm=T)))) {
      data_X[i,j]=(quantile(data_X[,j], 0.5, na.rm=T))
    } else {
      data_X[i,j]=data_X[i,j]
    }
  }
}
In reality, the matrix is of a much larger dimension, and it takes about 4 minutes to loop through the code. This is way too long for my purposes, and I wonder if there is a more elegant way.
I have done some research, and apparently apply() would not improve speed...
Edit:
Rules:
Data points above the 75% quantile + 1.5 * the interquartile spread;
and
data points below the 25% quantile - 1.5 * the interquartile spread;
are converted to the median.
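To make the rule concrete, here is a small worked sketch on a single made-up vector. The vector `x` and the names `lower`, `upper`, `is_outlier` and `x_clean` are illustrative only and not part of the original code; the quantiles use R's default `quantile()` method (type 7).

```r
# Made-up example vector with one obvious high outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# 25% and 75% quantiles (default type 7), without names for easier arithmetic
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
iqr <- q[2] - q[1]            # interquartile spread: 7.75 - 3.25 = 4.5

lower <- q[1] - 1.5 * iqr     # 3.25 - 6.75 = -3.5
upper <- q[2] + 1.5 * iqr     # 7.75 + 6.75 = 14.5

# Flag values outside the fences; NAs are never flagged
is_outlier <- !is.na(x) & (x < lower | x > upper)

# Replace flagged values with the median (5.5 for this vector)
x_clean <- replace(x, is_outlier, median(x, na.rm = TRUE))
```

Only the last element (100) falls outside the fences here, so it is the only value replaced by the median.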
CodePudding user response:
1. We create a rule function that makes use of the vectorized ifelse.
rule_function <- function(x) {
  q25 <- quantile(x, 0.25, na.rm = TRUE)
  q75 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q75 - q25
  lower <- q25 - 1.5 * iqr
  upper <- q75 + 1.5 * iqr
  result <- ifelse(x < lower | x > upper, median(x, na.rm = TRUE), x)
  return(result)
}
2. Then we apply the function to each column of the matrix:
apply(data_X, 2, rule_function)
The example data contains only 1s, so it has no outliers and doesn't really allow testing; I am not 100% sure whether this helps you or not. However, this took only a few seconds for a 10000 x 10000 matrix (whether that is fast enough depends on your actual use case).
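Since the all-ones matrix cannot exercise the rule, the following is a minimal sketch of how one might check the column-wise approach on made-up data with injected outliers. The helper `replace_outliers`, the matrix `test_data` and its dimensions are hypothetical choices for illustration, not taken from the question; the helper applies the same rule as above but uses logical indexing instead of `ifelse`.

```r
# Hypothetical helper: same rule as described above, written with logical indexing
replace_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  # Flag values outside the 1.5 * IQR fences; NAs are left untouched
  out <- !is.na(x) & (x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
  x[out] <- median(x, na.rm = TRUE)
  x
}

# Made-up test data: random normal values, assumed dimensions 1000 x 50
set.seed(42)
test_data <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)
test_data[1, ] <- 1000   # inject an obvious high outlier into every column
test_data[2, 1] <- NA    # one missing value that should stay missing

# Apply the rule to each column and record how long it takes
timing <- system.time(cleaned <- apply(test_data, 2, replace_outliers))
```

After running this, `cleaned[1, ]` should hold the column medians instead of 1000, `cleaned[2, 1]` should still be `NA`, and `timing` gives a rough idea of how the approach scales on your machine.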