How to avoid looping over rows and columns to increase speed in R


I am a new R user, and need to use the software for my first job. I tried looking for a similar issue to mine on the website, but haven't found one. Apologies if my question is redundant.

The problem I have is that I need to edit outliers in every column. A reproducible example is below:

    data_x <- matrix(data = rep(1, 100), nrow = 10, ncol = 10)

    for (i in 1:nrow(data_x)) {
      for (j in 1:ncol(data_x)) {
        if (is.na(data_x[i, j])) {
          data_x[i, j] <- NA
        } else if (data_x[i, j] > (quantile(data_x[, j], 0.75, na.rm = TRUE) + 1.5 * (quantile(data_x[, j], 0.75, na.rm = TRUE) - quantile(data_x[, j], 0.25, na.rm = TRUE)))) {
          data_x[i, j] <- quantile(data_x[, j], 0.5, na.rm = TRUE)
        } else if (data_x[i, j] < (quantile(data_x[, j], 0.25, na.rm = TRUE) - 1.5 * (quantile(data_x[, j], 0.75, na.rm = TRUE) - quantile(data_x[, j], 0.25, na.rm = TRUE)))) {
          data_x[i, j] <- quantile(data_x[, j], 0.5, na.rm = TRUE)
        }
      }
    }

In reality, the matrix is of a much larger dimension, and it takes about 4 minutes to loop through the code. This is way too long for my purposes, and I wonder if there is a more elegant way.

I have done some research, and apparently apply() would not improve speed...

Edit:

Rules:

Datapoints above the 75% quantile + 1.5 * the interquartile range,

and

Datapoints below the 25% quantile - 1.5 * the interquartile range,

are converted to the median.
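For a single numeric vector, the fences implied by these rules work out as follows (the vector `x` here is made-up example data, not from the question):

```r
x <- c(1, 2, 3, 4, 100)    # example data with one obvious outlier

q25 <- quantile(x, 0.25, na.rm = TRUE)
q75 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q75 - q25           # interquartile range

lower <- q25 - 1.5 * iqr   # lower fence
upper <- q75 + 1.5 * iqr   # upper fence

# values outside the fences are replaced by the median
x[x < lower | x > upper] <- median(x, na.rm = TRUE)
# x is now 1 2 3 4 3
```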

CodePudding user response:

1. We create a rule function that makes use of the vectorized `ifelse()`:

    rule_function <- function(x) {

      q25 <- quantile(x, 0.25, na.rm = TRUE)
      q75 <- quantile(x, 0.75, na.rm = TRUE)
      iqr <- q75 - q25
      lower <- q25 - 1.5 * iqr
      upper <- q75 + 1.5 * iqr

      result <- ifelse(x < lower | x > upper, median(x, na.rm = TRUE), x)

      return(result)
    }

2. And then we apply the function to each column of the matrix:

    apply(data_x, 2, rule_function)

The example data doesn't really allow testing (every value is 1), so I am not 100% sure whether this helps you or not. However, this took only a few seconds for a 10,000 x 10,000 matrix (whether that is good enough depends on your actual use case ;)
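If even the column-wise `apply()` turns out to be too slow, the per-column quantiles can all be computed in one call with the `matrixStats` package. This is only a sketch under the assumption that `matrixStats` is installed; the helper name `winsorize_to_median` is made up for illustration:

```r
library(matrixStats)

winsorize_to_median <- function(m) {
  # one row per column of m, columns = the 25%, 50%, 75% quantiles
  q <- colQuantiles(m, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
  iqr   <- q[, 3] - q[, 1]
  lower <- q[, 1] - 1.5 * iqr
  upper <- q[, 3] + 1.5 * iqr

  # sweep() compares every column against its own fence,
  # giving a logical matrix of outlier positions
  out <- sweep(m, 2, lower, "<") | sweep(m, 2, upper, ">")
  out[is.na(out)] <- FALSE                     # leave NAs untouched

  # replace each outlier with its column's median
  m[out] <- rep(q[, 2], each = nrow(m))[out]
  m
}

m <- cbind(c(1, 2, 3, 4, 100), c(1, 2, 3, 4, 5))
winsorize_to_median(m)                         # 100 becomes the column median, 3
```

Because everything runs as whole-matrix operations, there is no R-level loop at all, which should scale better than a per-element double loop.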
