Home > Software design >  Efficiently binarize a dataframe by whether a row is above some percentile of its column
Efficiently binarize a dataframe by whether a row is above some percentile of its column

Time:10-03

Trying to write a function that transforms a dataframe by high pass filtering each row entry by some percentile of the column values. The function is written for single cell RNA-sequencing data but in principle anything works. Transposes it at the end because it makes some downstream code cleaner.

topquantile.binarize <- function(scRNAseq_data, percentile){ 
  # takes in data that is gene by cell
  # returns dataframe of cell by gene
  # calculates quantile for each gene
  # if a gene in a cell is in the top 90th quantile 
  # that gene is accepted  
  for (i in c(1:dim(scRNAseq_data)[1])){
    filter_value <- quantile(scRNAseq_data[i,], percentile)
    filter_value <- as.numeric(filter_value)
    high_pass <- function(x) {
      if (x > filter_value) {
        x <- 1
      } else {
        x <- 0
      }
      return(x)
    }
    scRNAseq_data[i, ] <- apply(scRNAseq_data[i, ], 2, high_pass)
  }
  return(t(scRNAseq_data))
}

EXAMPLE DATA

library(tictoc)
tic()
set.seed(42)
scRNAseq_data <- data.frame(matrix(rnorm(1000*100, mean=0, sd=1), 1000, 100))
res <- topquantile.binarize(scRNAseq_data, 0.9)
toc()

You will notice that even at 100 columns each with 1000 rows its running pretty slow, using tictoc you'll see it takes around 4 seconds (possibly a little more to do that.
I realize that technically the function does more than just look for values in the top quantile but whatever.

CodePudding user response:

Use matrixStats::rowQuantiles and exploit the vectorization of the R language. Runs in the blink of an eye.

res1 <- t( (scRNAseq_data > matrixStats::rowQuantiles(as.matrix(scRNAseq_data), probs=.9)))

stopifnot(all.equal(res, res1))

MatrixGenerics::rowQuantiles from bioconductor might also work.

  • Related