Trying to write a function that transforms a dataframe by high pass filtering each row entry by some percentile of the column values. The function is written for single cell RNA-sequencing data but in principle anything works. Transposes it at the end because it makes some downstream code cleaner.
topquantile.binarize <- function(scRNAseq_data, percentile){
# takes in data that is gene by cell
# returns dataframe of cell by gene
# calculates quantile for each gene
# if a gene in a cell is in the top 90th quantile
# that gene is accepted
for (i in c(1:dim(scRNAseq_data)[1])){
filter_value <- quantile(scRNAseq_data[i,], percentile)
filter_value <- as.numeric(filter_value)
high_pass <- function(x) {
if (x > filter_value) {
x <- 1
} else {
x <- 0
}
return(x)
}
scRNAseq_data[i, ] <- apply(scRNAseq_data[i, ], 2, high_pass)
}
return(t(scRNAseq_data))
}
EXAMPLE DATA
library(tictoc)
tic()
set.seed(42)
scRNAseq_data <- data.frame(matrix(rnorm(1000*100, mean=0, sd=1), 1000, 100))
res <- topquantile.binarize(scRNAseq_data, 0.9)
toc()
You will notice that even at 100 columns each with 1000 rows its running pretty slow, using tictoc
you'll see it takes around 4 seconds (possibly a little more to do that.
I realize that technically the function does more than just look for values in the top quantile but whatever.
CodePudding user response:
Use matrixStats::rowQuantiles
and exploit the vectorization of the R language. Runs in the blink of an eye.
res1 <- t( (scRNAseq_data > matrixStats::rowQuantiles(as.matrix(scRNAseq_data), probs=.9)))
stopifnot(all.equal(res, res1))
MatrixGenerics::rowQuantiles
from bioconductor might also work.