Evaluate a dataframe based on more than one condition in R

I have a dataframe counts (60,660 x 1246):

               sample1          sample2          sample3          sample4          sample5
gene1       1615.75292       663.200093       2406.15320        836.38076        1217.8192
gene2         41.93247         8.602831         12.62244         60.14423          22.7755
gene3        697.97280      1198.139790       1033.46252        259.37201         695.9924
gene4        678.35922      1114.457703       1281.96687        466.11782        1265.3798
gene5        365.21832       726.548215        781.80257        268.76955         476.9457

I'm trying to find the list of genes per sample that fit within a certain threshold. For example, in order to find the genes that have a value greater than 1001, I can use counts > 1001 which gives me a TRUE/FALSE matrix:

               sample1          sample2          sample3          sample4          sample5
gene1             TRUE            FALSE             TRUE            FALSE             TRUE
gene2            FALSE            FALSE            FALSE            FALSE            FALSE
gene3            FALSE             TRUE             TRUE            FALSE            FALSE
gene4            FALSE             TRUE             TRUE            FALSE             TRUE
gene5            FALSE            FALSE            FALSE            FALSE            FALSE

I then pass this matrix to apply(true_false_matrix, 2, which) %>% lapply(\(x) names(x)) to get a list of the genes per sample that have a value greater than 1001. I would also like to find genes whose value falls within a certain range. For example, I tried:

1 < counts && counts < 5

But all I got was a single value of FALSE.

I know that there are genes meeting this requirement so I think I'm going about finding them in the wrong way. Is there a way to get a TRUE/FALSE matrix from my initial dataframe but with 2 conditions?

CodePudding user response：

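First, you need & instead of &&. The single & compares element by element and returns a matrix, whereas && expects a single TRUE/FALSE on each side. Using an upper bound of 1000 here so that the sample data has matches: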

1 < counts & counts < 1000
#       sample1 sample2 sample3 sample4 sample5
# gene1   FALSE    TRUE   FALSE    TRUE   FALSE
# gene2    TRUE    TRUE    TRUE    TRUE    TRUE
# gene3    TRUE   FALSE   FALSE    TRUE    TRUE
# gene4    TRUE   FALSE   FALSE    TRUE   FALSE
# gene5    TRUE    TRUE    TRUE    TRUE    TRUE
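
To make the difference between the two operators concrete, here is a minimal sketch. It assumes a toy 3 x 3 slice of the question's data (the first three genes and samples); the variable names in_range and x are only for illustration. As far as I know, R 4.3.0 and later raise an error when && receives an argument longer than one, while older versions used only the first element, which would explain the single FALSE in the question.

```r
# Toy data (assumption): the first three genes and samples from the question.
counts <- data.frame(
  sample1 = c(1615.75292, 41.93247, 697.97280),
  sample2 = c(663.200093, 8.602831, 1198.139790),
  sample3 = c(2406.15320, 12.62244, 1033.46252),
  row.names = c("gene1", "gene2", "gene3")
)

# `&` works element by element, so the result has the same shape as counts.
in_range <- 1 < counts & counts < 1000
dim(in_range)
# [1] 3 3

# Every gene2 value lies between 1 and 1000, so this row is all TRUE.
in_range["gene2", ]

# `&&` is meant for single TRUE/FALSE values, for example inside if().
x <- 42
1 < x && x < 1000
# [1] TRUE
```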

Second, you can call which directly on the logical matrix; adding arr.ind = TRUE returns the row and column indices of every TRUE cell:

ind <- which(1 < counts & counts < 1000, arr.ind = TRUE)
ind
#       row col
# gene2   2   1
# gene3   3   1
# gene4   4   1
# gene5   5   1
# gene1   1   2
# gene2   2   2
# gene5   5   2
# gene2   2   3
# gene5   5   3
# gene1   1   4
# gene2   2   4
# gene3   3   4
# gene4   4   4
# gene5   5   4
# gene2   2   5
# gene3   3   5
# gene5   5   5
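
If you also want the matching values themselves, the two-column index matrix can be used to subset the data directly. The following is a sketch under assumptions: it uses a toy 3 x 3 slice of the question's data, and the names m and hits are hypothetical.

```r
# Toy data (assumption): the first three genes and samples from the question.
counts <- data.frame(
  sample1 = c(1615.75292, 41.93247, 697.97280),
  sample2 = c(663.200093, 8.602831, 1198.139790),
  sample3 = c(2406.15320, 12.62244, 1033.46252),
  row.names = c("gene1", "gene2", "gene3")
)

# Work on a plain numeric matrix so that matrix indexing is available.
m <- as.matrix(counts)
ind <- which(1 < m & m < 1000, arr.ind = TRUE)

# Indexing a matrix with a two-column (row, col) matrix picks out
# exactly those cells, one value per row of ind.
hits <- data.frame(
  gene   = rownames(m)[ind[, "row"]],
  sample = colnames(m)[ind[, "col"]],
  value  = m[ind],
  row.names = NULL
)
hits
```

This gives one row per matching cell, which can be easier to filter or export than the nested list.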

The row names of ind already show the genes; you can extract them in either of two ways:

rownames(ind)
#  [1] "gene2" "gene3" "gene4" "gene5" "gene1" "gene2" "gene5" "gene2" "gene5" "gene1" "gene2" "gene3" "gene4" "gene5"
# [15] "gene2" "gene3" "gene5"

rownames(counts)[ ind[,"row"] ]
#  [1] "gene2" "gene3" "gene4" "gene5" "gene1" "gene2" "gene5" "gene2" "gene5" "gene1" "gene2" "gene3" "gene4" "gene5"
# [15] "gene2" "gene3" "gene5"

Since you want the result per sample, you can split the gene names by sample:

split(rownames(ind), colnames(counts)[ ind[,"col"] ])
# $sample1
# [1] "gene2" "gene3" "gene4" "gene5"
# $sample2
# [1] "gene1" "gene2" "gene5"
# $sample3
# [1] "gene2" "gene5"
# $sample4
# [1] "gene1" "gene2" "gene3" "gene4" "gene5"
# $sample5
# [1] "gene2" "gene3" "gene5"
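
One thing to be aware of: split on a plain character vector only creates entries for samples that have at least one hit. A possible way to keep every sample in the result is to split by a factor whose levels are all the column names. The helper below is a sketch, not part of the original answer; genes_in_range and its arguments are hypothetical names, and the data is a toy 3 x 3 slice of the question's table.

```r
# Toy data (assumption): the first three genes and samples from the question.
counts <- data.frame(
  sample1 = c(1615.75292, 41.93247, 697.97280),
  sample2 = c(663.200093, 8.602831, 1198.139790),
  sample3 = c(2406.15320, 12.62244, 1033.46252),
  row.names = c("gene1", "gene2", "gene3")
)

# Hypothetical helper: genes per sample with lower < value < upper.
genes_in_range <- function(counts, lower, upper) {
  m <- as.matrix(counts)
  ind <- which(lower < m & m < upper, arr.ind = TRUE)
  # Using all sample names as factor levels keeps samples without any
  # hit as empty entries instead of dropping them from the list.
  sample <- factor(colnames(m)[ind[, "col"]], levels = colnames(m))
  split(rownames(m)[ind[, "row"]], sample)
}

res <- genes_in_range(counts, 600, 700)
res$sample1
# [1] "gene3"
res$sample3
# character(0)
```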

CodePudding user response：

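The double && is the likely issue: it expects a single TRUE/FALSE on each side (and skips the right-hand side if the left-hand side is FALSE) rather than comparing element by element. Use a single & instead.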

Your code can also be written more concisely:

apply(counts > 1001, 2, which) |> lapply(names)
apply(counts > 1001 & counts < 1200, 2, which) |> lapply(names)
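
A caveat with the apply version: apply simplifies its result to a vector or matrix when every column happens to yield the same number of hits, and lapply(names) then no longer returns the gene names. A sketch of an alternative that always returns a list is shown below; it assumes a toy 3 x 3 slice of the question's data, and sel and genes are hypothetical names. (In R 4.1.0 and later, apply(..., simplify = FALSE) should be another option.)

```r
# Toy data (assumption): the first three genes and samples from the question.
counts <- data.frame(
  sample1 = c(1615.75292, 41.93247, 697.97280),
  sample2 = c(663.200093, 8.602831, 1198.139790),
  sample3 = c(2406.15320, 12.62244, 1033.46252),
  row.names = c("gene1", "gene2", "gene3")
)

# Logical matrix of cells with 1001 < value < 1200.
sel <- counts > 1001 & counts < 1200

# Loop over the sample names; naming the input vector names the output list.
# lapply() never simplifies, so each sample gets its own character vector.
genes <- lapply(
  setNames(colnames(sel), colnames(sel)),
  function(s) rownames(sel)[sel[, s]]
)
genes$sample2
# [1] "gene3"
```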