I have a dataframe counts (60,660 x 1246):
sample1 sample2 sample3 sample4 sample5
gene1 1615.75292 663.200093 2406.15320 836.38076 1217.8192
gene2 41.93247 8.602831 12.62244 60.14423 22.7755
gene3 697.97280 1198.139790 1033.46252 259.37201 695.9924
gene4 678.35922 1114.457703 1281.96687 466.11782 1265.3798
gene5 365.21832 726.548215 781.80257 268.76955 476.9457
I'm trying to find the list of genes per sample that fit within a certain threshold. For example, to find the genes that have a value greater than 1001, I can use counts > 1001, which gives me a TRUE/FALSE matrix:
sample1 sample2 sample3 sample4 sample5
gene1 TRUE FALSE TRUE FALSE TRUE
gene2 FALSE FALSE FALSE FALSE FALSE
gene3 FALSE TRUE TRUE FALSE FALSE
gene4 FALSE TRUE TRUE FALSE TRUE
gene5 FALSE FALSE FALSE FALSE FALSE
I then pass this matrix to apply(true_false_matrix, 2, which) %>% lapply(\(x) names(x)) to get a list of the genes per sample that have a value greater than 1001. I would also like to find genes whose value falls within a certain range. For example, I tried:
1 < counts && counts < 5
But all I got was a single value of FALSE.
I know that there are genes meeting this requirement, so I think I'm going about finding them the wrong way. Is there a way to get a TRUE/FALSE matrix from my initial dataframe with two conditions?
CodePudding user response:
First, you need & instead of &&:
1 < counts & counts < 1000
# sample1 sample2 sample3 sample4 sample5
# gene1 FALSE TRUE FALSE TRUE FALSE
# gene2 TRUE TRUE TRUE TRUE TRUE
# gene3 TRUE FALSE FALSE TRUE TRUE
# gene4 TRUE FALSE FALSE TRUE FALSE
# gene5 TRUE TRUE TRUE TRUE TRUE
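To see the difference on something small, here is a minimal sketch that rebuilds the five genes and five samples printed in the question (this toy counts object is reconstructed from the printed values, it is not the full 60,660 x 1246 data). The single & compares element by element and keeps the matrix shape, while && is meant for single TRUE/FALSE values such as an if condition:

```r
# Toy data: the 5 x 5 excerpt printed in the question (an assumption, not the full data)
counts <- as.data.frame(matrix(
  c(1615.75292,  663.200093, 2406.15320, 836.38076, 1217.8192,
      41.93247,    8.602831,   12.62244,  60.14423,   22.7755,
     697.97280, 1198.139790, 1033.46252, 259.37201,  695.9924,
     678.35922, 1114.457703, 1281.96687, 466.11782, 1265.3798,
     365.21832,  726.548215,  781.80257, 268.76955,  476.9457),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:5))
))

# Elementwise AND: one TRUE/FALSE per cell, same shape as counts
in_range <- 1 < counts & counts < 1000
dim(in_range)

# Scalar AND: intended for length-one conditions
x <- 3
scalar_ok <- 1 < x && x < 5
```

In short, use & whenever both sides are whole vectors, matrices or data frames, and keep && for single conditions.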
Second, you can use which directly, adding arr.ind=TRUE to get the row/column indices of all the TRUE entries:
ind <- which(1 < counts & counts < 1000, arr.ind = TRUE)
ind
# row col
# gene2 2 1
# gene3 3 1
# gene4 4 1
# gene5 5 1
# gene1 1 2
# gene2 2 2
# gene5 5 2
# gene2 2 3
# gene5 5 3
# gene1 1 4
# gene2 2 4
# gene3 3 4
# gene4 4 4
# gene5 5 4
# gene2 2 5
# gene3 3 5
# gene5 5 5
The row names of ind already show the genes, but you can extract the affected genes in two ways:
rownames(ind)
# [1] "gene2" "gene3" "gene4" "gene5" "gene1" "gene2" "gene5" "gene2" "gene5" "gene1" "gene2" "gene3" "gene4" "gene5"
# [15] "gene2" "gene3" "gene5"
rownames(counts)[ ind[,"row"] ]
# [1] "gene2" "gene3" "gene4" "gene5" "gene1" "gene2" "gene5" "gene2" "gene5" "gene1" "gene2" "gene3" "gene4" "gene5"
# [15] "gene2" "gene3" "gene5"
And since you said you wanted to do something per-sample, you can do
split(rownames(ind), colnames(counts)[ ind[,"col"] ])
# $sample1
# [1] "gene2" "gene3" "gene4" "gene5"
# $sample2
# [1] "gene1" "gene2" "gene5"
# $sample3
# [1] "gene2" "gene5"
# $sample4
# [1] "gene1" "gene2" "gene3" "gene4" "gene5"
# $sample5
# [1] "gene2" "gene3" "gene5"
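One thing to watch with split: a sample with no matching genes does not appear in the result at all, because its name never occurs in the grouping vector. If you want an entry for every sample, one option is to group by a factor whose levels are all the column names. This is only a sketch on the 5 x 5 excerpt from the question, and the threshold of 2000 is picked purely for illustration so that most samples come out empty:

```r
# Toy data: the 5 x 5 excerpt printed in the question (an assumption, not the full data)
counts <- as.data.frame(matrix(
  c(1615.75292,  663.200093, 2406.15320, 836.38076, 1217.8192,
      41.93247,    8.602831,   12.62244,  60.14423,   22.7755,
     697.97280, 1198.139790, 1033.46252, 259.37201,  695.9924,
     678.35922, 1114.457703, 1281.96687, 466.11782, 1265.3798,
     365.21832,  726.548215,  781.80257, 268.76955,  476.9457),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:5))
))

# Illustrative threshold: only gene1 in sample3 exceeds 2000
ind <- which(counts > 2000, arr.ind = TRUE)

# Grouping by a factor with all samples as levels keeps the empty samples
sample_of_hit <- factor(colnames(counts)[ind[, "col"]], levels = colnames(counts))
by_sample <- split(rownames(ind), sample_of_hit)
by_sample
```

Samples without hits then show up as character(0) instead of being dropped, which is easier to handle if you later loop over all 1246 samples.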
CodePudding user response:
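The double && is the issue: it is not vectorized. It is meant for single TRUE/FALSE values, skips the RHS when the LHS is FALSE, and returns only one value, which is why you got a single FALSE (recent versions of R raise an error for longer inputs instead). Use a single & for an elementwise comparison.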
You can also write your code more concisely:
apply(counts > 1001, 2, which) |> lapply(names)
apply(counts > 1001 & counts < 1200, 2, which) |> lapply(names)
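A caveat with the apply approach: apply simplifies its result when every column yields the same number of hits, so you can get a vector or matrix back instead of a list, and the gene names are then lost in the lapply step. Below is a sketch of a small wrapper that always returns a named list; genes_in_range is a hypothetical helper name, not an existing function, and it is shown on the 5 x 5 excerpt from the question:

```r
# Toy data: the 5 x 5 excerpt printed in the question (an assumption, not the full data)
counts <- as.data.frame(matrix(
  c(1615.75292,  663.200093, 2406.15320, 836.38076, 1217.8192,
      41.93247,    8.602831,   12.62244,  60.14423,   22.7755,
     697.97280, 1198.139790, 1033.46252, 259.37201,  695.9924,
     678.35922, 1114.457703, 1281.96687, 466.11782, 1265.3798,
     365.21832,  726.548215,  781.80257, 268.76955,  476.9457),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:5))
))

# Hypothetical helper: genes with lower < value < upper, per sample
genes_in_range <- function(counts, lower, upper) {
  hit <- counts > lower & counts < upper        # logical matrix with dimnames
  # lapply over the columns always returns a list, whatever the hit counts are;
  # which() also drops any NA comparisons
  lapply(as.data.frame(hit), function(col) rownames(hit)[which(col)])
}

res_wide <- genes_in_range(counts, 1, 1000)
# Every sample has exactly one hit here (gene2), the case where apply would simplify
res_low <- genes_in_range(counts, 0, 100)
```

Both bounds are strict here to match the question; change > and < to >= and <= if the endpoints should be included.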