I am working with RNA-seq data. I want to filter out genes where there are fewer than N counts in both replicates in at least one of my treatment groups.
My data is in a DESeq object, and the count data is structured like this, where each row is the gene, and each column a different sample. Sample names have the structure X|N|A|1/2 (where X is the cell line used, N is a 1 or 2 digit number reflecting treatment length, A is a letter representing the treatment group, and 1 or 2 indicates which replicate.
X1A1 <- c(117, 24, 45, 146, 1)
X1A2 <- c(129, 31, 58, 159, 0)
X1B1 <- c(136, 25, 50, 1293, 0)
X1B2 <- c(131, 24, 50, 1073, 4)
X1C1 <- c(113, 23, 43, 132, 0)
X1C2 <- c(117, 18, 43, 126, 0)
X1D1 <- c(101, 20, 0, 875, 1)
X1D2 <- c(99, 21, 38 , 844, 0)
X24A1 <- c(109, 17, 60, 95, 0)
X24A2 <- c(122, 14, 611, 90, 0)
df <- data.frame(X1A1, X1A2, X1B1, X1B2, X1C1, X1C2, X1D1, X1D2, X24A1, X24A2)
rownames(df) <- c("geneA", "geneB", "geneC", "geneD", "geneE")
df
Maybe I'm just not using the right search terms, but I can't figure out how to get what I need.
Right now I only know how to filter out genes that are not expressed below some threshold in all samples. For example, filtering out genes which are not expressed at all.
keep1 <- rowSums(df) > 1
df1 <- df[keep1,]
What I want is to refine this so that I would end up discarding geneE in my example set, because no group has counts above 0 for both replicates.
df2 <- df[1:4,]
df2
CodePudding user response:
The logic should be applied on the 'df' itself to create a logical matrix, then when we do rowSums
, it counts the number of TRUE (or 1) values, then use that to do the second condition i.e. at least more than one TRUE (> 1
)
df[rowSums(df > 1) > 1,]
-output
X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
geneA 117 129 136 131 113 117 101 99 109 122
geneB 24 31 25 24 23 18 20 21 17 14
geneC 45 58 50 50 43 43 0 38 60 611
geneD 146 159 1293 1073 132 126 875 844 95 90
CodePudding user response:
Here is another option. If your data is always structured the way you show, then this function will evaluate that all replicates have at least 1 value greater than N.
library(tidyverse)
#example Data
df
#> X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA 117 129 136 131 113 117 101 99 109 122
#> geneB 24 31 25 24 23 18 20 21 17 14
#> geneC 45 58 50 50 43 43 0 38 60 611
#> geneD 146 159 1293 1073 132 126 875 844 95 90
#> geneE 1 0 0 4 0 0 1 0 0 0
filter_fewer <- function(data, N){
gene_list <- data |>
rownames_to_column("gene") |>
pivot_longer(cols = -gene,
names_to = c("var", "type", "rep"),
names_pattern = "(\\w\\d )(\\w)(\\d )") |>
pivot_wider(names_from = rep, values_from = value, names_glue = "rep_{rep}") |>
group_by(gene, var, type) |>
mutate(flag = if_any(c(rep_1, rep_2), ~.>N)) |>
ungroup() |>
group_by(gene) |>
filter(sum(flag) == n()) |>
pull(gene)|>
unique()
filter(data, row.names(data) %in% gene_list)
}
#your example
filter_fewer(df, 1)
#> X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA 117 129 136 131 113 117 101 99 109 122
#> geneB 24 31 25 24 23 18 20 21 17 14
#> geneC 45 58 50 50 43 43 0 38 60 611
#> geneD 146 159 1293 1073 132 126 875 844 95 90
#remove geneB because neither A24A rep 1 or 2 are > 20
filter_fewer(df, 20)
#> X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA 117 129 136 131 113 117 101 99 109 122
#> geneC 45 58 50 50 43 43 0 38 60 611
#> geneD 146 159 1293 1073 132 126 875 844 95 90