rowSums if at least 1 group (set of columns) has greater than N counts in all replicates-CodePudding

I am working with RNA-seq data. I want to filter out genes where there are fewer than N counts in both replicates in at least one of my treatment groups.

My data is in a DESeq object, and the count data is structured like this, where each row is the gene, and each column a different sample. Sample names have the structure X|N|A|1/2 (where X is the cell line used, N is a 1 or 2 digit number reflecting treatment length, A is a letter representing the treatment group, and 1 or 2 indicates which replicate.

X1A1 <- c(117, 24, 45, 146, 1)
X1A2 <- c(129, 31, 58, 159, 0)
X1B1 <- c(136, 25, 50, 1293, 0)
X1B2 <- c(131, 24, 50, 1073, 4)
X1C1 <- c(113, 23, 43, 132, 0)
X1C2 <- c(117, 18, 43, 126, 0)
X1D1 <- c(101, 20, 0, 875, 1)
X1D2 <- c(99, 21, 38 , 844, 0)
X24A1 <- c(109, 17, 60, 95, 0)
X24A2 <- c(122, 14, 611, 90, 0)

df <- data.frame(X1A1, X1A2, X1B1, X1B2, X1C1, X1C2, X1D1, X1D2, X24A1, X24A2)
rownames(df) <- c("geneA", "geneB", "geneC", "geneD", "geneE")
df

Maybe I'm just not using the right search terms, but I can't figure out how to get what I need.

Right now I only know how to filter out genes that are not expressed below some threshold in all samples. For example, filtering out genes which are not expressed at all.

keep1 <- rowSums(df) > 1
df1 <- df[keep1,]

What I want is to refine this so that I would end up discarding geneE in my example set, because no group has counts above 0 for both replicates.

df2 <- df[1:4,]
df2

CodePudding user response：

The logic should be applied on the 'df' itself to create a logical matrix, then when we do rowSums, it counts the number of TRUE (or 1) values, then use that to do the second condition i.e. at least more than one TRUE (> 1)

df[rowSums(df > 1) > 1,]

-output

      X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
geneA  117  129  136  131  113  117  101   99   109   122
geneB   24   31   25   24   23   18   20   21    17    14
geneC   45   58   50   50   43   43    0   38    60   611
geneD  146  159 1293 1073  132  126  875  844    95    90

CodePudding user response：

Here is another option. If your data is always structured the way you show, then this function will evaluate that all replicates have at least 1 value greater than N.

library(tidyverse)

#example Data
df
#>       X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA  117  129  136  131  113  117  101   99   109   122
#> geneB   24   31   25   24   23   18   20   21    17    14
#> geneC   45   58   50   50   43   43    0   38    60   611
#> geneD  146  159 1293 1073  132  126  875  844    95    90
#> geneE    1    0    0    4    0    0    1    0     0     0


filter_fewer <- function(data, N){
  gene_list <- data |>
    rownames_to_column("gene") |>
    pivot_longer(cols = -gene, 
                 names_to = c("var", "type", "rep"),
                 names_pattern = "(\\w\\d )(\\w)(\\d )") |>
    pivot_wider(names_from = rep, values_from = value, names_glue = "rep_{rep}") |>
    group_by(gene, var, type) |>
    mutate(flag = if_any(c(rep_1, rep_2), ~.>N)) |>
    ungroup() |>
    group_by(gene) |>
    filter(sum(flag) == n()) |>
    pull(gene)|>
    unique()

  filter(data, row.names(data) %in% gene_list)
}

#your example
filter_fewer(df, 1)
#>       X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA  117  129  136  131  113  117  101   99   109   122
#> geneB   24   31   25   24   23   18   20   21    17    14
#> geneC   45   58   50   50   43   43    0   38    60   611
#> geneD  146  159 1293 1073  132  126  875  844    95    90

#remove geneB because neither A24A rep 1 or 2 are > 20
filter_fewer(df, 20)
#>       X1A1 X1A2 X1B1 X1B2 X1C1 X1C2 X1D1 X1D2 X24A1 X24A2
#> geneA  117  129  136  131  113  117  101   99   109   122
#> geneC   45   58   50   50   43   43    0   38    60   611
#> geneD  146  159 1293 1073  132  126  875  844    95    90